Open taylorreiter opened 2 years ago
It is parameterized! Need to evaluate (see section in readme).
Idea: make synthetic reads and test the salmon k parameter. Probably easiest to take a GTDB rep genome, cut out its protein sequences, introduce variation, and simulate reads for different levels of variation. Then test at different k sizes.
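A rough sketch of the "introduce variation, then test at different k" step, to get a feel for how quickly exact k-mer matches disappear as divergence grows. Everything here is an assumption for illustration: a random sequence stands in for the GTDB gene sequences, variation is substitutions only (no indels), and the helper names (`mutate`, `shared_kmer_fraction`) are hypothetical. It does not simulate reads or run salmon.

```python
import random

def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Always pick a different base so every hit is a real change.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def kmers(seq, k):
    """Set of all k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_fraction(ref, variant, k):
    """Fraction of the reference k-mers that survive in the variant."""
    ref_kmers = kmers(ref, k)
    return len(ref_kmers & kmers(variant, k)) / len(ref_kmers)

rng = random.Random(42)
# Placeholder reference; in practice this would be the extracted sequences.
reference = "".join(rng.choice("ACGT") for _ in range(20_000))

for rate in (0.01, 0.05, 0.10):
    variant = mutate(reference, rate, rng)
    for k in (21, 31, 51):
        frac = shared_kmer_fraction(reference, variant, k)
        print(f"divergence={rate:.2f} k={k} shared k-mers={frac:.3f}")
```

Under this independent-substitution model a k-mer survives with probability of roughly (1 - rate)^k, so the shared fraction should fall off faster at larger k. Real divergence is not uniform along a gene, so this is only a lower-effort sanity check before simulating reads.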
k = 21 ~= genus
k = 31 ~= species
k = 51 ~= strain
If we do best RNA-seq sample <-> reference matching at k = 31, check if decreasing salmon k-mer size (if it's parameterized) increases mapping rates for RNA-seq samples that are distantly related to their reference, without increasing off-target mappings for RNA-seq samples that are closely related to their reference.
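One way the comparison could be scripted, as a sketch only. It assumes salmon is run directly rather than through this repo's workflow, uses `salmon index` with its `-t`, `-i`, and `-k` options, and reads the `percent_mapped` field that salmon writes to `aux_info/meta_info.json` in each quant output directory. The function names and the `_k{k}` directory naming are hypothetical. Note that mapping rate alone does not measure off-target mappings; with synthetic reads that would need the known source of each read.

```python
import json
from pathlib import Path

def index_commands(transcripts_fasta, index_prefix, ks):
    """Build one `salmon index` command per k-mer size (not executed here)."""
    return [
        ["salmon", "index", "-t", transcripts_fasta,
         "-i", f"{index_prefix}_k{k}", "-k", str(k)]
        for k in ks
    ]

def mapping_rate(quant_dir):
    """Read the percent of reads mapped from a salmon quant output directory."""
    meta = json.loads((Path(quant_dir) / "aux_info" / "meta_info.json").read_text())
    return meta["percent_mapped"]

# Example: print the commands for two k sizes; run them with subprocess.run
# once the FASTA path points at a real reference.
for cmd in index_commands("reference_genes.fa", "salmon_index", [21, 31]):
    print(" ".join(cmd))
```

After quantifying each sample against each index, `mapping_rate` can be called on every output directory to tabulate mapping rate by k and by sample-to-reference distance.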
May be able to flexibly select k-mer sizes based on the % of the RNA-seq sample that was identified in the database via gather, although let's hope that is unnecessary, as it strikes me as difficult to do and difficult to communicate.