greenelab / 2022-microberna

A pipeline to generate a compendium of bacterial and archaeal RNA-seq data
BSD 3-Clause "New" or "Revised" License

investigate whether salmon k-mer size is parameterized #2

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

k = 21 ~= genus
k = 31 ~= species
k = 51 ~= strain

If we do the best RNA-seq sample <-> reference matching at k = 31, check whether decreasing the salmon k-mer size (if it's parameterized) increases mapping rates for RNA-seq samples that are distantly related to their reference, without increasing off-target mappings for RNA-seq samples that are closely related to their reference.

We may be able to flexibly select k-mer sizes based on the % of each RNA-seq sample that was identified in the database via gather, although let's hope that is unnecessary, as it strikes me as difficult to implement and difficult to communicate.
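
A rough sketch of what per-sample k selection could look like, in case it does turn out to be necessary. The cutoffs (0.9, 0.5) and the function name `choose_k` are made up for illustration and have not been evaluated; the input is assumed to be the fraction of the sample that gather matched to the database.

```python
def choose_k(fraction_matched, thresholds=((0.9, 31), (0.5, 25)), fallback_k=21):
    """Pick a salmon k-mer size from the fraction of a sample matched by gather.

    The thresholds are hypothetical placeholders, not tuned values.
    """
    if not 0.0 <= fraction_matched <= 1.0:
        raise ValueError("fraction_matched must be between 0 and 1")
    # Thresholds are checked from strictest to loosest; the first one met wins.
    for min_fraction, k in thresholds:
        if fraction_matched >= min_fraction:
            return k
    # Poorly matched samples fall back to the smallest k.
    return fallback_k
```

A well-matched sample keeps k = 31, and a poorly matched one drops to the smaller fallback k.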

taylorreiter commented 2 years ago

It is parameterized! Need to evaluate (see section in README).
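
For the evaluation, a minimal sketch of building one index per k via the `-k` option of `salmon index`. The file names and the helper `salmon_index_commands` are placeholders, and the check assumes salmon wants an odd k of at most 31 (worth confirming against the installed version). It only builds the command lists; nothing is executed.

```python
from pathlib import Path


def salmon_index_commands(transcripts, out_dir, ks=(21, 25, 31)):
    """Return one `salmon index` argument list per k-mer size."""
    commands = []
    for k in ks:
        # Assumption: salmon requires an odd k no larger than 31.
        if k % 2 == 0 or k > 31:
            raise ValueError(f"unsupported k for salmon: {k}")
        index_dir = Path(out_dir) / f"k{k}"
        commands.append(
            ["salmon", "index", "-t", str(transcripts), "-i", str(index_dir), "-k", str(k)]
        )
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)` to build the index for that k.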

taylorreiter commented 2 years ago

Idea: make synthetic reads and test the salmon k parameter. Probably easiest to take a GTDB rep genome, cut out its protein sequences, introduce variation, and simulate reads for different levels of variation. Then test at different k sizes.
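
A toy sketch of the simulation step, assuming substitutions only (no indels) and error-free, fixed-length, single-end reads. The function names are placeholders; a real test would probably use a dedicated read simulator instead.

```python
import random


def mutate(seq, rate, rng):
    """Substitute each base with probability `rate` (substitutions only)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Replace with one of the three other nucleotides.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(seq, read_len, n_reads, rng):
    """Sample error-free reads of fixed length uniformly along `seq`."""
    if len(seq) < read_len:
        raise ValueError("sequence is shorter than the read length")
    starts = [rng.randrange(len(seq) - read_len + 1) for _ in range(n_reads)]
    return [seq[s:s + read_len] for s in starts]
```

Running `mutate` at several rates (for example 0.01, 0.05, 0.10) on the same gene sequences and then quantifying the simulated reads against the unmutated reference at each k would give mapping rate as a function of divergence and k.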