pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

how to implement EM using kb? #51

Open jc271828 opened 1 year ago

jc271828 commented 1 year ago

Hi,

I was wondering whether, and how, I can choose between the EM algorithm and the "simpler" multimapping option that distributes reads evenly across genes when counting reads with kb. Also, because my experiment used 10x Genomics technology (which captures sequence adjacent to the polyA tail), I assume the reads are strongly 3'-end biased? If so, I also wonder whether the EM algorithm can accurately assign reads that map to both Gene A's 3' end and Gene B's 5' end. As I picture it, those reads are more likely to come from Gene A transcripts. Thanks for your time!

Jingxian

Yenaled commented 1 year ago

Yeah, you can choose those options (see kb count, which supports both). I haven't really seen a benefit from them, though: with everything coming from the 3' end, you can't really resolve ambiguities the way you can with bulk data.
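To make the difference between the two options concrete, here is a minimal toy sketch in Python. It is not the kb or bustools implementation: the equivalence-class counts, the gene names, and the functions `split_uniform` and `split_em` are all hypothetical, and the EM shown here ignores transcript lengths and everything else a real model would include.

```python
from collections import defaultdict

def split_uniform(ec_counts):
    """Divide each equivalence-class count evenly among its genes."""
    totals = defaultdict(float)
    for genes, count in ec_counts.items():
        share = count / len(genes)
        for gene in genes:
            totals[gene] += share
    return dict(totals)

def split_em(ec_counts, n_iter=100):
    """Toy EM: split ambiguous counts in proportion to current gene abundances."""
    all_genes = sorted({gene for genes in ec_counts for gene in genes})
    abundance = {gene: 1.0 for gene in all_genes}  # flat starting guess
    for _ in range(n_iter):
        updated = {gene: 0.0 for gene in all_genes}
        for genes, count in ec_counts.items():
            denom = sum(abundance[gene] for gene in genes)
            if denom == 0:
                continue  # no gene in this class has any support
            for gene in genes:
                # E-step and M-step combined: expected counts become the new abundances
                updated[gene] += count * abundance[gene] / denom
        abundance = updated
    return abundance

# Hypothetical counts: 90 reads unique to geneA, 10 unique to geneB,
# and 20 reads compatible with both genes.
ec_counts = {
    ("geneA",): 90,
    ("geneB",): 10,
    ("geneA", "geneB"): 20,
}

uniform = split_uniform(ec_counts)  # geneA: 100.0, geneB: 20.0
em = split_em(ec_counts)            # geneA: ~108, geneB: ~12
print(uniform)
print({gene: round(value, 3) for gene, value in em.items()})
```

In this toy case the EM only moves the ambiguous reads toward geneA because geneA already has many uniquely assigned reads. When unique evidence is scarce, as with 3'-end data, the two approaches have little to distinguish them.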

As for your question about whether the EM algorithm can take mapping position into account (Gene A's 3' end vs. Gene B's 5' end), no, that is not supported. Many things would need to be considered for such a model to work (internal polyA tracts, the distribution of mapping locations, modeling fragments, etc.), and we're unsure how much value we'd actually gain from fitting such models. We hope to look into it at some point, though.

jc271828 commented 1 year ago

Thank you for such a timely response! That makes sense. I guess how much benefit can be gained from developing a better-fitting model may partially depend on how much genes overlap or sit adjacent to each other in the reference genome. I'm working with C. elegans, and in the annotation version I'm currently using, over 10% of genes overlap. Hmm... so I guess I won't worry about this too much for now, but I really look forward to seeing future workarounds for this!
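For anyone who wants to check how much overlap their own annotation has, a rough sketch of the calculation might look like the following. The function name `fraction_overlapping` and the coordinates are made up for illustration, the intervals are assumed to be inclusive (as in GTF), and strand is ignored, so the result is only an upper-level estimate.

```python
from itertools import groupby

def fraction_overlapping(genes):
    """Return the fraction of genes that overlap at least one other gene.

    genes: iterable of (chrom, start, end) tuples with inclusive coordinates.
    Strand is ignored in this sketch.
    """
    genes = sorted(genes)  # sort by chromosome, then start
    if not genes:
        return 0.0
    n_overlapping = 0
    for _, group in groupby(genes, key=lambda g: g[0]):
        group = list(group)
        max_end_before = None  # largest end among earlier genes on this chromosome
        for i, (_, start, end) in enumerate(group):
            overlaps_earlier = max_end_before is not None and start <= max_end_before
            # The next gene has the smallest start among all later genes,
            # so checking it alone is enough.
            overlaps_later = i + 1 < len(group) and group[i + 1][1] <= end
            if overlaps_earlier or overlaps_later:
                n_overlapping += 1
            max_end_before = end if max_end_before is None else max(max_end_before, end)
    return n_overlapping / len(genes)

# Hypothetical gene intervals, not taken from any real annotation.
example_genes = [
    ("I", 1, 100),
    ("I", 50, 150),   # overlaps the first gene
    ("I", 200, 300),
    ("II", 1, 50),
]
print(fraction_overlapping(example_genes))  # 0.5
```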