timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License
112 stars 24 forks source link

filterbarcodes drops unmapped reads #55

Closed joycekang closed 1 year ago

joycekang commented 1 year ago

Hi Tim,

Thank you so much for working on this tool! We have been using the filterbarcodes function to demultiplex bam files (to rerun the read alignment step, starting from the output bam from CellRanger). For our particular application, we are interested in unmapped reads, but unfortunately the filterbarcodes function seems to not carry over the unmapped reads after demultiplexing. (ie. the input bam file has unmapped reads, but the resulting demultiplexed bams have 0 unmapped reads). Is it possible to tweak the function to have an option to keep unmapped reads? We would greatly appreciate it!

Thanks again, Joyce

timoast commented 1 year ago

Hi @joycekang, are you sure there are reads with a matching cell barcode that are unmapped? I tried with a cellranger bam file and can see unmapped reads in the split bam file after running filterbarcodes

joycekang commented 1 year ago

Hi @timoast, I do think so… for example, here are the first few lines of the barcode mapping file I'm using:

image

If we just take the top barcode, I can grep for it and find it matches unmapped reads in the multiplexed bam file (it has the unmapped flag of 4): image

After I run filterbarcodes, the file for the individual in this example HMN83554_flu has no unmapped reads. image

It's encouraging that it works for you though - Could it be that sinto may have been updated at some point to handle unmapped reads?

joycekang commented 1 year ago

I'm not sure if this is helpful, but this post seems to say that pysam's bam.fetch() function might need an until_eof=True to grab unmapped reads?

https://www.biostars.org/p/361215/

https://pysam.readthedocs.io/en/latest/faq.html#alignmentfile-fetch-does-not-show-unmapped-reads

timoast commented 1 year ago

Ok thanks, I didn't know that. Also because of the way we parallelize by splitting into genomic regions, that might be causing unmapped reads to get skipped as well. I'll look into it some more and let you know

timoast commented 1 year ago

Ok I think this should be fixed now, could you install from the develop branch and let me know if it's working for you?

joycekang commented 1 year ago

Hi Tim,

It seems to be working great - I'm now able to see unmapped reads. Thanks so much for the speedy fix, really appreciate it!

Joyce