timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License
112 stars, 24 forks

Regarding utils::chunk_bam() #40

Closed: hukai916 closed this issue 2 years ago

hukai916 commented 2 years ago

Hi developers,

I understand that the chunk_bam() function splits the genome into multiple intervals for multiprocessing.

Basically, each parallel task calls pysam's fetch() to retrieve all the reads that map to the supplied interval. My concern is that a read overlapping more than one interval will be fetched more than once, by different parallel jobs. Will those reads be double counted?
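To make the concern concrete, here is a minimal sketch in plain Python. It does not use pysam or sinto; the reads, the chunk boundaries, and the helpers `fetch` and `fetch_owned` are all hypothetical. It only mimics an overlap-based fetch, and it shows one common way to avoid double counting (keeping a read only in the chunk that contains its start position), which is not necessarily what sinto does.

```python
# Toy model, not pysam: each read is (name, start, end), half-open, on one chromosome.
reads = [("r1", 10, 60), ("r2", 90, 140), ("r3", 150, 200)]

# Two adjacent chunks of the same chromosome; r2 spans the boundary at 100.
chunks = [(0, 100), (100, 200)]


def fetch(reads, start, end):
    """Return every read overlapping [start, end), like an overlap-based fetch."""
    return [r for r in reads if r[1] < end and r[2] > start]


def fetch_owned(reads, start, end):
    """Keep only overlapping reads whose start position lies inside the chunk."""
    return [r for r in fetch(reads, start, end) if start <= r[1] < end]


# Summing per-chunk results counts the boundary-spanning read twice.
naive_total = sum(len(fetch(reads, s, e)) for s, e in chunks)

# Assigning each read to the chunk that holds its start counts it once.
owned_total = sum(len(fetch_owned(reads, s, e)) for s, e in chunks)

print(naive_total, owned_total)
```

With three reads in total, the naive sum reports four because `r2` is returned for both chunks, while the start-position filter reports three.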

Please let me know if this is a valid concern or not based on your experience. Really appreciate it!

timoast commented 2 years ago

Do you mean for the fragments function? In fragments we don't use utils::chunk_bam(), for this reason. We only separate reads by chromosome for multiprocessing.
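The idea of splitting work by chromosome can be sketched as follows. This is an illustration only, not sinto's code: the reads, `partition_by_chromosome`, and `count_reads` are hypothetical, and a thread pool stands in for a process pool so the snippet runs as-is. Since each alignment record is assigned to a single chromosome, every read lands in exactly one partition.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy reads: (name, chromosome, start, end). Hypothetical data.
reads = [
    ("r1", "chr1", 10, 60),
    ("r2", "chr1", 90, 140),
    ("r3", "chr2", 5, 55),
]


def partition_by_chromosome(reads):
    """Group reads by chromosome; each read goes into exactly one group."""
    by_chrom = defaultdict(list)
    for read in reads:
        by_chrom[read[1]].append(read)
    return dict(by_chrom)


def count_reads(chrom_reads):
    """Stand-in for the per-chromosome work done by one worker."""
    return len(chrom_reads)


partitions = partition_by_chromosome(reads)

# One task per chromosome; threads are used here only to keep the sketch simple.
with ThreadPoolExecutor() as pool:
    counts = dict(zip(partitions, pool.map(count_reads, partitions.values())))

total = sum(counts.values())
print(counts, total)
```

Because the partitions do not share any reads, the per-chromosome counts add up to the number of reads without any deduplication step.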

hukai916 commented 2 years ago

Good to know, that addresses my concern. Thanks.