kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/
222 stars 24 forks source link

Read bam with split reads #259

Open lauradmartens opened 6 months ago

lauradmartens commented 6 months ago

Hi, first of all thanks a lot for this great tool!

I was wondering if it would be possible to have the make fragment files function from bam files also consider split reads in addition to clipped reads? Do you think this would be difficult to implement?

Thanks a lot for your help!

kaizhang commented 6 months ago

I'm not sure how to treat split reads. Suppose we have split reads as follows:

Read1: chr1, 200, 400 Read2: chr1, 500, 600 Split read: chr1, 2000, 2100

What are the fragments supposed to be generated here?

image

lauradmartens commented 6 months ago

That's a good point! I guess I thought of it rather in the context of RNA coverage rather than ATAC because it would be nice to also use your storage format + functionality for that.

manzt commented 6 months ago

Hi there. I worked with @lauradmartens on this the other day and we came up with a proof of concept https://github.com/kaizhang/SnapATAC2/compare/main...manzt:SnapATAC2:split-reads

This seems to work well for our use case: generating fragments .bed for the RNA coverage, and then loading the fragments in AnnData with import_fragments. Props to the API design that this just works! Would you have any interest in supporting in extending the current API to support our use case? We would like to contribute if possible, since having this functionality in snapatac2 (python) would be great for some of our collaborators.

It seems specialized enough that I'm not sure whether adding a flag to make_fragment_file (e.g., expand_split_reads) or introducing a specialized function (e.g., make_rna_coverage_fragment_file) would be desirable. Thanks for your consideration!

kaizhang commented 6 months ago

Thanks! I'll take a look at your implementation this weekend, and decide how to proceed.

kaizhang commented 6 months ago

The approach you are taking works when the reads are single-ended and your goal is to generate bigwig coverage file. To make it works for paired-end reads as well and to be compatible with other functions, I'm thinking to store these reads in a new slot in obsm, e.g., 'split_reads'. The downstream functions can then choose to use it or not.