Application of SnapATAC2 on Paired-tag data

yoshihiko1218 commented 2 years ago

Hi Dr.Zhang,

I am working on paired-tag data and as recommended, trying to analyze paired-tag DNA sequencing data using snapATAC. I wonder how I can load the results of paired-tag data("barcodes.tsv" barcode of cells,"genes.tsv" annotation of bins,"matrix.mtx" cell times bin matrix) into snapATAC?

Thank you so much!

kaizhang commented 2 years ago

I just added a function called read_10x_mtx to SnapATAC2 for this purpose. Please read the description here: https://kzhang.org/SnapATAC2/_autosummary/snapatac2.read_10x_mtx.html#snapatac2.read_10x_mtx.

To use this feature, you have to install the developmental version as I haven't uploaded the latest changes to Pypi. I will probably do so later this month.

Alternatively, you may be able to use the scanpy library to read the 10x mtx file and save it as a .h5ad file, which can then be opened in SnapATAC2.

yoshihiko1218 commented 2 years ago

Thank you so much!!!!!

yoshihiko1218 commented 2 years ago

Hi, I wonder how the next step should be done. Since when I try to run "filter_cells", I get error saying there is no fragment information, I wonder if I should calculate those information. Is there any function to calculate those, or I should just manually add up whole columns for each cell and manually add results to obs?

yoshihiko1218 commented 2 years ago

Also, when I try to run DNA = snap.create_dataset(dataset, storage = "DNA_Combined.h5ads") to combined multiple h5ad files generated with read_10x_mtx, I got error thread '<unnamed>' panicked at 'called 'Result::unwrap()' on an 'Err' value: var names mismatch', /home/jmj7858/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/27cedd1/pyanndata/src/anndata.rs:727:76 Traceback (most recent call last): File "<stdin>", line 1, in <module> pyo3_runtime.PanicException: called 'Result::unwrap()' on an 'Err' value: var names mismatch What I might miss?

kaizhang commented 2 years ago

Also, when I try to run DNA = snap.create_dataset(dataset, storage = "DNA_Combined.h5ads") to combined multiple h5ad files generated with read_10x_mtx, I got error thread '<unnamed>' panicked at 'called 'Result::unwrap()' on an 'Err' value: var names mismatch', /home/jmj7858/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/27cedd1/pyanndata/src/anndata.rs:727:76 Traceback (most recent call last): File "<stdin>", line 1, in <module> pyo3_runtime.PanicException: called 'Result::unwrap()' on an 'Err' value: var names mismatch What I might miss?

Please check whether every h5ad file has the same var_names.

kaizhang commented 2 years ago

Hi, I wonder how the next step should be done. Since when I try to run "filter_cells", I get error saying there is no fragment information, I wonder if I should calculate those information. Is there any function to calculate those, or I should just manually add up whole columns for each cell and manually add results to obs?

You won't be able to use filter_cells if you import the data from mtx files. The reason is that there is no QC metric available. QC metrics are computed during the fragment file importing procedure.

filter_cells uses two QC metrics to filter the cells: TSS enrichment and the number of reads. You can calculate the latter by yourself. For TSS enrichment, it may not be suitable for paired-tag data. We typically do not use TSSe to filter the cells in paired-Tag analysis. We are still trying to figure out the alternatives.

I also recommend you generate a fragment file for paired-tag data and start from there, since most downstream analyses will need this information. I know paired-tag doesn't give you fragments, but you can use single-ended insertion in this case. Here is an example:

chr1    3000132 3000233 fn:42:12:06     35      -
chr1    3000164 3000232 bu:0R:0F:08     36      -
chr1    3000227 3000285 eu:93:55:05     40      +
chr1    3000228 3000278 eu:93:55:05     38      +
chr1    3000228 3000329 ek:90:48:06     42      +
chr1    3000584 3000623 aw:77:83:05     42      -
chr1    3000612 3000649 av:L0:U1:06     40      -

1st column: chromosome 2nd column: start 3rd column: end 4th column: name of barcode 5th column: duplication count 6th column: strand

Note the 6th column doesn't exist in normal fragment files. It indicates the data is single-ended and tells the program to handle such data differently.

yoshihiko1218 commented 2 years ago

Yes. I realized that they all have different var_names. I am not sure why it's happening but gonna check back preprocessing steps. You mean manually generating fragment files using matrix right? For example:

chr10 10000000 10005000 01:20:05 0 +
chr10 10000000 10005000 01:21:02 0 +  
chr10 10000000 10005000 01:24:03 5 +  
chr10 10000000 10005000 01:27:01 10 +  
chr10 100000000 100005000 01:20:05 3 +  
chr10 100000000 100005000 01:21:02 16 +  
chr10 100000000 100005000 01:24:03 0 +  
chr10 100000000 100005000 01:27:01 0 +

In that case, how should I determine the 6th? Like, for this position, between + and -, which one should I use?

Also, If I do preprocessing step by making fragment files instead of using read_10x_mtx, will inconsistent var_names still be the problem? I mapped them to the same reference and separated them to the same 5k bins, so I am not sure why var_names differed. But if it is still a problem, I will go back to check why this is happening.

kaizhang commented 2 years ago

Fragment files need to be generated from bam files. Once you have the fragment files, you can use snapatac2 to generate the cell by bin matrix and streamline the downstream analyses.

Currently snapatac2 cannot convert bam file to fragment file. I will work on that this weekend.

kaizhang commented 2 years ago

I've added a function for this: https://kzhang.org/SnapATAC2/_autosummary/snapatac2.pp.make_fragment_file.html.

It is usable but not quite finished yet. You can track the progress here: #1 . Here is an example usage (for Paired-Tag data):

import snapatac2 as snap
snap.pp.make_fragment_file(
    "DNA_H3K4me3.bam",
    "output.bed.gz",
    is_paired = False,
    barcode_regex = "(..:..:..:..):\w+$",
    umi_regex = ":(\w+$)",
)

Note:

Current implementation will use quite a lot of memory. This will be fixed once it switchs to out-of-core sorting.
It will mark some reads as duplicates even if the input has been deduplicated by reach-tools. Because the deduplication algorithm is more stringent than reach-tools'. make_fragment_file uses the same algorithm as samtools markdup and picard do.

yoshihiko1218 commented 1 year ago

Thank you so much! I will check that.

qijt123 commented 1 year ago

@yoshihiko1218 Have you generated a fragment file successfully? Because I am also dealing with Paired-tag data, I have the same problem as you.

kaizhang / SnapATAC2

Application of SnapATAC2 on Paired-tag data #66