Open yoshihiko1218 opened 2 years ago
I just added a function called read_10x_mtx
to SnapATAC2 for this purpose. Please read the description here: https://kzhang.org/SnapATAC2/_autosummary/snapatac2.read_10x_mtx.html#snapatac2.read_10x_mtx.
To use this feature, you have to install the developmental version as I haven't uploaded the latest changes to Pypi. I will probably do so later this month.
Alternatively, you may be able to use the scanpy library to read the 10x mtx file and save it as a .h5ad file, which can then be opened in SnapATAC2.
Thank you so much!!!!!
Hi, I wonder how the next step should be done. Since when I try to run "filter_cells", I get error saying there is no fragment information, I wonder if I should calculate those information. Is there any function to calculate those, or I should just manually add up whole columns for each cell and manually add results to obs?
Also, when I try to run DNA = snap.create_dataset(dataset, storage = "DNA_Combined.h5ads")
to combined multiple h5ad files generated with read_10x_mtx, I got error thread '<unnamed>' panicked at 'called 'Result::unwrap()' on an 'Err' value: var names mismatch', /home/jmj7858/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/27cedd1/pyanndata/src/anndata.rs:727:76 Traceback (most recent call last): File "<stdin>", line 1, in <module> pyo3_runtime.PanicException: called 'Result::unwrap()' on an 'Err' value: var names mismatch
What I might miss?
Also, when I try to run
DNA = snap.create_dataset(dataset, storage = "DNA_Combined.h5ads")
to combined multiple h5ad files generated with read_10x_mtx, I got errorthread '<unnamed>' panicked at 'called 'Result::unwrap()' on an 'Err' value: var names mismatch', /home/jmj7858/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/27cedd1/pyanndata/src/anndata.rs:727:76 Traceback (most recent call last): File "<stdin>", line 1, in <module> pyo3_runtime.PanicException: called 'Result::unwrap()' on an 'Err' value: var names mismatch
What I might miss?
Please check whether every h5ad file has the same var_names
.
Hi, I wonder how the next step should be done. Since when I try to run "filter_cells", I get error saying there is no fragment information, I wonder if I should calculate those information. Is there any function to calculate those, or I should just manually add up whole columns for each cell and manually add results to obs?
You won't be able to use filter_cells
if you import the data from mtx files. The reason is that there is no QC metric available. QC metrics are computed during the fragment file importing procedure.
filter_cells
uses two QC metrics to filter the cells: TSS enrichment and the number of reads. You can calculate the latter by yourself. For TSS enrichment, it may not be suitable for paired-tag data. We typically do not use TSSe to filter the cells in paired-Tag analysis. We are still trying to figure out the alternatives.
I also recommend you generate a fragment file for paired-tag data and start from there, since most downstream analyses will need this information. I know paired-tag doesn't give you fragments, but you can use single-ended insertion in this case. Here is an example:
chr1 3000132 3000233 fn:42:12:06 35 -
chr1 3000164 3000232 bu:0R:0F:08 36 -
chr1 3000227 3000285 eu:93:55:05 40 +
chr1 3000228 3000278 eu:93:55:05 38 +
chr1 3000228 3000329 ek:90:48:06 42 +
chr1 3000584 3000623 aw:77:83:05 42 -
chr1 3000612 3000649 av:L0:U1:06 40 -
1st column: chromosome 2nd column: start 3rd column: end 4th column: name of barcode 5th column: duplication count 6th column: strand
Note the 6th column doesn't exist in normal fragment files. It indicates the data is single-ended and tells the program to handle such data differently.
Yes. I realized that they all have different var_names
. I am not sure why it's happening but gonna check back preprocessing steps.
You mean manually generating fragment files using matrix right? For example:
chr10 10000000 10005000 01:20:05 0 +
chr10 10000000 10005000 01:21:02 0 +
chr10 10000000 10005000 01:24:03 5 +
chr10 10000000 10005000 01:27:01 10 +
chr10 100000000 100005000 01:20:05 3 +
chr10 100000000 100005000 01:21:02 16 +
chr10 100000000 100005000 01:24:03 0 +
chr10 100000000 100005000 01:27:01 0 +
In that case, how should I determine the 6th? Like, for this position, between + and -, which one should I use?
Also, If I do preprocessing step by making fragment files instead of using read_10x_mtx
, will inconsistent var_names still be the problem? I mapped them to the same reference and separated them to the same 5k bins, so I am not sure why var_names differed. But if it is still a problem, I will go back to check why this is happening.
Fragment files need to be generated from bam files. Once you have the fragment files, you can use snapatac2 to generate the cell by bin matrix and streamline the downstream analyses.
Currently snapatac2 cannot convert bam file to fragment file. I will work on that this weekend.
I've added a function for this: https://kzhang.org/SnapATAC2/_autosummary/snapatac2.pp.make_fragment_file.html.
It is usable but not quite finished yet. You can track the progress here: #1 . Here is an example usage (for Paired-Tag data):
import snapatac2 as snap
snap.pp.make_fragment_file(
"DNA_H3K4me3.bam",
"output.bed.gz",
is_paired = False,
barcode_regex = "(..:..:..:..):\w+$",
umi_regex = ":(\w+$)",
)
Note:
reach-tools
. Because the deduplication algorithm is more stringent than reach-tools'. make_fragment_file
uses the same algorithm as samtools markdup
and picard
do.Thank you so much! I will check that.
@yoshihiko1218 Have you generated a fragment file successfully? Because I am also dealing with Paired-tag data, I have the same problem as you.
Hi Dr.Zhang,
I am working on paired-tag data and as recommended, trying to analyze paired-tag DNA sequencing data using snapATAC. I wonder how I can load the results of paired-tag data("barcodes.tsv" barcode of cells,"genes.tsv" annotation of bins,"matrix.mtx" cell times bin matrix) into snapATAC?
Thank you so much!