haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
191 stars 21 forks source link

Counts over a given bed file #7

Open dawe opened 3 years ago

dawe commented 3 years ago

Hi all, I'm testing chromap and I'd say I'm impressed by the performances. In addition to the current options, would you consider the possibility to output counts over given intervals? Catalogs of epigenetic features are available and it would be interesting to quantify known annotations.

dawe commented 3 years ago

If this option will be ever be implemented, it may be useful to allow output in .mtx format when analysing single cell data

haowenz commented 3 years ago

We actually have this implemented but not made available to users for now. We may include this feature in the future release. For now you might be able to get this by running bedtools on Chromap bed output and interval bed files.

dawe commented 3 years ago

I vote for exposing this feature + mtx output, there are some scATAC-seq analysis approaches which do not rely on fragments, for example this and this.

dawe commented 3 years ago

Just checking… can you disclose which version of Chromap will enable this feature? Alternatively, can you hint how to expose it myself and give it a try?

haowenz commented 3 years ago

Hi,

I checked the code and found out that currently Chromap can only generate cell-by-bin matrix with fixed bin sizes instead of taking a BED file that contains the intervals to generate the matrix. It would require some time and efforts to implement and test that feature. As I have limited bandwidth at the moment, perhaps we will not include this feature in a recent release. So far you may try to see if there are other scripts or command lines that can help you generate the matrix using the output fragment file of Chromap.

dawe commented 3 years ago

My BED files are mostly fixed size bins, so I guess it will do the job! Thanks, I’ll spend some time in the following days to test it properly

dawe commented 3 years ago

Hello, I tried to expose this feature by uncommenting lines https://github.com/haowenz/chromap/blob/567fc3ed407dc577518b6ec779222d778f9eaf35/src/chromap.cc#L5941 to https://github.com/haowenz/chromap/blob/567fc3ed407dc577518b6ec779222d778f9eaf35/src/chromap.cc#L5950 but my output doesn’t seem to be binned, I have fragments resembling a ATAC output. I will still try to play with options, just wanted to be sure the expected output will be in the form of

chrom 0 5000 cell count 
chrom 5000 10000 cell count 
…
dawe commented 3 years ago

Followup, I have a scATAC-seq with R1 and R3 to be mapped and cell barcodes in R2 (starting at position 9). The output of this

chromap -x GRCm38.index -r GRCm38.fa.gz   --low-mem -t 4 -1 READ1.fq.gz -2 READ3.fq.gz --barcode READ2.fq.gz --barcode-whitelist whitelist.tsv --read-format bc:9:-1,r1:0:-1,r2:0:-1 -o output_peaks

and

chromap -x GRCm38.index -r GRCm38.fa.gz   --low-mem -t 4 -1 READ1.fq.gz -2 READ3.fq.gz --barcode READ2.fq.gz --barcode-whitelist whitelist.tsv --read-format bc:9:-1,r1:0:-1,r2:0:-1 -o output_matrix --cell-by-bin  --bin-size 5000

is the same:

$ md5sum output_atac output_matrix
a4265e26970a8d3b249ac82ccf7e5ce9  output_atac
a4265e26970a8d3b249ac82ccf7e5ce9  output_matrix

as if the --cell-by-bin --bin-size 5000 flags are not taken into account

haowenz commented 3 years ago

Thanks for trying. You also have to uncomment these two lines: https://github.com/haowenz/chromap/blob/567fc3ed407dc577518b6ec779222d778f9eaf35/src/chromap.cc#L5968-L5969

And set this parameter to a name prefix you desire, there will be several files generated with that name prefix. But overall this feature is not tested after we made changes to lots of code. So I am no sure if it will still work.

dawe commented 3 years ago

Thanks, I missed those lines. Anyhow, you were right, it doesn't work in the sense that

chromap -x GRCm38.index -r GRCm38.fa.gz   \\
--low-mem -t 4 -1 READ1.fq.gz -2 READ3.fq.gz --barcode READ2.fq.gz \\
--barcode-whitelist whitelist.tsv --read-format bc:9:-1,r1:0:-1,r2:0:-1 \\
-o FOUT --cell-by-bin  --bin-size 5000 --matrix-output-prefix FMAT

only returns FOUT but no FMAT*