jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
185 stars 27 forks source link

Consider fragments file as input. #95

Closed maxim-h closed 1 year ago

maxim-h commented 2 years ago

Hello and thanks for the software!

I was wondering if you would consider additional input formats like 10x Fragments file, which is common for scATAC data. I've seen a couple of cases when obtaining the BAMs for public data wasn't an option, but fragments file was. Also tools like chromap currently support SAM output only at a severe performance cost, though they are planning to fix it at some point.

Fragments file is basically BED, and after checking #85 I've successfully ran Genrich on it as follows:

bgzip -d -c atac_fragments.tsv.gz |\ #unzip the file
        awk -v ln=1 '{OFS="\t"; $4=ln++ $4; print}' |\ #add line number to read name (i.e. cell barcode)
        sed 's/.*/&\t+\n&\t-/' |\ #duplicate each line to simulate paired reads; add strand info
        bedtools bedtobam -i - -ubam -g ./hg38.chrom.sizes |\ #convert BED to BAM
        samtools view -h | \ #convert BAM to SAM
        awk '{OFS="\t"; if ($1 ~/^@/) {print} else {{$7="="; $8=$4}; if ($2 == 0) {$2=67;print} else if ($2 == 16) {$2=147;print}}}' |\ #manually edit FLAG, RNEXT, PNEXT
Genrich -j \
       -t - \
       -o ./output/peaks_from_fragments.narrowPeak \
       -z \
       -D \ # adjustment is already done in fragments file
       -S \ # it's coordinate sorted, but grouped by QNAME, so should be fine (?)
       -q 0.01 \
       -v 2> ./output/genr_from_fragments.log

You can see that this pre-processing is a little involved and error prone. Additionally it negates the great performance of Genrich because BedToBam is a major bottleneck.

Do you think it would make sense to add support for Fragment files or just BED files in general?

jsh58 commented 2 years ago

Thanks for the question. This is not an unreasonable request, though I hesitate to commit to it for several reasons:

maxim-h commented 1 year ago

I see. Your reasoning makes sense. And for now I'm fine with the solution I found. So probably no reason to keep the issue open.

P.S.: I actually wrote a very stupid conversion program to replace bedtools step. Don't really recommend anyone to use it. But it does supply Genrich as fast as it can read it.