kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)
https://github.com/kundajelab/chrombpnet/wiki
MIT License
107 stars 27 forks source link

Input file shifts inconsistent (Arabidopsis) #197

Open hermandebeukelaer opened 2 weeks ago

hermandebeukelaer commented 2 weeks ago

I am trying to train ChromBPNet on Arabidopsis ATAC-seq data, but run into the error "Input file shifts inconsistent" as in #153, #174 and #176. Having read through these other issues, I was able to generate the PWM for my BAM file:

atac_no_shift

The raw fastq data was processed using the nf-core ATAC-seq pipeline with Bowtie2 as the chosen aligner. To train ChromBPNet I took the merged BAM file containing the alignments of all replicates (found under bowtie2/merged_replicate/ among the output of the Nextflow pipeline).

My command used for training (I am running on a computing cluster with Apptainer):

apptainer exec  --nv -e \
                --no-mount /scratch \
                --bind data:/data \
                --bind bias_model:/bias_model \
                --bind potter-kc:/potter \
                --bind out:/output \
                chrombpnet.sif \
                chrombpnet pipeline \
                -ibam /potter/CONTROL.mRp.clN.sorted.bam \
                -d  ATAC \
                -g  /potter/ath.fasta \
                -c  /potter/ath.chrom.sizes.txt \
                -p  /data/peaks_no_blacklist.bed \
                -n  /data/background_negatives.bed \
                -fl /data/train1-3_val4_test5.json \
                -b  /bias_model/ENCSR868FGK_bias_fold_0.h5 \
                -o  /output

This worked fine when using the data from the tutorial, but yields the inconsistent shift error on my own data. To my knowledge, the nf-core ATAC-seq pipeline does not apply any shifts, at least I did not find it in the documentation.

The commands I used to generate the above PWM from my BAM file:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/tmp2
bedGraphToBigWig /out/tmp2 /potter/ath.chrom.sizes.txt /out/unstranded.bw
python build_pwm_from_bigwig.py -i /out/unstranded.bw -g /potter/ath.fasta -op /out/atac_no_shift -cr Chr1 -c /potter/ath.chrom.sizes.txt

Do you have any suggestions on how to proceed? I don't have sufficient background on Tn5 bias to judge if the PWM looks as expected. Are there any references you suggest to learn more about Tn5 bias? I am wondering if the issue could be due to specific bias in Arabidopsis or plants in general, compared to the human genome.

panushri25 commented 1 week ago

Can you generate the PWM on the plus and minus strands separately?

hermandebeukelaer commented 6 days ago

Dear @panushri25,

I have generated the logos for both strands separately.

Positive strand

atac_no_shift_pos

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="+"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/pos.bedGraph
bedGraphToBigWig /out/pos.bedGraph /potter/ath.chrom.sizes.txt /out/pos.bw
python build_pwm_from_bigwig.py -i /out/pos.bw -g /potter/ath.fasta -op /out/atac_no_shift_pos -cr Chr1 -c /potter/ath.chrom.sizes.txt

Negative strand

atac_no_shift_neg

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="-"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/neg.bedGraph
bedGraphToBigWig /out/neg.bedGraph /potter/ath.chrom.sizes.txt /out/neg.bw
python /scratch/chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i /out/neg.bw -g /potter/ath.fasta -op /out/atac_no_shift_neg -cr Chr1 -c /potter/ath.chrom.sizes.txt

Does this give you any hint on what might be the issue?