Open hermandebeukelaer opened 2 weeks ago
Can you generate the PWM on the plus and minus strands separately?
Dear @panushri25,
I have generated the logos for both strands separately.
Positive strand
Code used to generate logo:
samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="+"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/pos.bedGraph
bedGraphToBigWig /out/pos.bedGraph /potter/ath.chrom.sizes.txt /out/pos.bw
python build_pwm_from_bigwig.py -i /out/pos.bw -g /potter/ath.fasta -op /out/atac_no_shift_pos -cr Chr1 -c /potter/ath.chrom.sizes.txt
Negative strand
Code used to generate logo:
samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="-"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/neg.bedGraph
bedGraphToBigWig /out/neg.bedGraph /potter/ath.chrom.sizes.txt /out/neg.bw
python /scratch/chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i /out/neg.bw -g /potter/ath.fasta -op /out/atac_no_shift_neg -cr Chr1 -c /potter/ath.chrom.sizes.txt
Does this give you any hint on what might be the issue?
I am trying to train ChromBPNet on Arabidopsis ATAC-seq data, but run into the error
"Input file shifts inconsistent"
as in #153, #174 and #176. Having read through these other issues, I was able to generate the PWM for my BAM file:The raw fastq data was processed using the nf-core ATAC-seq pipeline with Bowtie2 as the chosen aligner. To train ChromBPNet I took the merged BAM file containing the alignments of all replicates (found under
bowtie2/merged_replicate/
among the output of the Nextflow pipeline).My command used for training (I am running on a computing cluster with Apptainer):
This worked fine when using the data from the tutorial, but yields the inconsistent shift error on my own data. To my knowledge, the nf-core ATAC-seq pipeline does not apply any shifts, at least I did not find it in the documentation.
The commands I used to generate the above PWM from my BAM file:
Do you have any suggestions on how to proceed? I don't have sufficient background on Tn5 bias to judge if the PWM looks as expected. Are there any references you suggest to learn more about Tn5 bias? I am wondering if the issue could be due to specific bias in Arabidopsis or plants in general, compared to the human genome.