Open yermek2030 opened 1 year ago
Yes you can use fanc pairs
for this. First, create a .pairs
file as outline in the docs.
Then use fanc pairs
filter options again to filter the object. You will probably see the greatest effect with point 1:
fanc pairs --filter-self-ligations
removes all reads where both ends map to the same fragmentfanc pairs --ligation-error-plot
creates "ligation error plot" as shown herefanc pairs -i <inward cutoff> -o <outward cutoff>
Hi ,
@yermek2030 I encountered a similar problem so I was wondering if you succeeded in removing the peak around 100 bp? (and with which settings -i, -o, -d). Because I followed the instructions @kaukrise suggested above but the peak is still present. (using -i 10000 -o 10000 -d 5000 -l
)
Additionally, @kaukrise I was wondering If you maybe have an explanation why I also have a peak around 10 bp? I don't understand why this peak is present and how I can remove this.
Thank you in advance for helping me!
Dear Simone,
I tried to do what @kaukrise suggested and it didn't work for me too. I did not write a reply to @kaukrise here because of this - I just pressed the 'like' button for now. I was thinking to return to this issue when I had more time. It would be great to hear how this issue can be solved.
Message ID: @.***>
Hi everyone,
could you please post the complete commands you used to do the filtering? It is difficult to guess where exactly the filtering might have failed otherwise.
Thank you!
@SOlsthoorn I don't know why you have a peak around 10bp, either. Your restriction profile looks a bit unusual in general to me - the mean insert size is quite small, and there are no large fragments at all. But this could be related to your particular protocol
Hi @kaukrise,
First I created a pairs file using:
fanc pairs bam_1.bam bam_2.bam pairs_output.pairs -g hg38_DpnII_fragments.bed -m -us -q 3 -t 20
Then I made a ligation error plot and a restriction site distance plot:
and based on that I filtered the pairs file:
fanc pairs pairs_output.pairs -i 10000 -o 10000 -d 5000 -l -p 2 -s stats.txt
resulting in the previous restriction site distance plot I send earlier.
So it seems like the peak around 100bp decreases a bit, but not completely disappeared and the peak around 10bp only increased in density ?
Yes, the restriction profile does indeed appear unusual. I thought that perhaps the mean would shift a bit more towards 200 bp, and the pattern would resemble a more normal distribution if the high peaks at the beginning were removed. In the protocol we only removed fragments <200bp and when we look at the distribution of the bioanalyzer a peak around 800 bp appears.
Thanks for your help!
Hi @kaukrise,
Sorry to bother you again with my issue. But could you maybe provide a more detailed explanation of the restriction site distance plot, what it represents and how it's constructed? On the FAN-C website, there is mention of using the -d parameter to filter read pairs based on their cumulative distance to the nearest restriction site. It's suggested that generating this re-dist plot, using only 10,000 read pairs, is valuable for determining -d and identifying library issues. Could you elaborate on this process? Maybe if I understand this better it can help me figure out the origin of short range peaks and find a way to remove them.
Thanks again in advance for your help!
Dear @kaukrise and @SimoneO98,
Thank you for your patience.
I cleaned my fastq files using Trimmomatic:
trimmomatic PE CTKOTd6r1_S11_R1_001.fastq CTKOTd6r1_S11_R2_001.fastq CTKOTd6r1_S11_R1.paired.fastq CTKOTd6r1_S11_R1.unpaired.fastq CTKOTd6r1_S11_R2.paired.fastq CTKOTd6r1_S11_R2.unpaired.fastq -trimlog trimlog_test ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10:2:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25 awk '$6>0 {print}' trimlog_test | wc TruSeq2-PE 4900918 29405508 305969367 TruSeq3-PE-2 4900918 29405508 305969367 TruSeq3-PE-2_Minlen 120: 0 0 0 TruSeq2-PE_Minlen 120: 0 0 0 TruSeq2-PE_Minlen 101: 0 0 0 TruSeq3-PE-2_Minlen_75: 3045614 18273684 190132512 TruSeq3-PE_Minlen_75: 3045614 18273684 190132512
Separately aligned forward and reverse FASTQ files using BWA:
bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R1.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R1_75trimmed.bam bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R2.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R2_75trimmed.bam
Created 'pairs' file according to the manual:
fanc pairs CTKOTd6r1_S11_R1_75trimmed.bam CTKOTd6r1_S11_R2_75trimmed.bam CTKOTd6r1_S11_R1_paired.pairs -g ~/myzdisk/0_genomes/mm39/mm39_DpnII.bed -us -q 3
Plotted the graphs according to the manual:
fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs fanc pairs --statistics-plot CTKOTd6r1_S11_paired.pairs_stats.png CTKOTd6r1_S11_paired.pairs fanc pairs --ligation-error-plot CTKOTd6r1_S11_paired.pairs_ligation-err.png CTKOTd6r1_S11_paired.pairs
Followed @kaukrise advice and tried to filter self-ligations:
fanc pairs --filter-self-ligations CTKOTd6r1_S11_paired.pairs
Re-plotted the image after filtering the pairs file. No new file was generated as a result of filtering, so I guessed that the existing file was modified:
fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs
Dear members of Vaquerizas lab,
I was wondering if there is a way to perform cleaning of the FASTQ or BAM files from the 'Dangling ends" fragments only? The necessity stems from the fact that I have performed BAM file alignment using standard BWA MEME and SAMTOOLS. Now when I plot the 'fragment size vs density' plot using FAN-C I have a peak at around ~100-110bp (Figure1), which disrupts the expected bell-like shape of the distribution we get with data from public repositories. Given that all the adaptors were removed prior to generating BAM files, I suspect that the peak comes from 'Dangling ends' as classified by the QC plot output of FAN-C, because the barplot for this type of contaminant is the only one that significantly differs from the (presumably 'clean') public datasets we use. If possible I would like to test my hypothesis by finding and removing 'dangling ends' fragments only. Many thanks.