vaquerizaslab / fanc

FAN-C: Framework for the ANalysis of C-like data
GNU General Public License v3.0
107 stars 14 forks source link

Cleaning by removing 'Dangling ends' fragments only? #164

Open yermek2030 opened 1 year ago

yermek2030 commented 1 year ago

Dear members of Vaquerizas lab,

I was wondering if there is a way to perform cleaning of the FASTQ or BAM files from the 'Dangling ends" fragments only? The necessity stems from the fact that I have performed BAM file alignment using standard BWA MEME and SAMTOOLS. Now when I plot the 'fragment size vs density' plot using FAN-C I have a peak at around ~100-110bp (Figure1), which disrupts the expected bell-like shape of the distribution we get with data from public repositories. Given that all the adaptors were removed prior to generating BAM files, I suspect that the peak comes from 'Dangling ends' as classified by the QC plot output of FAN-C, because the barplot for this type of contaminant is the only one that significantly differs from the (presumably 'clean') public datasets we use. If possible I would like to test my hypothesis by finding and removing 'dangling ends' fragments only. Many thanks. github

kaukrise commented 1 year ago

Yes you can use fanc pairs for this. First, create a .pairs file as outline in the docs.

Then use fanc pairs filter options again to filter the object. You will probably see the greatest effect with point 1:

  1. fanc pairs --filter-self-ligations removes all reads where both ends map to the same fragment
  2. fanc pairs --ligation-error-plot creates "ligation error plot" as shown here
  3. From the ligation error plot, choose appropriate cutoffs for ligation errors (see docs and filter using fanc pairs -i <inward cutoff> -o <outward cutoff>
SimoneO98 commented 1 year ago

Hi ,

@yermek2030 I encountered a similar problem so I was wondering if you succeeded in removing the peak around 100 bp? (and with which settings -i, -o, -d). Because I followed the instructions @kaukrise suggested above but the peak is still present. (using -i 10000 -o 10000 -d 5000 -l)

Additionally, @kaukrise I was wondering If you maybe have an explanation why I also have a peak around 10 bp? I don't understand why this peak is present and how I can remove this. image

Thank you in advance for helping me!

yermek2030 commented 1 year ago

Dear Simone,

I tried to do what @kaukrise suggested and it didn't work for me too. I did not write a reply to @kaukrise here because of this - I just pressed the 'like' button for now. I was thinking to return to this issue when I had more time. It would be great to hear how this issue can be solved.

Message ID: @.***>

kaukrise commented 1 year ago

Hi everyone,

could you please post the complete commands you used to do the filtering? It is difficult to guess where exactly the filtering might have failed otherwise.

Thank you!

@SOlsthoorn I don't know why you have a peak around 10bp, either. Your restriction profile looks a bit unusual in general to me - the mean insert size is quite small, and there are no large fragments at all. But this could be related to your particular protocol

SimoneO98 commented 1 year ago

Hi @kaukrise,

First I created a pairs file using: fanc pairs bam_1.bam bam_2.bam pairs_output.pairs -g hg38_DpnII_fragments.bed -m -us -q 3 -t 20

Then I made a ligation error plot and a restriction site distance plot: image image

and based on that I filtered the pairs file: fanc pairs pairs_output.pairs -i 10000 -o 10000 -d 5000 -l -p 2 -s stats.txt resulting in the previous restriction site distance plot I send earlier.

So it seems like the peak around 100bp decreases a bit, but not completely disappeared and the peak around 10bp only increased in density ?

Yes, the restriction profile does indeed appear unusual. I thought that perhaps the mean would shift a bit more towards 200 bp, and the pattern would resemble a more normal distribution if the high peaks at the beginning were removed. In the protocol we only removed fragments <200bp and when we look at the distribution of the bioanalyzer a peak around 800 bp appears.

Thanks for your help!

SimoneO98 commented 1 year ago

Hi @kaukrise,

Sorry to bother you again with my issue. But could you maybe provide a more detailed explanation of the restriction site distance plot, what it represents and how it's constructed? On the FAN-C website, there is mention of using the -d parameter to filter read pairs based on their cumulative distance to the nearest restriction site. It's suggested that generating this re-dist plot, using only 10,000 read pairs, is valuable for determining -d and identifying library issues. Could you elaborate on this process? Maybe if I understand this better it can help me figure out the origin of short range peaks and find a way to remove them.

Thanks again in advance for your help!

yermek2030 commented 11 months ago

Dear @kaukrise and @SimoneO98,

Thank you for your patience.

  1. I cleaned my fastq files using Trimmomatic:

    trimmomatic PE CTKOTd6r1_S11_R1_001.fastq CTKOTd6r1_S11_R2_001.fastq CTKOTd6r1_S11_R1.paired.fastq CTKOTd6r1_S11_R1.unpaired.fastq CTKOTd6r1_S11_R2.paired.fastq CTKOTd6r1_S11_R2.unpaired.fastq -trimlog trimlog_test ILLUMINACLIP:adapters/TruSeq3-PE-2.fa:2:30:10:2:True LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:25 awk '$6>0 {print}' trimlog_test | wc TruSeq2-PE 4900918 29405508 305969367 TruSeq3-PE-2 4900918 29405508 305969367 TruSeq3-PE-2_Minlen 120: 0 0 0 TruSeq2-PE_Minlen 120: 0 0 0 TruSeq2-PE_Minlen 101: 0 0 0 TruSeq3-PE-2_Minlen_75: 3045614 18273684 190132512 TruSeq3-PE_Minlen_75: 3045614 18273684 190132512

  2. Separately aligned forward and reverse FASTQ files using BWA:

    bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R1.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R1_75trimmed.bam bwa mem -A 1 -B 4 -E 50 -L 0 -t 8 ../0_genomes/mm39/mm39.fa cleaned_by_TruSeq3-PE-2-75/CTKOTd6r1_S11_R2.paired.fastq | samtools view -Shb - > mm39_bwa_out/CTKOTd6r1_S11_R2_75trimmed.bam

  3. Created 'pairs' file according to the manual:

    fanc pairs CTKOTd6r1_S11_R1_75trimmed.bam CTKOTd6r1_S11_R2_75trimmed.bam CTKOTd6r1_S11_R1_paired.pairs -g ~/myzdisk/0_genomes/mm39/mm39_DpnII.bed -us -q 3

  4. Plotted the graphs according to the manual:

    fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs fanc pairs --statistics-plot CTKOTd6r1_S11_paired.pairs_stats.png CTKOTd6r1_S11_paired.pairs fanc pairs --ligation-error-plot CTKOTd6r1_S11_paired.pairs_ligation-err.png CTKOTd6r1_S11_paired.pairs

  5. Followed @kaukrise advice and tried to filter self-ligations:

    fanc pairs --filter-self-ligations CTKOTd6r1_S11_paired.pairs

  6. Re-plotted the image after filtering the pairs file. No new file was generated as a result of filtering, so I guessed that the existing file was modified:

    fanc pairs --re-dist-plot CTKOTd6r1_S11_paired_re-dist.png CTKOTd6r1_S11_paired.pairs