hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
139 stars 27 forks source link

F5C is not processing all of the RNA reads #82

Closed lmulroney closed 3 years ago

lmulroney commented 3 years ago

Hi @hasindu2008

The speed increase of f5c over nanopolish is impressive. I am interesting in continuing to use f5c, but I noticed that when processing RNA data there is a discrepancy on the number of reads processed with nanopolish. F5c eventalign is processing ~57% of the reads that nanopolish eventalign is processing.

At first we thought this was an indexing issue, but the outcome is the same with or without using the sequence summary file for indexing.

I did some systematic testing using a single flowcell (UCSC_Run5) which has 628,824 reads from the RNA consortium data linked here https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md

Nanopolish_vs_f5c

I installed f5c using conda and used the following commands:

conda activate ont_env f5c index -d fast5_directory UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.fastq f5c eventalign --rna -t 24 --iop 10 -r UCSC_run5/no_seq_sum/test1/UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.fastq -b UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam -g gencode.v36.transcripts.fa --scale-events --print-read-names --samples > UCSC_Run5_f5c_eventalign_NoseqSum.txt

One output message that stood out was this one: [meth_main] total entries: 361155, qc fail: 6564, could not calibrate: 331, no alignment: 2441, bad fast5: 0

It appears that ~57% of the reads are being processed by 5fc. However, the similar message from nanopolish appears to be counting each read ~four times: [post-run summary] total reads: 2101455, unparseable: 0, qc fail: 51474, could not calibrate: 6833, no alignment: 15153, bad fast5: 0

I would appreciate your thoughts as to why so many RNA reads are being skipped by f5c.

jts commented 3 years ago

Hi,

I just want to note that I recently fixed a double-counting issue in the summary line:

https://github.com/jts/nanopolish/commit/7fde4f482f5c0a04fd135178dc83be91c16fe85d

Jared

lmulroney commented 3 years ago

Hi Jared,

Thanks for pointing that out. I'll make sure to use that edit going forwards!

Logan

hasindu2008 commented 3 years ago

It is highly likely to be what @jts mentioned. Thanks, @jts for mentioning this.

@lmulroney Could you please do a samtools flagstat UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam? Just to verify if this difference could be due to f5c not processing the secondary reads and/or alignments < 20 MAPQ, unless --secondary=yes --min-mapq 0 specified. But as I am aware, nanopolish also has a MAPQ cut off at 20, which effectively eliminates secondary alignments as their MAPQ tends to be always 0.

lmulroney commented 3 years ago

Hi @hasindu2008, Here is the output from samtools flagstat:

samtools flagstat UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam 2126894 + 0 in total (QC-passed reads + QC-failed reads) 1471158 + 0 secondary 26912 + 0 supplementary 0 + 0 duplicates 2101461 + 0 mapped (98.80% : N/A) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (N/A : N/A) 0 + 0 with itself and mate mapped 0 + 0 singletons (N/A : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Let me know if there is anything else I can do to help.

Cheers, Logan

lmulroney commented 3 years ago

I will try rerunning both tools using the alignment conditions --secondary=yes --min-mapq 0 and see if there are any changes.

All the best, Logan

hasindu2008 commented 3 years ago

@lmulroney

Also could you please do a samtools view UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam | awk '{if($5>=20){print $5}}' | wc -l to see how many entries are having MAPQ >= 20? This number should be apprximately equal to the 361,155 that f5c gave.

To run f5c for all primary mappings, use --min-mapq 0 and f5c will process ~630,000 total entries. To also include secondary mappings, use --secondary=yes --min-mapq 0, and f5c will process ~2,101,000 entries.

lmulroney commented 3 years ago

@hasindu2008

Here are the results from using f5c with --secondary=yes --min-mapq 0 [meth_main] total entries: 2101461, qc fail: 50552, could not calibrate: 6524, no alignment: 16389, bad fast5: 0 Complete overlap between nanopolish and f5c now.

samtools view UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam | awk '{if($5>=20){print $5}}' | wc -l 361155

lmulroney commented 3 years ago

Venn diagram of reads processed by Nanopolish and f5c using --secondary=yes --min-mapq 0 options image

Logan

hasindu2008 commented 3 years ago

Thanks @lmulroney , so seems like Nanoplish eventalign default behaviour is to process all the records in the BAM file despite the flags or mapq, while Nanoplish call-methylation only considers MAPQ>=20. So I suggest using the --secondary=yes --min-mapq 0 parameters with f5c in future for eventalign, if you do not want to ignore low-quality mappings.

Out of curiosity, what are the execution times of f5c --secondary=yes --min-mapq 0 vs nanopolish?

lmulroney commented 3 years ago

Hi @hasindu2008 Here are the commands I used to run and the CPU and turnaround times.

f5c f5c eventalign --rna -t 24 --iop 10 -r UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.fastq -b UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam -g gencode.v36.transcripts.fa --scale-events --print-read-names --samples --secondary=yes --min-mapq 0 > UCSC_Run5_f5c_eventalign_NoseqSum_test1.2.txt

CPU time : 174979.00 sec. Turnaround time : 10885 sec.

Nanopolish nanopolish eventalign -t 24 -r UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.fastq -b UCSC_Run5_20170922_DirectRNA.pass.dedup.NoU.minimap2_gencodev36_sorted.bam -g gencode.v36.transcripts.fa --scale-events --print-read-names --samples > UCSC_Run5_nanopolish_eventalign_NoseqSum_test1.2.txt

CPU time : 164984.23 sec. Turnaround time : 89460 sec.

hasindu2008 commented 3 years ago

I am closing this issue. If you have any more troubles with f5c do not hesitate to reopen or start a new issue.