NGMLR very slow on bovine nanopore reads

sdjebali commented 4 years ago

Dear all,

First of all, thanks for this very nice development.

I wanted to report that on some fairly large bovine ONT runs, NGMLR followed by samtools sort was very slow (about 4 days for 4 million reads).

I was wondering whether I am using the tool correctly, with the right parameters.

I tried with the first 1 million reads like this:

    zcat $fastq | head -n 4000000 | ngmlr --presets ont -t 22 -r $genome | samtools sort -@ 6 -o $output

and it took 5h23 to complete.

I then tried with the second 1 million reads like this:

    zcat $fastq | tail -n+4000000 | head -n 4000000 | ngmlr --presets ont -t 22 -r $genome | samtools sort -@ 4 -o $output

and it took 24h10 to complete.
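A side note on the second command above: `tail -n+4000000` starts output at line 4,000,000, which in a strict four-lines-per-record FASTQ is the quality line of read 1,000,000, so the stream passed on may be shifted by one line relative to the record boundaries. The sketch below shows one way to select a batch on record boundaries. The helper name `fastq_batch` is hypothetical, and the code assumes unwrapped, four-line FASTQ records; it is not claimed to explain the runtime difference.

```shell
# Hypothetical helper: emit the N-th batch (1-based) of reads from a FASTQ
# stream on stdin. Assumes exactly 4 lines per record (no wrapped sequences).
# The body runs in a subshell, so its variables do not leak to the caller.
fastq_batch() (
    batch=$1                  # which batch to emit, starting at 1
    reads_per_batch=$2        # number of reads in each batch
    lines=$((reads_per_batch * 4))
    start=$(( (batch - 1) * lines + 1 ))   # first line of the requested batch
    tail -n "+${start}" | head -n "${lines}"
)

# Usage with the pipeline from the post (second batch of 1 million reads):
#   zcat $fastq | fastq_batch 2 1000000 \
#     | ngmlr --presets ont -t 22 -r $genome \
#     | samtools sort -@ 4 -o $output
```

With `batch=2` and `reads_per_batch=1000000`, the start line works out to 4,000,001, the header line of read 1,000,001.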

I am using NGMLR version 0.2.8 and samtools version 1.9. Here are the details of my machine (24 processors):

Linux tatum 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz

Any advice would be very welcome.

Best, Sarah

fritzsedlazeck commented 4 years ago

Thanks Sarah, do you have an average read length? It's likely, but unfortunate, that some of the reads in your second batch are very long.

Thanks, Fritz
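For a quick answer to the read-length question without a dedicated QC tool, a minimal sketch along these lines could be used. The helper name `fastq_length_stats` is hypothetical, and the code assumes unwrapped, four-line FASTQ records. It reports the maximum as well as the mean, since a few very long reads can hide behind an unremarkable average.

```shell
# Hypothetical helper: read a FASTQ stream on stdin and print the number of
# reads, the mean read length, and the longest read length.
# Assumes exactly 4 lines per record, so line 2 of each record is the sequence.
fastq_length_stats() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             total += len
             if (len > max) max = len
         }
         END {
             if (n > 0) printf "reads=%d mean=%.1f max=%d\n", n, total / n, max
         }'
}

# Usage on the first 1 million reads of the file from the post:
#   zcat $fastq | head -n 4000000 | fastq_length_stats
```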

sdjebali commented 4 years ago

Indeed there seems to be a big read length difference between the two batches.

I ran NanoPlot on them and here are the results:

First 1 million reads:

General summary:
Mean read length: 4,722.5
Mean read quality: 4.4
Median read length: 906.0
Median read quality: 4.2
Number of reads: 1,000,000.0
Read length N50: 14,404.0
Total bases: 4,722,479,679.0

Number, percentage and megabases of reads above quality cutoffs:
Q5: 367454 (36.7%) 3015.3Mb
Q7: 8 (0.0%) 0.1Mb
Q10: 0 (0.0%) 0.0Mb
Q12: 0 (0.0%) 0.0Mb
Q15: 0 (0.0%) 0.0Mb

Top 5 highest mean basecall quality scores and their read lengths:
1: 7.0 (17272)
2: 7.0 (9848)
3: 7.0 (25242)
4: 7.0 (12091)
5: 7.0 (25093)

Top 5 longest reads and their mean basecall quality score:
1: 2210466 (3.6)
2: 1850945 (3.8)
3: 1772717 (3.6)
4: 1685671 (3.9)
5: 1563326 (3.9)

Second 1 million reads:

General summary:
Mean read length: 13,668.0
Mean read quality: 11.1
Median read length: 13,451.0
Median read quality: 11.8
Number of reads: 1,000,000.0
Read length N50: 16,657.0
Total bases: 13,668,019,254.0

Number, percentage and megabases of reads above quality cutoffs:
Q5: 963153 (96.3%) 13574.0Mb
Q7: 937982 (93.8%) 13387.4Mb
Q10: 781757 (78.2%) 10950.3Mb
Q12: 446035 (44.6%) 6333.8Mb
Q15: 165 (0.0%) 1.6Mb

Top 5 highest mean basecall quality scores and their read lengths:
1: 16.3 (2090)
2: 16.2 (243)
3: 16.1 (362)
4: 16.1 (570)
5: 16.1 (1509)

Top 5 longest reads and their mean basecall quality score:
1: 884004 (3.7)
2: 274368 (5.2)
3: 187850 (4.8)
4: 150969 (3.8)
5: 124444 (9.8)

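So the mean read length is about 13.7 kb in the second batch versus about 4.7 kb in the first (medians: 13,451 vs 906).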
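To put the two runs on a common scale, the NanoPlot totals can be combined with the reported wall-clock times (5h23 and 24h10) to give megabases processed per minute. This is only a rough sketch: the helper name `throughput` is hypothetical, the times cover the whole pipeline including samtools sort, and the two runs used different sort thread counts (-@ 6 vs -@ 4).

```shell
# Hypothetical helper: megabases processed per minute of wall-clock time.
#   $1 = total bases, $2 = hours, $3 = minutes
throughput() {
    awk -v bases="$1" -v h="$2" -v m="$3" \
        'BEGIN { printf "%.2f\n", bases / (h * 60 + m) / 1e6 }'
}

throughput 4722479679 5 23      # first 1 million reads
throughput 13668019254 24 10    # second 1 million reads
```

The second batch holds roughly 2.9 times as many bases as the first but took roughly 4.5 times as long, so the slowdown is larger than the increase in total bases alone would suggest.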

If we still want to use NGMLR on these data, is there any option that can speed the process up?

Best, Sarah

philres / ngmlr

NGMLR very slow on bovine nanopore reads #70