marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu takes too long to run.. I think #839

Closed: Omer-K closed this issue 6 years ago

Omer-K commented 6 years ago

I'm using a MinION 1D fastq file which contains an amplicon. I should note that the PCR amplification was probably not as specific as we'd wish. Either way, it's a rather large file:

    Number of reads:      615,398
    Total bases:          980,028,664
    Median read length:   1,386.0
    Mean read length:     1,592.51
    Read length N50:      1,901

I'm running the following command:

    canu -d ./Folder/ -p GeneName GenomeSize=7600 minReadLength=500 minOverlapLength=100 -nanopore-raw Reads.fastq corOutCoverage=100

Canu version is 1.6 and it's running on a remote Red Hat Enterprise Linux Server release 6.9 (Santiago).

The problem is that (I think) it takes too long to overlap the reads. It's been almost two hours and only two of the mhap overlap jobs have finished.

Would tweaking the parameters make a difference? Would you perhaps suggest filtering out lower-quality reads and starting with less data in the fastq file?

Here are the last few lines of the log as it's currently running.

-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=low based on read coverage of 126259.
--
-- PARAMETERS: hashes=256, minMatches=3, threshold=0.85
--
-- Given 6 GB, can fit 18000 reads per block.
-- For 33 blocks, set stride to 8 blocks.
-- Logging partitioning to 'correction/1-overlapper/partitioning.log'.
-- Configured 32 mhap precompute jobs.
-- Configured 77 mhap overlap jobs.
-- Finished stage 'cor-mhapConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'cormhap' concurrent execution on Tue Mar 27 17:49:16 2018 with 324.54 GB free disk space (32 processes; 1 concurrently)

    cd correction/1-overlapper
    ./precompute.sh 1 > ./precompute.000001.out 2>&1
    ./precompute.sh 2 > ./precompute.000002.out 2>&1
    ./precompute.sh 3 > ./precompute.000003.out 2>&1
    ./precompute.sh 4 > ./precompute.000004.out 2>&1
    ./precompute.sh 5 > ./precompute.000005.out 2>&1
    ./precompute.sh 6 > ./precompute.000006.out 2>&1
    ./precompute.sh 7 > ./precompute.000007.out 2>&1
    ./precompute.sh 8 > ./precompute.000008.out 2>&1
    ./precompute.sh 9 > ./precompute.000009.out 2>&1
    ./precompute.sh 10 > ./precompute.000010.out 2>&1
    ./precompute.sh 11 > ./precompute.000011.out 2>&1
    ./precompute.sh 12 > ./precompute.000012.out 2>&1
    ./precompute.sh 13 > ./precompute.000013.out 2>&1
    ./precompute.sh 14 > ./precompute.000014.out 2>&1
    ./precompute.sh 15 > ./precompute.000015.out 2>&1
    ./precompute.sh 16 > ./precompute.000016.out 2>&1
    ./precompute.sh 17 > ./precompute.000017.out 2>&1
    ./precompute.sh 18 > ./precompute.000018.out 2>&1
    ./precompute.sh 19 > ./precompute.000019.out 2>&1
    ./precompute.sh 20 > ./precompute.000020.out 2>&1
    ./precompute.sh 21 > ./precompute.000021.out 2>&1
    ./precompute.sh 22 > ./precompute.000022.out 2>&1
    ./precompute.sh 23 > ./precompute.000023.out 2>&1
    ./precompute.sh 24 > ./precompute.000024.out 2>&1
    ./precompute.sh 25 > ./precompute.000025.out 2>&1
    ./precompute.sh 26 > ./precompute.000026.out 2>&1
    ./precompute.sh 27 > ./precompute.000027.out 2>&1
    ./precompute.sh 28 > ./precompute.000028.out 2>&1
    ./precompute.sh 29 > ./precompute.000029.out 2>&1
    ./precompute.sh 30 > ./precompute.000030.out 2>&1
    ./precompute.sh 31 > ./precompute.000031.out 2>&1
    ./precompute.sh 32 > ./precompute.000032.out 2>&1

-- Finished on Tue Mar 27 18:15:02 2018 (1546 seconds) with 315.664 GB free disk space
----------------------------------------
-- All 32 mhap precompute jobs finished successfully.
-- Finished stage 'cor-mhapPrecomputeCheck', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'cormhap' concurrent execution on Tue Mar 27 18:15:02 2018 with 315.664 GB free disk space (77 processes; 1 concurrently)

    cd correction/1-overlapper
    ./mhap.sh 1 > ./mhap.000001.out 2>&1
    ./mhap.sh 2 > ./mhap.000002.out 2>&1

Thanks.

skoren commented 6 years ago

Canu isn't optimized to find overlaps at 100 bp, so that's increasing your compute time. You also have ~130,000x coverage, which will lead to lots and lots of overlaps.

I'd suggest randomly subsampling to 100-200x coverage; even if the amplification isn't as specific as you'd wish, you should still have enough data to assemble. Also, use the -fast option available in 1.7.
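For example, a 1.7 run could look something like this (just a sketch; the directory, prefix, and the subsampled filename are carried over from your command above and assumed to exist):

    canu -fast -d ./Folder/ -p GeneName genomeSize=7600 \
         -nanopore-raw Reads.subsampled.fastq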

Omer-K commented 6 years ago

Thanks for the tips! Do you think keeping the current file (no random sub-sampling) and increasing minOverlapLength to, let's say, 250 or 500 would speed things up?

Also, I'll make an effort to get the server IT lady to upgrade to 1.7.

skoren commented 6 years ago

I think it would speed up with a larger minimum overlap, but not by much. With ~1 Gbp you have as much data as a medium-sized eukaryotic genome assembly, but it's worse, because the overlaps will be much more frequent and deeper.

You can get the fast functionality in 1.6 by using overlapper=mhap utgReAlign=true.
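For example (a sketch only; directory, prefix, and input filename are taken from your original command):

    canu -d ./Folder/ -p GeneName genomeSize=7600 \
         overlapper=mhap utgReAlign=true \
         -nanopore-raw Reads.fastq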

Omer-K commented 6 years ago

Thanks. Before you replied, I had already filtered the reads: I only kept reads that are >500 bp with Q > 10, which reduced the number of reads almost by half. After 14.5 hours, 'cor-mhapCheck' finished its 50 processes and 'ovStoreBuild' started. Since 'cor-mhapCheck' was behind me, I was willing to wait.

Now the problem is that I mistakenly pressed Ctrl+C and terminated Canu (facepalm). Is there a way to restart from that point onward, instead of re-running the whole pipeline?

skoren commented 6 years ago

If you run the same command you used before for canu, it will pick up where it left off automatically.

Omer-K commented 6 years ago

I tried that. I got an error message:

ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to create the overlap store.
ABORT:
ABORT: Disk space available:  106.125 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/Gene.ovlStore.err):
ABORT:
ABORT:

I therefore restarted with a filtered fastq file (as mentioned above) and added the parameters you recommended, overlapper=mhap utgReAlign=true.

Still, it takes a long time.

skoren commented 6 years ago

What's the output in correction/Gene.ovlStore.err for the failed run?

The fast option only affects things post-correction; the correction will take exactly the same time as before. Having >50,000x coverage isn't a common use case.

Omer-K commented 6 years ago

> What's the output in correction/Gene.ovlStore.err for the failed run?

I've actually deleted the folder produced by that run, as I restarted and had to free up storage.

> The fast option only affects things post-correction; the correction will take exactly the same time as before. Having >50,000x coverage isn't a common use case.

Then random sub-sampling is still a good suggestion, I guess? How would you suggest going about it in an easy manner? I have no programming experience other than MATLAB and some Unix command line.

Also, what coverage should suffice?

skoren commented 6 years ago

You could do something as simple as take the beginning of the fastq input, assuming there is no bias in sequencing time. Canu also has a utility called fastqSample that lets you subsample to a desired coverage. It expects the input to be named file.u.fastq and you can run fastqSample without options to get the help.
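For the first approach, something as simple as head would do, since a fastq record is four lines (the read count and filenames below are only illustrative):

    # ~200x of a 7.6 kb target at ~1,600 bp mean read length is roughly 1,000 reads
    # (4 lines per fastq record)
    head -n 4000 Reads.fastq > Reads.subset.fastq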

Omer-K commented 6 years ago

Thanks, I'll give it a go and update.

Omer-K commented 6 years ago

I'm trying to use the fastqSample command:

    fastqSample -U file.u.fastq -I 023b484d86c10d61534e6aa274918bdd8700a8be -g 7600 -c 200

There is no error, but I'm not quite sure whether I wrote the command properly. What should I have specified in the -I option? I wrote the run ID, as I thought it was common to all reads in the fastq file, but I must have gotten it wrong.

skoren commented 6 years ago

The -U option is a flag; it doesn't take a parameter, and the file prefix should go to -I. Your command should be:

    fastqSample -U -I file -g 7600 -c 200 -O file_downsampled

This will look for a file named file.u.fastq and write to file_downsampled.u.fastq.
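In other words, something along these lines, assuming your reads are currently in Reads.fastq (the rename is only to match the expected naming):

    # fastqSample expects the input to be named <prefix>.u.fastq
    mv Reads.fastq file.u.fastq
    fastqSample -U -I file -g 7600 -c 200 -O file_downsampled
    # the subsampled reads end up in file_downsampled.u.fastq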

Omer-K commented 6 years ago

OK, so I ran fastqSample with -c 1000 and -c 10000. Then I ran the regular canu command with the parameters:

    GenomeSize=7600 minOverlapLength=100 -nanopore-raw downsampled.u.fastq corOutCoverage=100

I got an error and the script terminated:

Gatekeeper detected problems in your input reads.  Please review the logging in files:
  /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING.err
  /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING/errorLog
If you wish to proceed, rename the store with the following commands and restart canu.

  mv /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING \
     /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.ACCEPTED

Or remove '/home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/' and re-run with stopOnReadQuality=false

I re-ran with 'stopOnReadQuality=false', removed 'corOutCoverage=100', and it worked quite OK.
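For reference, the re-run command was along these lines (a sketch; directory and prefix are as in my original command):

    canu -d ./Folder/ -p GeneName GenomeSize=7600 minOverlapLength=100 \
         stopOnReadQuality=false \
         -nanopore-raw downsampled.u.fastq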

What was the source of the initial error and what does 'stopOnReadQuality' mean?

skoren commented 6 years ago

The error is logged in /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING.err

brianwalenz commented 6 years ago

Documented.

http://canu.readthedocs.io/en/latest/parameter-reference.html#stoponreadquality