Closed: Omer-K closed this issue 6 years ago.
Canu isn't optimized to find overlaps at 100 bp, so that's increasing your compute time. You also have 130,000x coverage, which will lead to lots and lots of overlaps.
I'd suggest randomly subsampling to 100-200x coverage; even if the amplification isn't as specific as you'd wish, you should still have enough data to assemble. Also, use the -fast option available in 1.7.
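To put those numbers in perspective, the coverage and subsampling targets are simple arithmetic (values taken from the read stats quoted in the original post below; the script is just an illustration):

```python
# Back-of-the-envelope coverage math for a ~7.6 kb amplicon.
genome_size = 7_600          # amplicon length (bp)
total_bases = 980_028_664    # total bases in the run
mean_read_len = 1_592.51     # mean read length (bp)

coverage = total_bases / genome_size
print(f"current coverage: {coverage:,.0f}x")

target_cov = 200
target_reads = int(target_cov * genome_size / mean_read_len)
print(f"reads needed for {target_cov}x: about {target_reads:,}")
```

That works out to roughly 129,000x coverage, which is why keeping only a few hundred reads loses nothing for a 7.6 kb target.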
Thanks for the tips!
Do you think keeping the current file (no random sub-sampling) and increasing minOverlapLength to, say, 250 or 500 would speed things up?
Also, I'd make an effort to get the server IT lady to upgrade to 1.7.
I think it would speed up with a larger overlap, but not by much. With 1 Gbp you have as much data as a medium eukaryotic genome assembly, but worse, because the overlaps will be much more frequent and deeper.
You can get the fast functionality in 1.6 by using overlapper=mhap utgReAlign=true.
Thanks. Before you replied I had filtered the reads, keeping only reads > 500 bp with Q > 10. This reduced the number of reads almost by half. After 14.5 hrs, 'cor-mhapCheck' finished with 50 processes and 'ovStoreBuild' started. Since 'cor-mhapCheck' was behind me, I was willing to wait.
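For reference, a length-and-quality filter like the one described (> 500 bp, Q > 10) can be sketched in a few lines of Python. This is a hypothetical standalone snippet using a simple arithmetic mean of Phred+33 scores, not what any particular filtering tool does:

```python
def filter_reads(records, min_len=500, min_q=10):
    """Keep fastq records (header, seq, plus, qual) passing both cutoffs.

    Arithmetic mean of Phred scores is a simplification; dedicated tools
    average error probabilities instead.
    """
    kept = []
    for header, seq, plus, qual in records:
        mean_q = sum(ord(c) - 33 for c in qual) / len(qual)  # Phred+33
        if len(seq) > min_len and mean_q > min_q:
            kept.append((header, seq, plus, qual))
    return kept

# Tiny demo: one long high-quality read, one short read.
reads = [
    ("@read1", "A" * 600, "+", "I" * 600),  # 'I' is Q40 in Phred+33
    ("@read2", "A" * 100, "+", "I" * 100),
]
print(len(filter_reads(reads)))  # 1
```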
Now the problem is that I mistakenly pressed Ctrl+C and terminated canu (facepalm). Is there a way to restart from that point onward, instead of re-executing the whole script?
If you run the same command you used before for canu, it will pick up where it left off automatically.
I tried that. I got an error message:
```
ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: failed to create the overlap store.
ABORT:
ABORT: Disk space available: 106.125 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/Gene.ovlStore.err):
ABORT:
ABORT:
```
I therefore restarted with a filtered fastq file (as I mentioned before) and added the parameters you recommended, overlapper=mhap utgReAlign=true.
Still, it takes a long time.
What's the output in correction/Gene.ovlStore.err for the failed run?
The fast option only affects things post-correction; the correction will take exactly the same time as before. Having >50,000x coverage isn't a common use case.
> What's the output in correction/Gene.ovlStore.err for the failed run?
I've actually deleted the folder produced by this run, as I restarted and had to free-up storage..
> The fast option only affects things post-correction, the correction will take exactly the same time as before. Having >50,000x coverage isn't a common use case.
Then randomly sub-sampling is still a good suggestion, I guess? How would you suggest going about this in an easy manner? I have no specific knowledge of programming other than MATLAB and some Unix command line.
Also, what coverage should suffice?
You could do something as simple as taking the beginning of the fastq input, assuming there is no bias in sequencing time. Canu also has a utility called fastqSample that lets you subsample to a desired coverage. It expects the input to be named file.u.fastq, and you can run fastqSample without options to get the help.
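Taking the beginning of the file is straightforward because each fastq record in this single-line MinION output is exactly four lines; a minimal sketch (hypothetical helper, not part of Canu):

```python
def head_fastq(lines, n_reads):
    """Return the first n_reads fastq records (4 lines per record)."""
    return lines[: 4 * n_reads]

# Demo on a synthetic 3-read fastq.
fastq_lines = []
for i in range(3):
    fastq_lines += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]
print(len(head_fastq(fastq_lines, 2)))  # 8 lines, i.e. 2 records
```

The shell equivalent is `head -n $((4 * N)) file.u.fastq`, which only holds if reads are not line-wrapped.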
Thanks, I'll give it a go and update.
I'm trying to use the fastqSample command:

```
fastqSample -U file.u.fastq -I 023b484d86c10d61534e6aa274918bdd8700a8be -g 7600 -c 200
```
There is no error, but I'm not quite sure I wrote the command properly. What should I have specified in the -I option? I wrote the run ID, as I thought it was common to all reads in the fastq file, but I must've gotten it wrong.
The -U option is a flag, it doesn't take parameters, and you want to give the prefix to -I. Your command should be:

```
fastqSample -U -I file -g 7600 -c 200 -O file_downsampled
```

This will look for a file named file.u.fastq and write to file_downsampled.u.fastq.
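If fastqSample ever isn't available, the same idea, random subsampling to a target coverage, can be sketched in plain Python (hypothetical standalone code, not a re-implementation of fastqSample):

```python
import random

def subsample_to_coverage(records, genome_size, target_cov, seed=42):
    """Randomly pick (header, seq, plus, qual) records until the picked
    bases reach ~target_cov * genome_size."""
    target_bases = target_cov * genome_size
    order = list(range(len(records)))
    random.Random(seed).shuffle(order)  # fixed seed for reproducibility
    picked, bases = [], 0
    for i in order:
        if bases >= target_bases:
            break
        picked.append(records[i])
        bases += len(records[i][1])  # length of the sequence line
    return picked

# Demo: 100 reads of 1 kb each, 5x of a 7,600 bp target = 38,000 bases.
reads = [(f"@r{i}", "A" * 1000, "+", "I" * 1000) for i in range(100)]
sub = subsample_to_coverage(reads, genome_size=7_600, target_cov=5)
print(len(sub), sum(len(r[1]) for r in sub))  # 38 38000
```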
OK, so I ran 'fastqSample' with '-c 1000' and '-c 10000'. Then I ran the regular canu command with the parameters
'GenomeSize=7600 minOverlapLength=100 -nanopore-raw downsampled.u.fastq corOutCoverage=100'.
I got an error and the script terminated:
```
Gatekeeper detected problems in your input reads. Please review the logging in files:
/home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING.err
/home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING/errorLog
If you wish to proceed, rename the store with the following commands and restart canu.
mv /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING \
   /home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.ACCEPTED
Or remove '/home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/' and re-run with stopOnReadQuality=false
```
I re-ran with 'stopOnReadQuality=false', removed 'corOutCoverage=100', and it worked quite OK.
What was the source of the initial error, and what does 'stopOnReadQuality' mean?
The error is logged in '/home/myfolder/20180321_1508_gene_7_first_extraction_basecalled/workspace/pass/gene_downsampled_10000/correction/Gene.116.gkpStore.BUILDING.err'.
I'm using a MinION 1D fastq file that contains an amplicon. I should note that the PCR amplification was probably not as specific as we'd wish. Either way, it's a rather large file:
Number of reads: 615,398
Total bases: 980,028,664
Median read length: 1,386.0
Mean read length: 1,592.51
Read length N50: 1,901
I'm running the following command:
```
canu -d ./Folder/ -p GeneName GenomeSize=7600 minReadLength=500 minOverlapLength=100 -nanopore-raw Reads.fastq corOutCoverage=100
```
Canu version is 1.6 and it's running on a remote Red Hat Enterprise Linux Server release 6.9 (Santiago).
The problem is that (I think) it takes too long to overlap reads. It's been almost two hours, and only two of the mhap overlap jobs have finished.
Would tweaking the parameters make a difference? Would you suggest filtering out lower-quality reads and starting with less data in the fastq file?
Here are the last few lines where it's running currently.
Thanks.