marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

ovlRefBlockSize #1441

Closed fergsc closed 4 years ago

fergsc commented 5 years ago

Hi, I am running into some job walltime limits and am trying to reduce obtOvlRefBlockSize to create more, shorter jobs, but I cannot work out how to set this parameter: ERROR: Parameter 'obtOvlRefBlockSize' is not known. I am using version 1.8 and have tried a number of different capitalisations, e.g. obtOvlRefBlockSize=5000 and obtovlRefBlockSize=5000.

thanks.

skoren commented 5 years ago

RefBlockSize was deprecated as it was not a supported option; you can always get a list of supported options by running canu -options. Instead you'd want obtOvlRefBlockLength, which is the same as Size but in base pairs. Rather than changing this parameter, which requires you to re-run all overlaps from scratch (since you're changing the partitions), I would instead advise you to increase the threads used by each job and perhaps decrease the correctedErrorRate. You can manually edit the overlap.sh file to make these changes and re-run just the jobs that time out.
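
A minimal sketch of those two steps, assuming the usual trimming/1-overlapper layout and the overlapper's -t thread flag; variable names inside overlap.sh differ between Canu versions, so check your own script before editing:

```shell
# confirm which RefBlock* options this Canu release actually supports
canu -options | grep -i refblock

# instead of repartitioning, give the timed-out jobs more threads and rerun
# only those jobs; the path, the -t flag, and the job index are assumptions
# based on a typical 1.8 run directory, not taken verbatim from this run
$EDITOR trimming/1-overlapper/overlap.sh    # raise the -t value passed to the overlapper
cd trimming/1-overlapper && ./overlap.sh 12 # '12' is a hypothetical failed job index
```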

fergsc commented 5 years ago

Thanks, will change the number of threads and see if I can get the jobs to finish within 96 hours.

brianwalenz commented 5 years ago

Is this a long nanopore assembly? There's a known slowdown in obtOvl and utgOvl when reads are very very long. The only solution is to switch to using the faster but less sensitive mhap overlapper. This will also require you to recompute all the obt overlaps (since you're changing the algorithm). One way to check if this is the problem is to monitor actual CPU usage of the obtOvl jobs - if it's down to using just a few threads, it's stuck trying to find alignments between two long reads.
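
If this does turn out to be the slowdown described above, a hedged sketch of switching the trimming and unitigging overlaps to mhap; the option names follow the Canu parameter reference, and the genome size and read file are placeholders:

```shell
# re-run with the faster but less sensitive mhap overlapper for obt and utg;
# as noted above, this discards the existing obt overlaps
canu -p asm -d asm-mhap \
  genomeSize=<your-genome-size> \
  obtOverlapper=mhap utgOverlapper=mhap \
  -nanopore-raw reads.fastq.gz

# rough check for the stuck-on-long-reads symptom: an overlap process using
# far fewer busy threads than it was given (%CPU and thread count per process)
ps -eo pcpu,nlwp,comm | grep -i overlap
```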

fergsc commented 5 years ago

Some summary statistics on my reads, and yes they are nanopore:

Mean read length: 33,109
Median read length: 29,654
Number of reads: 805,078
Read length N50: 35,565
Est. coverage: 41x

I have ~8 related plant species that I am assembling. The main problems with these species are that they have heterozygosity of ~5% and contain a very high amount of tandem repeats. During assembly I want to collapse as little of the highly heterozygous regions as possible, but am happy to collapse the regions with low heterozygosity, so I am playing around with a few strategies and trying to reduce compute time.

skoren commented 5 years ago

There are parameters suggested for heterozygous genomes in the FAQ. Typically, any region over 2% diverged will get separated; at 5% I expect you will get roughly double the genome size. That means you want to treat your genome as if it were really double the haploid genome size, for example when deciding how many reads to correct (corOutCoverage).
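
To make the "treat it as double" point concrete, a sketch with hypothetical numbers; the option names are standard Canu options, but the values are illustrative only:

```shell
# if the haploid size were ~1 Gbp and the 5% divergence keeps haplotypes
# apart, budget for ~2 Gbp and correct more reads than the default 40x
canu -p asm -d asm-het \
  genomeSize=2g \
  corOutCoverage=80 \
  -nanopore-raw reads.fastq.gz
```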

The FAQ also lists some suggested parameters for repetitive genomes. Some of those may help you, especially if you ended up with a very high repeat k-mer threshold. There should be a file matching trimming/0-mercounts/*dump; check the values in the second column and what the minimum is. If it's very high (over 1000), then adding the repeat options from the FAQ should help turn it down and save on overlapping time too.
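
A quick way to check that minimum, assuming the dump file is whitespace-separated with the count in the second column:

```shell
# print the smallest value seen in column 2 across the dump file(s)
awk 'NR == 1 || $2 + 0 < min { min = $2 + 0 } END { print "minimum:", min }' \
    trimming/0-mercounts/*dump
```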

fergsc commented 5 years ago

I found these settings to use, which are described as suppressing repeats. Do these have little or no effect on the final assembly? The minimum values in the dump file are ~450, so I guess they are not needed.

corMhapFilterThreshold=0.0000000002
corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50"
mhapMemory=60g
mhapBlockSize=500
ovlMerDistinct=0.975

I am currently running canu with the 'Avoid collapsing the genome' settings and will be able to compare to my previous run, which was performed with the default settings.
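
For reference, a sketch of how those settings might look on the command line; the corOutCoverage/batOptions values are quoted from memory of the FAQ's 'Avoid collapsing the genome' recipe and should be checked against the FAQ for the release being run, and the genome size and read file are placeholders:

```shell
canu -p asm -d asm-nocollapse \
  genomeSize=<your-genome-size> \
  corOutCoverage=200 \
  "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
  -nanopore-raw reads.fastq.gz
```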

thanks for the advice.

skoren commented 5 years ago

Yes, we've seen that it doesn't significantly impact assembly, plus if you can't get an assembly with the defaults then you don't have much of a choice anyway.

Given your threshold, the parameters aren't likely to make much difference in speed on your genome. That essentially means you are back to @brianwalenz's advice to use the faster overlapper or wait. You could also increase the cores given to each individual job, or slightly decrease the identity (the default was 14% for older releases; you would probably be OK going down to 12%, or even 10% for more recent nanopore data).
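
A sketch of those two knobs as command-line overrides; the thread counts are arbitrary examples, correctedErrorRate=0.12 corresponds to the 12% identity mentioned above, and the option names should be confirmed with canu -options for the release in use:

```shell
canu -p asm -d asm-faster \
  genomeSize=<your-genome-size> \
  obtOvlThreads=16 utgOvlThreads=16 \
  correctedErrorRate=0.12 \
  -nanopore-raw reads.fastq.gz
```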

skoren commented 4 years ago

Idle.