marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

How much memory? #193

Closed RxLoutre closed 8 years ago

RxLoutre commented 8 years ago

Hi,

I'm having a lot of trouble with trying to run canu.

My PacBio raw reads file is 82 GB, in fastq format.

I've tried several times to launch canu with the following command:

canu -p susuzkii_canu -d /media/DATAPART1/Documents/suzukii_assembly/canu_assembly genomeSize=150000000 -pacbio-raw /media/DATAPART1/Documents/suzukii_assembly/pacbio_reads/filtered_subreads.fastq

DATAPART1 is a 2 TB partition of my hard drive, and even after freeing more and more space, Canu fails each time (at a different stage) because there is not enough space on DATAPART1. Last time I tried, it took more than 500 GB of disk space. I don't know how much free space I'm supposed to keep... but I thought 500 GB would be enough.

I've wasted a lot of time: the first stages of the assembly took about 2 days, and then it failed.

So I have two questions:

Thanks for your help,

Roxane

skoren commented 8 years ago

The disk space is determined by genome size and repeat content, not so much by the input fastq size, since the largest user of disk space is storing the pairwise overlaps between the raw sequences. For a human-sized genome, the peak disk usage is about 3 TB. For an Arabidopsis or fruit fly genome (about the same size as yours) the peak disk usage is about 300-600 GB, so you're probably close, but I'd recommend 1 TB to be safe.

It is possible to remove some intermediate outputs early during your run to save space; post a file listing of your assembly folder and of each subfolder (correction, trimming, unitigging, whichever you have). To resume a failed run, just re-run the same command: Canu will keep any valid outputs and continue on to the next step.
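
For example, something like this (using the paths from your command; the subdirectory names are the standard canu stage folders) would show which directories are taking the most space:

    cd /media/DATAPART1/Documents/suzukii_assembly/canu_assembly
    du -sh correction trimming unitigging 2>/dev/null
    # the raw pairwise overlap outputs are usually the biggest consumer
    du -sh correction/1-overlapper/results 2>/dev/null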

RxLoutre commented 8 years ago

Hi skoren !

So after a few tries, and after buying an empty 2 TB external hard drive specifically to have enough space to store all the canu results, it keeps failing again and again because of disk space. I'm trying to assemble a Drosophila genome and you said it's supposed to produce about 300-600 GB of data... but it completely fills my whole external hard drive! The correction step takes ages each time (more than 2 days) and I'm starting to wonder if I'm doing all of this right... I'm kind of desperate...

I really need your help...

Here is my command line: canu -p suzukii -d /media/loutre/SUZUKII/canu_assembly genomeSize=150000000 -pacbio-raw /media/DATAPART1/Documents/suzukii_assembly/pacbio_reads/filtered_subreads.fastq

Here is the latest log:

 -- Finished on Wed Jul 20 23:28:11 2016 (118943 seconds) with 1298.7 GB free disk space
----------------------------------------
-- Found 336 mhap overlap output files.
----------------------------------------
-- Starting command on Wed Jul 20 23:28:12 2016 with 1298.7 GB free disk space

    /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild \
     -O /media/loutre/SUZUKII/canu_assembly/correction/suzukii.ovlStore.BUILDING \
     -G /media/loutre/SUZUKII/canu_assembly/correction/suzukii.gkpStore \
     -M 2-8 \
     -L /media/loutre/SUZUKII/canu_assembly/correction/1-overlapper/ovljob.files \
     > /media/loutre/SUZUKII/canu_assembly/correction/suzukii.ovlStore.err 2>&1

-- Finished on Thu Jul 21 07:25:50 2016 (28658 seconds) with 0 GB free disk space  !!! WARNING !!!
----------------------------------------
ERROR:
ERROR:  Failed with exit code 134.  (rc=34304)
ERROR:
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

Disk space available:  0 GB

Last 50 lines of the relevant log file (/media/loutre/SUZUKII/canu_assembly/correction/suzukii.ovlStore.err):

       151567478 SAVE  - overlaps output (for dedupe)

               0 ERATE - low quality, more than 0.409 fraction error

               0 OBT   - not requested
         3055500 OBT   - too similar
          903886 OBT   - too short

               0 DUP   - dedupe not requested
               0 DUP   - different library
               0 DUP   - obviously not duplicates
bucketizing /media/loutre/SUZUKII/canu_assembly/correction/1-overlapper/results/000308.ovb.gz
safeWrite()-- Write failure on ovFile::writeOverlap: No space left on device
safeWrite()-- Wanted to write 261632 objects (size=4), wrote 59904.
ovStoreBuild: AS_UTL/AS_UTL_fileIO.C:81: void AS_UTL_safeWrite(FILE*, const void*, const char*, size_t, size_t): Assertion `(*__errno_location ()) == 0' failed.

Failed with 'Aborted'

Backtrace (mangled):

/home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild[0x40769d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f9ac60b4330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f9ac5d15c37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f9ac5d19028]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fbf6)[0x7f9ac5d0ebf6]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fca2)[0x7f9ac5d0eca2]
/home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild[0x406940]
/home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild[0x412acc]
/home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild[0x4036f5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f9ac5d00f45]
/home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild[0x404349]

Backtrace (demangled):

[0] /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild() [0x40769d]
[1] /lib/x86_64-linux-gnu/libpthread.so.0::(null) + 0x10330  [0x7f9ac60b4330]
[2] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x37  [0x7f9ac5d15c37]
[3] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x148  [0x7f9ac5d19028]
[4] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x2fbf6  [0x7f9ac5d0ebf6]
[5] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0x2fca2  [0x7f9ac5d0eca2]
[6] /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild() [0x406940]
[7] /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild() [0x412acc]
[8] /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild() [0x4036f5]
[9] /lib/x86_64-linux-gnu/libc.so.6::(null) + 0xf5  [0x7f9ac5d00f45]
[10] /home/loutre/canu-1.3/Linux-amd64/bin/ovStoreBuild() [0x404349]

GDB:

Aborted (core dumped)

canu failed with 'failed to create the overlap store'.
brianwalenz commented 8 years ago

With 82GB of input fastq, you've got roughly 41Gbp of input sequence. Genome size is set to 150m, which means you've got 273x coverage.
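
To spell out that arithmetic (a rough sketch; a fastq file is roughly half sequence and half quality values, so ~82 GB of fastq is ~41 Gbp of bases):

    echo "41000000000 / 150000000" | bc -l    # ~273x coverage of a 150 Mbp genome
    echo "75 * 150000000" | bc -l             # ~11.25 Gbp of the longest reads would give ~75x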

What does the gatekeeper histogram of read lengths look like? Based on that, set a reasonably aggressive minReadLength to get, say, 75x coverage. You'll also probably want to set minOverlapLength to 1/4 to 1/2 the minimum read length (the default is 500, which will be far too short now). And you'll need to set stopOnReadQuality=0 to get past an initial check on read quality (because you've set the minReadLength to discard a large fraction of the input, and canu won't like that).

RxLoutre commented 8 years ago

I set the genome size to 150m, but it's an approximation (the D. melanogaster genome size) because we don't really know the exact D. suzukii genome size. Could that be a problem?

Thanks for your answers. Where can I find this histogram? Is it the "suzukii.ms16.histogram" file? Also, before I try another run with these parameters, can I reuse some of my previous results, or do I have to start all over again?

brianwalenz commented 8 years ago

The histogram is reported in (a) the canu screen output, (b) correction.html and (c) inside the *gkpStore directory. The first one is easiest to post here (it's text and annotated), but the second one can be attached to the issue (not attached to an email reply).

Getting the genome size wrong changes how many reads are output from correction. If it is too small by a large amount you'd get far too few reads, and the assembly won't work.

I'm expecting that about 2/3 of the reads will be filtered out (for being too short), and this will dramatically reduce the number of overlaps and the run time. It is possible to filter the results you have, but it'll be quicker to recompute. Let's look at that histogram first.

RxLoutre commented 8 years ago

Hi, sorry for being absent for a while.

Here is the correction.html output! correction.txt

brianwalenz commented 8 years ago

Sadly, no images saved in that file, only the total bases and total number of reads.

Do you have suzukii.gkpStore/readlengths.txt? This is just a dump of the length of each read. You can plot this as a histogram using your favorite software.

Do you have suzukii.gkpStore/readlengthhistogram.txt? This is a coarse grained text histogram, that should be good enough for deciding a cutoff.

Our goal is to find a length such that you have about 75x in reads above the length.

RxLoutre commented 8 years ago

readlengthhistogram.txt

Hi Brian,

So yes, I have these two files (I couldn't upload the first one because it was too big). I will try to make a histogram in R with readlengths.txt and then show you the results. Or maybe the second file is enough to decide which length to choose?

Sorry to bother you again, but just to be sure: we discard very short reads in order to be faster, is that right? But how did you know that we should go from almost 300x coverage down to 75x? Isn't it a big loss to discard that many reads? Or can we maybe use a polishing step later to make use of all the reads?

Thanks again for your answers,

Cheers,

Roxane

RxLoutre commented 8 years ago

       0    999       0
    1000   1999  404288
    2000   2999  403054
    3000   3999  396397
    4000   4999  397424

So, as I understand it, these lines mean that I have 0 reads with a length between 0 and 999 bp, 404,288 reads with a length between 1000 and 1999 bp, etc.

The biggest reads are between 67,000 and 67,999 bp; there are two of them.

What should I consider a short read? I now totally understand why 500 was way too short, but I couldn't really say what number minOverlapLength should be.

You said about 1/4 to 1/2 of the minimum read length, but with the current defaults, 500 is already in that range.

brianwalenz commented 8 years ago

Perfect! Those look like nice reads.

We want to drop the short reads because they won't add anything to the assembly, and using more than 50x to 100x of reads - I don't have any actual numbers to back this up - won't help correction any either. All they do is make more data to slog through. It seems a waste, but the only way to get enough long reads for assembly is to generate even more short reads.

Once you get an assembly, you can polish it with quiver. Even there, it's probably overkill to use all the data - it can be done, but you'll have the same problem, just too slow and too big.

For the min read length:

sort -nr readlengthhistogram.txt | awk '{ sum += ($1+$2)/2 * $3; cnt += $3; print $1,$2,$3,sum/150000000,cnt; }'

21000 21999 23328 16.2103 98538
20000 20999 30855 20.427 129393
19000 19999 39594 25.5741 168987
18000 18999 51355 31.9077 220342
17000 17999 66509 39.6669 286851
16000 16999 85360 49.0562 372211
15000 15999 108521 60.2697 480732
14000 14999 138335 73.6416 619067
13000 13999 177367 89.6041 796434
12000 12999 228019 108.605 1024453
11000 11999 298156 131.463 1322609
10000 10999 383155 158.282 1705764

This shows that reads longer than about 12,500 bp will give you 100x coverage, and reads longer than 16,000 bp will give you 50x. A cutoff of 14,000 is almost 75x.

Setting minReadLength=14000 and minOverlapLength=7000 will definitely speed things up. You'll get a slightly better/worse assembly with the 100x or 50x points, but I can't guess how much better/worse or how much slower/faster or how much more/less disk space will be needed.

You can do the same type of analysis on the actual read lengths if you want to nail down the exact minimum read length needed to get some specific level of coverage.
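
For example, something along these lines (assuming readlengths.txt is one length per line, as a per-read dump; adjust the column if it also contains read IDs) would print the length at which cumulative coverage first reaches 75x:

    sort -nr suzukii.gkpStore/readlengths.txt \
      | awk '{ sum += $1; if (sum / 150000000 >= 75) { print "cutoff:", $1, "reads kept:", NR; exit } }'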

RxLoutre commented 8 years ago

Wow, great ! Thanks for the answers, I'm going to try this now.

Okay, I see; it makes sense that we end up with more short reads when asking for longer reads.

So I need to set -minReadLength=14000, -stopOnReadQuality=0, and -minOverlapLength=7000, is that correct?

Also, I think I've underestimated my genome size; it is more likely to be 220M than 150M.

Let's try this !

RxLoutre commented 8 years ago

Hi Brian! So my first try with those parameters has just finished:

    canu -p suzukii -d /media/loutre/SUZUKII/canu_assembly genomeSize=220000000 -pacbio-raw /media/DATAPART1/Documents/suzukii_assembly/pacbio_reads/filtered_subreads.fastq -minReadLength=14000 -stopOnReadQuality=0 -minOverlapLength=7000

I tried to compare the canu results (stored in suzukii.contigs.fasta, if I'm right) with some other D. suzukii assemblies:

quast_canu_falcon_ngs

So, attached here are the results of the QUAST analysis. It seems the canu assembly is not that good on these metrics. I think the N50, the largest contig, and the number of contigs could be better.

Should I try different parameters? Should I try to polish it or something, to see if the results get better?

Thanks for your answers !

Cheers,

Roxane

skoren commented 8 years ago

It's possible the filters on read length and overlap length were too aggressive to get a good assembly. You can check the read length file in trimming/suzukii.gkpStore/readlengths.txt; that will tell you how long your corrected reads are. I would guess you ended up with too little coverage for a good assembly. It is also possible to adjust parameters if your genome is heterozygous.
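
For a quick check of how much corrected coverage is left, something like this (again assuming one read length per line, and using your updated 220 Mbp genome size) would do:

    awk '{ sum += $1 } END { printf "total bases: %.0f  coverage: %.1fx\n", sum, sum / 220000000 }' \
        trimming/suzukii.gkpStore/readlengths.txt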

If you are able to share the data, you can upload it to ftp://ftp.cbcb.umd.edu/incoming/sergek and we can take a look at it locally.

RxLoutre commented 8 years ago

readlengthhistogram.txt

Hi skoren! Attached is the read length histogram after trimming, so yes, I suppose we lost a lot of reads. This morning I launched the same analysis but with a minReadLength of 7000 and a minOverlapLength of 3500, just to see how it affects the assembly quality. What do you think about it?

Cheers,

Roxane

skoren commented 8 years ago

I would keep the overlap size relatively low, 1000, and set the minimum read length to 7000 or 10000 (whichever fits on your file system).
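
For example, with those values and the same paths as your earlier runs, the command would look something like this (a sketch only: shown with the 10000 cutoff, keeping stopOnReadQuality=0 from before since the cutoff still discards a large fraction of the reads, and writing the options in the documented key=value form):

    canu -p suzukii -d /media/loutre/SUZUKII/canu_assembly \
      genomeSize=220000000 \
      minReadLength=10000 minOverlapLength=1000 stopOnReadQuality=0 \
      -pacbio-raw /media/DATAPART1/Documents/suzukii_assembly/pacbio_reads/filtered_subreads.fastq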

RxLoutre commented 8 years ago

Thanks for your help! I'm going to try with these parameters.