marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

What parameter determines the MHAP overlap file size? #334

Closed ruiguo-bio closed 7 years ago

ruiguo-bio commented 7 years ago

The MHAP overlap file size varies with the input, and I would like to know why. I set genomeSize to 200M for two different FASTA inputs, each containing 1500000 sequences. For one input the MHAP files average 400M each, but for the other they average 32G each. Why are they so big? Is there a way to reduce the size of a single MHAP file?

brianwalenz commented 7 years ago

Depth of coverage in input reads is first, and repeat content in the reads is second. mhapSensitivity might have a minor effect, but mostly, there is no parameter - the file size is the number of overlaps in the reads. More depth of coverage means more overlaps.

The genome size parameter merely lets us compute, given the number of bases in the input sequences, what the expected depth of coverage is. We use that for setting some parameters.
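The relationship described here (expected depth of coverage is the number of input bases divided by the genome size) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_genome_size` and `expected_coverage` are hypothetical helpers, not canu's own code, and the read lengths in the usage lines are made-up values.

```python
def parse_genome_size(text):
    """Convert a size string such as '200m' or '200M' into a number of bases.

    Hypothetical helper for illustration; not canu's own parser.
    """
    suffixes = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in suffixes:
        return round(float(text[:-1]) * suffixes[text[-1]])
    return round(float(text))


def expected_coverage(read_lengths, genome_size):
    """Expected depth of coverage: total input bases divided by genome size."""
    return sum(read_lengths) / genome_size


# Illustrative inputs only: two sets of 1,500,000 reads with assumed lengths.
genome = parse_genome_size("200M")
long_reads = [17_000] * 1_500_000
short_reads = [3_000] * 1_500_000
print(expected_coverage(long_reads, genome))   # 127.5
print(expected_coverage(short_reads, genome))  # 22.5
```

With the same genomeSize and the same number of sequences, the two inputs can still differ several-fold in depth of coverage, which is the first factor named above.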

ruiguo-bio commented 7 years ago

Thank you. I'm not sure yet.

Do you mean that if I set corOutCoverage=400 and corMinCoverage=0, genomeSize will not affect the MHAP file size? I only want to run up to the overlap step, so genomeSize should not matter, but the resource requirements seem to depend on that parameter, and sometimes canu reports that there is not enough memory. Could you add a control so that genomeSize does not require so much memory when the genome is not being assembled? Also, I find 1.4 slower than 1.3 when computing overlaps.

Does the read length distribution matter? Suppose two files have the same size, but one contains reads whose lengths are normally distributed with a mean of 17000bp, while the other is right-skewed around 3000bp and contains a lot of reads under 2000bp. Will the MHAP file sizes differ?

brianwalenz commented 7 years ago

You can explicitly set memory limits for each component with the various 'memory' parameters - e.g., batMemory=32g. I think you're only correcting reads, but canu is still trying to find a place for bogart to run even though it won't ever run. It is a good suggestion to not care about bogart if only correcting reads, but not something easily accomplished.

The mhap output size is proportional to the number of overlaps found. The number of overlaps is mostly related to depth of coverage, but repeat content can add LOTS of overlaps. If those two files are from different organisms, then nothing can be said about the number of overlaps found - both depth of coverage and repeat content will be different. If they are from the same organism, then both depth of coverage and repeat content will be constant, and I'd expect the smaller reads to have more overlaps simply because there are more reads. Smaller reads also mean that some reads could be entirely repeat, and will have overlaps to MANY other reads, where a non-repetitive read will have only 2N overlaps, where N=depth-of-coverage.
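As a rough back-of-the-envelope sketch of the reasoning above, the following Python estimates an overlap count from the "2N overlaps per non-repetitive read" rule. It is a crude model, not how canu or MHAP computes anything: `estimate_overlaps`, `repeat_read_fraction`, and `overlaps_per_repeat_read` are hypothetical names, the repeat term is a placeholder for "MANY", and the count tallies overlaps per read, so each overlap is seen from both of its reads.

```python
def estimate_overlaps(num_reads, total_bases, genome_size,
                      repeat_read_fraction=0.0, overlaps_per_repeat_read=0):
    """Crude estimate of overlaps, counted once per participating read.

    Non-repetitive reads contribute 2 * depth overlaps each; reads that are
    entirely repeat contribute an assumed, user-supplied number instead.
    """
    depth = total_bases / genome_size
    unique_reads = num_reads * (1.0 - repeat_read_fraction)
    repeat_reads = num_reads * repeat_read_fraction
    return unique_reads * 2 * depth + repeat_reads * overlaps_per_repeat_read


# Same organism, same total bases (so same depth), different read lengths.
# All numbers are illustrative assumptions.
total_bases = 4_500_000_000
genome_size = 200_000_000
long_estimate = estimate_overlaps(total_bases // 17_000, total_bases, genome_size)
short_estimate = estimate_overlaps(total_bases // 3_000, total_bases, genome_size)
print(short_estimate > long_estimate)  # True: more reads, more overlaps
```

Raising `repeat_read_fraction` with a large `overlaps_per_repeat_read` quickly dominates the total, which matches the point that repeat content can add far more overlaps than depth alone.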

ruiguo-bio commented 7 years ago

Thank you. The raw MHAP files are quite big when I set saveOverlaps=true, but the ovl.gz files are quite small. I would like to keep the ovl.gz files and delete the mhap files when saveOverlaps=true, or through some trick. Anyone who needs the original mhap file could convert the ovl.gz file back to mhap. But if I don't set saveOverlaps=true, both types of overlap file are deleted.

brianwalenz commented 7 years ago

Overlaps are reported twice here. Once in the original ASCII mhap format, and once in the binary format the assembler uses to build the 'ovlStore' database of overlaps. If saveOverlaps=false, these are removed once the 'ovlStore' exists.

overlapConvert will read the ovb and output ASCII overlaps. I suspect '-coords' will be the only useful output. NOTE that for each pair of reads, A and B, only ONE overlap is saved. It could be listed as either:

A B (overlap data)

or

B A (overlap data)

From the store, ovStoreDump will output the overlaps. It will report both the 'A B' and 'B A' overlap.
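When working with output where each pair appears only once, in either order, a lookup keyed on the unordered pair avoids missing an overlap because it was written as 'B A' rather than 'A B'. The sketch below assumes the records have already been parsed into `(a_id, b_id, data)` tuples; that tuple layout and the function names are hypothetical, and the overlap data is treated as an opaque value rather than re-oriented.

```python
def pair_key(a, b):
    """Return an order-independent key for a pair of read IDs."""
    return (a, b) if a <= b else (b, a)


def index_overlaps(records):
    """Build a dict from unordered read pair to its overlap data.

    `records` is an iterable of (a_id, b_id, data) tuples (assumed layout).
    """
    index = {}
    for a, b, data in records:
        index[pair_key(a, b)] = data
    return index


def lookup(index, a, b):
    """Find the overlap between reads a and b, whichever order it was saved in."""
    return index.get(pair_key(a, b))


# Illustrative records: the first is stored as 'B A', the second as 'A B'.
idx = index_overlaps([(7, 3, "overlap-x"), (1, 2, "overlap-y")])
print(lookup(idx, 3, 7))  # overlap-x
print(lookup(idx, 2, 1))  # overlap-y
```

Any coordinates inside the overlap data remain relative to the order in which the record was written, so code that uses them still needs to know which read was listed first.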