miniasm for large genome with daligner overlaps

MichelMoser commented 8 years ago

Hi, I just ran a falcon assembly on a large plant genome (1.4 Gb) using 50 smrtcells. I kept the .las files and was wondering if i could use them with your new las2paf.pl script. Or would you rather compute the overlaps from scratch?

Is a similar thing possible from Celera/PBcR assemblies? So far i couldnt find overlap information stored for the error-correction.

I was also wondering what resources miniasm needs in terms of memory and time to finish on a large dataset. Or would you advise miniasm not to use on such data yet? Thanks, Michel

lh3 commented 8 years ago

da2paf.pl also needs DB files to get the read lengths as I don't know for now how to get length info from .las files. If you have both .db and .las files, you can convert las to paf with:

ls *.las | xargs -i LAdump -cd reads.db {} | da2paf.pl <(DBdump -rh reads.db) | gzip -1 > all-vs-all.paf.gz

You can also use minimap to compute overlaps. It should be much faster than daligner with RAM capped at 20-30GB by default. I don't know how much RAM miniasm may take. It depends on how much repeat the genome has. I haven't tried miniasm on genomes larger than ~100Mb.

As to PBcR, do you want to use PBcR overlaps for miniasm or to use daligner overlaps for PBcR?

MichelMoser commented 8 years ago

Thank you, i would like to use PBcR overlaps for the miniasm. Is it possible?

As for using minimap for overlap computation, i dont know if you loose alot of overlaps because it is not as sensitive as daligner or mhap. Thats why i would rather go for mhap or daligner.

lh3 commented 8 years ago

daligner at the default setting is more sensitive than minimap, but mhap might not be. Between minimap and mhap, miniasm usually works better with minimap.

PBcR/Falcon both do two rounds of all-vs-all read alignment: between raw reads and between error corrected reads. PBcR uses mhap for raw reads overlaps and a celera overlapper for corrected reads, if I am right. I think PBcR throws away overlaps by default, so you can't feed PBcR overlaps to miniasm.

MichelMoser commented 8 years ago

Thank you, thats also what i have heard about the two assemblers.

One last question: Is there the possibility to create a consensus step after assembling with miniasm? I thought about remapping reads to the assembly using dalinger and do some sort of consensus calling. Or maybe use blasr and quiver although i dont know if blasr can handle the still high amount of errors between reads and assembly.

lh3 commented 8 years ago

I know someone has been able to run quiver on miniasm assembly, but I have not done that myself, so I am unable to provide more information. Sorry.

MichelMoser commented 8 years ago

Can you give me the email or name of that person so i could contact him? my email would be: michel.moser @ ips.unibe.ch Thank you!

MichelMoser commented 8 years ago

Hi, another short question about miniasm: is the whole file of reads.paf.gz actually loaded into memory for processing? If this is the case i will struggle to assemble my 406 Gbyte paf.gz file :)

lh3 commented 8 years ago

Ask @lexnederbragt about running quiver on miniasm output.

When reading paf.gz, miniasm applies additional filtering, uses binary encoding and converts read names to integers. These make the memory footprint much smaller than the the uncompressed paf file. That said, paf.gz is compressed. I don't know how much RAM miniasm will take in your case. Better try and prepare to kill miniasm if it takes too much RAM.

BTW, is this paf.gz converted from the daligner output?

lexnederbragt commented 8 years ago

On running quiver for large genomes with many SMRTcells: see https://github.com/PacificBiosciences/pbalign/wiki/Tutorial:-How-to-divide-and-conquer-large-datasets-using-pbalign. Quiver is then 'simply' run as:

quiver -j 24 out.cmp.h5 -r ../unpolished.fasta -o variants.gff -o consensus.fasta

(-j sets the number of cores quiver can use)

iqbal-lab commented 8 years ago

When you say 'simply' - is it a long compute job?

MichelMoser commented 8 years ago

miniasm was killed. probably using to much memory. i used minimap to create the paf.gz. But will soon have the .las files from daligner and falcon (currently still running).

@lexnederbragt i am just worried to create the out.cmp.h5 files using the low-quality.fasta assembly from miniasm. Can blasr actually handle alignments of pacbio-reads vs pacbio-reads? Because thats basically what you do if you try to feed miniasm-assembly to quiver.

lh3 commented 8 years ago

@MichelMoser Thanks a lot for trying! I probably need to add a repeat filtering step. I suspect most hits are repeats. How long does minimap take to find overlaps? I have not tried it on such large data set, so would like to get a sense about the speed. Thanks again.

MichelMoser commented 8 years ago

It took minimap about 18 hours to find all the overlaps between 5,852,650 reads (52.6 Gbases). Thats really quick! Could you parse the paf.gz in junks for the first step of overlap-filtering? I would really like to make this work with my data.

lh3 commented 8 years ago

@MichelMoser Thanks a lot. In minimap output, the hits of a read are not contiguous in the output file. Miniasm has to sort all hits of a read together. It currently uses an internal sort to achieve this. That is why it has to load all hits into RAM. The following steps before graph cleaning don't need to load all hits into RAM in principle, but making all these happen require quite a lot of code refactoring.

I will certainly think about how to make miniasm work with huge genomes, but unfortunately, it won't happen soon. Sorry. Miniasm started out as an experiment, when I had little idea about whether correction-free assembly would work in the first place. I had not paid enough attention to scalability.

lexnederbragt commented 8 years ago

Things are never simple in bioinformatics...

I successfully polished a miniasm E coli contig with pbalign and quiver. It took two rounds to get to a (pretty) decent quality. I hope to share detail soon(ish). But larger genomes may be more difficult (due to long repeats, for example).

gringer commented 7 years ago

@lh3 the linux coreutils sort command creates temporary files when the sorted structure cannot be fit into memory; it's also pretty fast. Have you considered using that, or implementing a similar sort algorithm?

lh3 / miniasm

miniasm for large genome with daligner overlaps #3