marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Assertion error in MBG #126

Closed ad3002 closed 1 year ago

ad3002 commented 1 year ago

I got an assertion error running verkko installed from conda (default parameters, HiFi + Oxford Nanopore input):

176456 unitigs after resolving
Building unitig sequences
Reading sequences from ../hifi-corrected.fasta.gz
MBG: src/ConsensusMaker.h:65: void ConsensusMaker::addStrings(size_t, size_t, size_t, F) [with F = addCounts(ConsensusMaker&, const SequenceCharType&, const SequenceLengthType&, const string&, size_t, size_t, size_t, size_t, size_t, bool)::<lambda(size_t)>; size_t = long unsigned int]: Assertion `compressedSequences[realUnitig].get(realOff) == 0 || compressedSequences[realUnitig].get(realOff) == compressed' failed.
Aborted (core dumped)

The failing assertion is here: https://github.com/maickrau/MBG/blob/4c5f27cc6c17369706d5a697687d715deb8b657d/src/ConsensusMaker.h#L65

skoren commented 1 year ago

This looks like the same problem as #97 and #83, which is caused by a duplicate read in the MBG input. We have not yet had a test set that reproduces it locally. Could you check hifi-corrected.fasta.gz, as well as the original input HiFi data, for duplicate reads? If you are able to share your input HiFi data so we can reproduce the error locally, see this page for how to send it to us: https://canu.readthedocs.io/en/latest/faq.html#how-can-i-send-data-to-you

ad3002 commented 1 year ago

Yes, there were several duplicated names/sequences, and I found them in the BAM file provided by the sequencing facility too. After I removed them, verkko finished without any errors. I think an easy solution would be to add a warning about potential duplicates to the README. A slightly more involved solution would be to implement a sanity check for duplicates in the input reads.
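For anyone hitting the same error, the removal step could look roughly like the sketch below. It is only an illustration, not what verkko or MBG does internally: it assumes FASTA input, treats the first whitespace-delimited token of the header as the read ID, and keeps the first record for each ID. The function names (`open_text`, `read_fasta`, `drop_duplicate_reads`) are made up for this example.

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_duplicate_reads(records):
    """Keep the first record per read ID; return (kept_records, dropped_ids)."""
    seen, kept, dropped = set(), [], []
    for header, seq in records:
        parts = header.split(None, 1)
        # Assumption: the read ID is the first whitespace-delimited token.
        read_id = parts[0] if parts else ""
        if read_id in seen:
            dropped.append(read_id)
            continue
        seen.add(read_id)
        kept.append((header, seq))
    return kept, dropped
```

To run it on a real file, pass `read_fasta(open_text("reads.fasta.gz"))` to `drop_duplicate_reads` (the file name is a placeholder). The sketch holds all kept records in memory, so a large HiFi set would need a streaming writer instead.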

skoren commented 1 year ago

Note that verkko only keeps the read name up to the first space, so if read names differ only after the space, they will still look like duplicates.
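As a rough illustration of the truncation described above, the sketch below groups headers by the part before the first space and reports any group with more than one member. It is not the check added to verkko; the helper names and the example headers are hypothetical.

```python
from collections import defaultdict

def truncated_id(header):
    """Return the header text up to the first space, without a leading '>'."""
    return header.lstrip(">").split(" ", 1)[0]

def find_collisions(headers):
    """Map each truncated ID to its full headers, for IDs seen more than once."""
    groups = defaultdict(list)
    for header in headers:
        groups[truncated_id(header)].append(header)
    return {rid: full for rid, full in groups.items() if len(full) > 1}

# Hypothetical headers: the first two differ only after the space.
example = ["read/1/ccs np=5", "read/1/ccs np=9", "read/2/ccs"]
collisions = find_collisions(example)
```

Here `collisions` contains a single entry for `read/1/ccs`, because both of its headers collapse to the same name once the text after the space is dropped.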

I just pushed a fix to check for this in deaad6e. This should be in the next release. I am also going to close this issue along with #97 and #83, but please re-open if you hit this error in a case where the input had no duplicates.