Puzzled with the error after running idba_ud

cczszb commented 6 years ago

Hi Yu Peng, thank you for publishing idba_ud. It is a very useful tool that I want to use in my project. However, I ran into a problem. My command is "idba_ud -l ../PE.fa --mink 27 --maxk 117 --step 10 --num_threads 2 --pre_correction -o idba_scaffold --no_bubble". Running it fails with the error "terminate called after throwing an instance of 'std::logic_error' what(): SequenceReader::SequenceReader() istream is invalid". I am very puzzled by this and hope you can help. I am looking forward to your reply. Thank you very much! Best wishes!

rotoscan commented 5 years ago

Hello @cczszb, I am getting the same error as you. I will add more details below.

I performed a pre-assembly (contigging) with MEGAHIT and used those contigs as input for IDBA. I use an HPC cluster for my assemblies, so I have to request resources before each run.

For this assembly, I requested 350 hours of calculation, 20 computational cores and 14GB of memory (per slot, summing up to 280 GB).

On the std.err file I get the following:

$ more logs/idba.err 
terminate called after throwing an instance of 'std::logic_error'
  what():  SequenceReader::SequenceReader() istream is invalid
/usr/local/uge/8.5.5-1/default/spool/datascience1/job_scripts/4730942: line 18: 135570 Aborted (core dumped) idba -l $1 --num_threads ${NSLOTS:-1} -o $2

On the std.out:

First arg: ~/libraries/A/megahit_pml/final.contigs.fa
Second arg: ~/libraries/A/idba
beginning idba scaffolding
Mon Jul  9 22:57:57 CEST 2018
number of threads 20
reads 0
long reads 2123607
extra reads 0
read_length 0
kmer 20
kmers 359656929 363756230
merge bubble 49493
contigs: 2411939 n50: 165 max: 2581 mean: 67 total length: 162066564 n80: 33
aligned 0 reads
confirmed bases: 0 correct reads: 0 bases: 0
kmer 30
kmers 333326542 330965296
merge bubble 7934
contigs: 559378 n50: 343 max: 4372 mean: 319 total length: 178960870 n80: 249
aligned 0 reads
confirmed bases: 0 correct reads: 0 bases: 0
kmer 40
kmers 168393798 167922123
merge bubble 165
contigs: 455324 n50: 376 max: 4372 mean: 365 total length: 166497354 n80: 273
aligned 0 reads
confirmed bases: 0 correct reads: 0 bases: 0
kmer 50
kmers 148093826 147704437
merge bubble 16
contigs: 414928 n50: 390 max: 4372 mean: 384 total length: 159692498 n80: 285
finishing idba scaffolding

My command was:

idba -l $1 --num_threads 20 -o $2

#$1 here stands for the First arg declared on the std.out (the MEGAHIT output)
#$2 stands for the Second arg declared on the same std.out

On my output folder I got:

brizolat@frontend2|09:17:24|Di Jul 10| ~/libraries/A/idba
$ ls -lh
total 4,0G
-rw-r--r--+ 1 brizolat eve_umbmsb    0  9. Jul 23:41 align-20
-rw-r--r--+ 1 brizolat eve_umbmsb    0 10. Jul 00:05 align-30
-rw-r--r--+ 1 brizolat eve_umbmsb    0 10. Jul 00:16 align-40
-rw-r--r--+ 1 brizolat eve_umbmsb    0  9. Jul 22:57 begin
-rw-r--r--+ 1 brizolat eve_umbmsb 253M  9. Jul 23:41 contig-20.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 194M 10. Jul 00:05 contig-30.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 178M 10. Jul 00:16 contig-40.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 170M 10. Jul 00:25 contig-50.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 168M 10. Jul 00:25 contig.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 691M  9. Jul 23:32 graph-20.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 193M 10. Jul 00:03 graph-30.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 169M 10. Jul 00:15 graph-40.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 160M 10. Jul 00:25 graph-50.fa
-rw-r--r--+ 1 brizolat eve_umbmsb 1,9G  9. Jul 23:06 kmer
-rw-r--r--+ 1 brizolat eve_umbmsb  776 10. Jul 00:25 log

The run took around 88 minutes (01:27:55) and peak memory usage was 27.761G, well within the requested time and memory. Therefore, the problem was not the resource request.

I have no idea what is causing this error and would deeply appreciate some help with it.

Thank you very much for the attention.

Best, Rodolfo

jvollme commented 5 years ago

I'm pretty sure that IDBA needs paired-end reads to work. As far as I understand it, the "long_read" argument of IDBA is only meant to supply additional long reads, in order to support the correct assembly and scaffolding of the paired short reads (and to help with finding overlapping k-mers if you choose a high k-mer length). You, however, supply only long reads, and that is not what IDBA expects as input...
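The stdout log posted earlier in this thread is consistent with that: it reports `reads 0` next to `long reads 2123607`, and every round ends with `aligned 0 reads`. A minimal Python sketch for spotting this pattern in a saved log, assuming the log keeps the line format shown in this thread (the function name `read_counts` is made up for illustration and is not part of IDBA):

```python
import re

def read_counts(log_text):
    """Extract the 'reads N' and 'long reads N' counters from IDBA stdout.

    Assumes the line format of the log posted in this thread.
    """
    counts = {}
    for line in log_text.splitlines():
        # Match only the two counter lines, not 'extra reads' or 'aligned 0 reads'.
        m = re.fullmatch(r"(reads|long reads) (\d+)", line.strip())
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

# Excerpt copied from the stdout posted above.
log = """number of threads 20
reads 0
long reads 2123607
extra reads 0
read_length 0
"""

counts = read_counts(log)
if counts.get("reads", 0) == 0:
    print("no short reads were loaded; long reads:", counts.get("long reads", 0))
```

A zero in the `reads` counter means nothing was supplied via the short-read option, which matches the explanation above.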

But it also seems that you are trying to assemble the results of a previous k-mer based assembly in order to achieve larger contigs. In theory that is of course possible, but doing so with a k-mer based assembler is not a good idea.

Firstly, all k-mer based assemblers split their input down to k-mer sized chunks and then assemble those, defeating the whole purpose of the previous assembly you did. Secondly, the way you try to supply the data, those contigs will be assembled without the paired-end information. The paired-end info helps k-mer based assemblers (such as megahit or idba) to resolve repetitive regions and decide which ambiguous parts they can safely assemble and which are too unclear and should be split. Without that info, you are likely going to produce a lot of misassembled hybrids.
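To make the first point concrete, here is a toy Python sketch of what "splitting down to k-mer sized chunks" means. It is not IDBA's implementation: it ignores reverse complements, coverage and error correction, and the helper names `kmers` and `branching_nodes` are made up for illustration. It shows that a contig is reduced to overlapping k-mers, and that a repeated (k-1)-mer creates a branch that cannot be resolved from the k-mers alone.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def branching_nodes(seq, k):
    """Return (k-1)-mers that have more than one distinct successor.

    Nodes are (k-1)-mers and each k-mer is an edge from its prefix to its
    suffix, as in a simplified de Bruijn graph.
    """
    successors = defaultdict(set)
    for kmer in kmers(seq, k):
        successors[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)

contig = "ACGTACGA"  # made-up sequence containing the repeat "ACG"
print(kmers(contig, 3))            # the contig as the assembler sees it
print(branching_nodes(contig, 3))  # nodes where the path is ambiguous
```

In this toy case the node `CG` is followed by both `GT` and `GA`, so the original order of the contig is no longer determined by the k-mers. Paired-end information is one of the things real assemblers use to decide such cases.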

I would highly recommend using a scaffolding tool that utilizes your contigs together with the original read pairs instead. For example, you could use SSPACE (paper here). But there are other scaffolders out there as well...

EDIT: @cczszb It seems you thought you were already supplying paired reads (judging from the name of your long-read input file). However, idba treats any input via the "long_read" argument as "unpaired" (see also issue #11), meaning you have to supply pairs via "-r".
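A minimal Python sketch of what the adjusted call could look like. It only builds and prints the command rather than running it. The options are the ones from the command in the first post, with the reads moved from `-l` to `-r`. The file name `PE_interleaved.fa`, the output directory `idba_out`, and the helper `build_idba_ud_command` are hypothetical, and the sketch assumes the read pairs are stored in a single FASTA file in the form idba_ud expects.

```python
import shlex

def build_idba_ud_command(paired_reads, out_dir, long_reads=None, threads=2):
    """Build an idba_ud argument list with paired reads passed via -r.

    Hypothetical helper for illustration; option values are copied from
    the command in the first post of this thread.
    """
    if not paired_reads:
        raise ValueError("paired reads are required; -l alone is treated as unpaired")
    cmd = [
        "idba_ud", "-r", paired_reads,
        "--mink", "27", "--maxk", "117", "--step", "10",
        "--num_threads", str(threads),
        "--pre_correction", "--no_bubble",
        "-o", out_dir,
    ]
    if long_reads:
        # Optional additional unpaired input, as described above.
        cmd += ["-l", long_reads]
    return cmd

cmd = build_idba_ud_command("PE_interleaved.fa", "idba_out")
print(shlex.join(cmd))
```

Printing the command first makes it easy to check which file goes to which option before submitting a long job.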

rotoscan commented 5 years ago

Hello @jvollme,

Thank you for the quick and very explanatory reply!

It did not occur to me that the -l flag was only meant to be used as an additional input. Your argument on k-mer based assembly makes a lot of sense. My rationale was to supply a smaller number of sequences in order to decrease the computation time.

I will try again with the paired-end reads and supply the contigs as additional information as you said.

Thanks again.

Best, Rodolfo

jvollme commented 5 years ago

@rotoscan, you're welcome. But do try out scaffolding with SSPACE instead as well. It takes your contigs and your read pairs, identifies the most likely contig order and connects them (either filling the gaps with partially mapping reads or, if that is not possible, with "N"s). I really do not think that idba (with or without supplying paired-end reads together with the long reads) is going to do what you want. The long-reads argument does not really use long reads as a reference for scaffolding (as you'd suspect from the naming), but just as another source for counting k-mers. It is really just a mislabelled "single-reads" option. Since your paired-end reads are by definition much more numerous than your assembled contigs, your supplied contigs will likely not change anything for the better.
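The scaffolding step described at the start of this comment can be illustrated with a toy Python sketch. It is not SSPACE's algorithm: here the contig order and the gap sizes are simply given as input, whereas a real scaffolder estimates them from where the read pairs map. The function name `join_with_gaps` and the sequences are made up for illustration.

```python
def join_with_gaps(contigs, gap_sizes):
    """Join already ordered contigs into one scaffold, padding gaps with N.

    gap_sizes[i] is the assumed gap between contigs[i] and contigs[i + 1].
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between the two contigs
        parts.append(contig)
    return "".join(parts)

scaffold = join_with_gaps(["ACGT", "TTGA", "CC"], [3, 1])
print(scaffold)
```

The contig sequences themselves stay intact in the scaffold, which is the key difference from feeding them back into a k-mer based assembler.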

Also, if it is only lower computational resource usage you are after, you could try digital normalization (using the khmer suite or BBTools) before assembling with IDBA (for de novo assemblies IDBA is really quite good, it's just not meant for scaffolding contigs of previous assemblies).
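For readers unfamiliar with digital normalization, here is a toy Python sketch of the general idea: a read is discarded once the median count of its k-mers, among the reads kept so far, has reached a cutoff. It is a simplification under stated assumptions. It uses an exact in-memory counter, ignores reverse complements and read pairing, and is not the khmer or BBTools implementation. The function name and the example reads are made up.

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k, cutoff):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Counts are taken over the reads kept so far, so highly redundant
    reads stop being added once their region is covered 'cutoff' times.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Five copies of one read plus one read from a different region.
reads = ["ACGTTGCA"] * 5 + ["GGGGCCCC"]
kept = normalize_by_median(reads, k=4, cutoff=2)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The redundant copies are dropped while the read from the low-coverage region is retained, which is why this kind of preprocessing can reduce the input size before assembly.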

Here is a short overview of de Bruijn graph assemblers that may help to explain the problem (taken from this paper).

Sabrin2020 commented 1 year ago

@jvollme can you please help me understand where the error from megahit here comes from: https://github.com/voutcn/megahit/issues/348
