loneknightpy / idba


idba_hybrid caused 'std::logic_error' #26

Open mictadlo opened 7 years ago

mictadlo commented 7 years ago

Hi, I tried to run idba_hybrid in the following way:

$ fq2fa --paired out_mit.fq out_mit.fa
$ idba_hybrid --reference mit.fasta -l out_mit.fa -o `pwd`/idba-mit --num_threads 10 --pre_correction --maxk 124 --step 5

but I received the following error:

terminate called after throwing an instance of 'std::logic_error'
   what():  SequenceReader::SequenceReader() istream is invalid

What did I do wrong?

Thank you in advance.

Mic

mooreryan commented 7 years ago

The error that you see is raised when the program has a problem opening an input file. Check your input files (mit.fasta and out_mit.fa). Are they actually FASTA? Are they zipped?

Also, I'm pretty sure the -l option is not meant to be used with paired reads.
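The input checks suggested above could be sketched roughly like this. It is a minimal sketch, not part of IDBA: the function name `check_fasta_input` is made up for illustration, and it only looks at whether the file exists, whether it starts with the gzip magic bytes, and whether the first non-empty line looks like a FASTA header.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def check_fasta_input(path):
    """Return a list of problems found with a FASTA input (empty list if none).

    Hypothetical helper for illustration; it is not part of IDBA.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file does not exist or is not a regular file"]

    first = b""
    with p.open("rb") as handle:
        if handle.read(2) == GZIP_MAGIC:
            return [f"{path}: looks gzip-compressed"]
        handle.seek(0)
        # Find the first non-empty line.
        for line in handle:
            if line.strip():
                first = line.strip()
                break

    if not first:
        return [f"{path}: file is empty"]
    if first.startswith(b"@"):
        return [f"{path}: first record starts with '@', so it may be FASTQ"]
    if not first.startswith(b">"):
        return [f"{path}: first record does not start with '>'"]
    return []
```

For the command in the first post, this would be called once for mit.fasta and once for out_mit.fa. An empty list only means these basic checks passed, not that the assembler will accept the file.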

Thomieh73 commented 7 years ago

Hi, I am having the same problem, but it only appears after several rounds of making assemblies with different k-mer sizes. So my conclusion is that the problem is something other than the formatting. But correct me if I am wrong here.

Here is why I think the format is not the problem. The data I used is some old 454 data (not paired-end). Since they were 454 sequences in sff format, they contained a lot of errors and ambiguous nucleotides, so I did quality trimming with BBduk (http://seqanswers.com/forums/showthread.php?t=42776) inside Geneious. Then I converted those sequences to FASTA with Geneious. With these cleaned sequences I ran idba_ud with the following command:

docker run -v pwd:/Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff -w //Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff loneknightpy/idba idba -l FFNO7VY01_trimmed.fasta -o output --num_threads 4

The last bit of the screen output:

kmer 40
kmers 5580373 5583550
merge bubble 2
contigs: 2232 n50: 7603 max: 59605 mean: 2454 total length: 5478509 n80: 2767
aligned 0 reads
confirmed bases: 0 correct reads: 0 bases: 0
kmer 50
kmers 5560890 5562784
merge bubble 3
contigs: 3030 n50: 6389 max: 67446 mean: 1832 total length: 5552969 n80: 2428
terminate called after throwing an instance of 'std::logic_error'
   what():  SequenceReader::SequenceReader() istream is invalid

This error is consistent. I noticed that it comes after the maximum kmer has been completed with my data and setup. If I set --maxk 50 or --maxk 100, then the error occurs after processing that kmer size.

The output folder contains contig and graph files in fasta format, but it also contains files like "align-30". Those are empty files. Is that correct? Could it be that the aligning/mapping is not working in my setup?

Thomieh73 commented 7 years ago

Hi, I tried running idba_ud with a different dataset. This time I used a fastq dataset produced using paired-end MiSeq.

The commands I used were:

docker run -v pwd:/Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff -w //Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff loneknightpy/idba fq2fa --merge --filter Haverkamp-gDNA-1074_S3_L001_R1_001.fastq Haverkamp-gDNA-1074_S3_L001_R2_001.fastq TT_1074_read_pairs.fasta

docker run -v pwd:/Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff -w //Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff loneknightpy/idba idba_ud -l TT_1074_read_pairs.fasta -o TT_output --maxk 100 --step 20

This crashed with the error:

kmer 80
kmers 9933667 9968341
merge bubble 281
contigs: 19430 n50: 374 max: 8195 mean: 185 total length: 3611198 n80: 80
aligned 0 reads
confirmed bases: 0 correct reads: 0 bases: 0
distance mean -nan sd -nan
invalid insert distance
kmer 100
kmers 6931669 6947769
merge bubble 1189
contigs: 395 n50: 68041 max: 180769 mean: 4771 total length: 1884724 n80: 39795
terminate called after throwing an instance of 'std::logic_error'
   what():  SequenceReader::SequenceReader() istream is invalid

Checking the idba output, I noticed that it did not align reads and did not calculate the insert distance. I wondered if this was due to not using short reads (< 128 bp). So I took my raw MiSeq fastq dataset and processed it in Geneious to remove adapters, ambiguous bases, and sequences shorter than 100 bp. That gave me a dataset with reads between 100 and 251 bp. Next I trimmed all the reads down to 125 bp by trimming off the 3' end. This short paired-end dataset was then used in idba_ud.

Commands:

docker run -v pwd:/Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff -w //Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff loneknightpy/idba fq2fa --merge --filter Trimmed_Ts_1074_S3_L001_R_0011.fastq Trimmed_Ts_1074_S3_L001_R_0012.fastq TT_1074_read_pairs.fasta

docker run -v pwd:/Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff -w //Users/Thomieh_cloud/Data_Thomas/Bacillus_genomes/sff_files/AH725_sff loneknightpy/idba idba_ud -r TT_1074_read_pairs.fasta -o TT_output --maxk 100 --step 20

The output I got was:

kmer 80
kmers 1807416 1807454
merge bubble 8
contigs: 93 n50: 167003 max: 352078 mean: 19367 total length: 1801171 n80: 61564
aligned 1955299 reads
confirmed bases: 1788622 correct reads: 1938225 bases: 149
distance mean 272.586 sd 87.5581
seed contigs 56 local contigs 186
kmer 100
kmers 1796688 1796705
merge bubble 4
contigs: 78 n50: 180420 max: 352118 mean: 23104 total length: 1802142 n80: 68041
reads 1975930
aligned 1956232 reads
distance mean 272.636 sd 87.6131
expected coverage 1.09151
edgs 35
contigs: 72 n50: 180420 max: 352118 mean: 25017 total length: 1801257 n80: 68041

This finished without any errors.

So the error is due to the lack of short reads. When those are not provided, idba will not work properly and will crash with this error, since it expects the short reads for the final steps of the assembly.

mooreryan commented 7 years ago

FWIW, I managed to recreate the error in the manner you described. IDBA needs paired reads. Long reads don't need to be paired, but you at least need some paired reads it seems.

I saw that you trimmed the short reads so they would be within the limit for the -r option. You can modify kMaxShortSequence in the short sequence header file, then recompile the software so that you can use longer reads.
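Before deciding whether to trim or recompile, it may help to check how long the reads in a FASTA file actually are. The following is a rough sketch, assuming the default limit of 128 bases for -r mentioned in this thread. The function names and the `max_short` parameter are made up for illustration and are not part of IDBA.

```python
def max_read_length(fasta_path):
    """Return the length of the longest sequence in a FASTA file.

    Handles sequences that are wrapped over several lines.
    """
    longest = 0
    current = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: close the previous one.
                longest = max(longest, current)
                current = 0
            else:
                current += len(line)
    return max(longest, current)


def fits_short_read_limit(fasta_path, max_short=128):
    """True if every read is within the assumed -r length limit.

    max_short=128 is the default limit quoted in this thread; pass the
    value you compiled with if kMaxShortSequence was changed.
    """
    return max_read_length(fasta_path) <= max_short
```

If this returns False for a paired-end file, the options discussed here are to trim the reads or to raise kMaxShortSequence and recompile.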

Thomieh73 commented 7 years ago

Thanks, I will try recompiling for longer read lengths.

jordanashworth commented 7 years ago

@Thomieh73 I'm having the same problems. Did recompiling for longer lengths work for you?

skerker commented 7 years ago

Just to clarify: do you think it would work with the -l option if I have PE150 reads? Or maybe I should just trim my reads down so I can use the -r option instead of -l?

Thanks, Jeff

mooreryan commented 7 years ago

@skerker I wouldn't trim down your reads so they fit in the 128 base limit. It would eliminate the benefits of having longer reads. Rather, recompile the program with the change in kMaxShortSequence as mentioned above and use the -r option.

skerker commented 7 years ago

Thanks for the help. I'll recompile and give it a try using PE150 and -r

Young331 commented 5 years ago

Hi, I tried to run idba_ud after changing kMaxShortSequence from 128 to 150, as mentioned above, and recompiling. But the following command line still produces an error. In the error output I can see "fasta read file (<=150)", but it still says maxk is too large. Are there any other parameters that need to be changed?

The command I used was:

$idba_ud -r DNA-18_merge_idba.fa -o assembly_18_idba_output --mink 79 --maxk 149 --step 10 --min_contig 1000 --num_threads 72

This crashed with the error:

maxk is too large
IDBA-UD - Iterative de Bruijn Graph Assembler for sequencing data with highly uneven depth.
Usage: idba_ud -r read.fa -o output_dir
Allowed Options:
  -o, --out arg (=out)          output directory
  -r, --read arg                fasta read file (<=150)
      --read_level_2 arg        paired-end reads fasta for second level scaffolds
      --read_level_3 arg        paired-end reads fasta for third level scaffolds
      --read_level_4 arg        paired-end reads fasta for fourth level scaffolds
      --read_level_5 arg        paired-end reads fasta for fifth level scaffolds
  -l, --long_read arg           fasta long read file (>150)
      --mink arg (=20)          minimum k value (<=124)
      --maxk arg (=100)         maximum k value (<=124)
      --step arg (=20)          increment of k-mer of each iteration
      --inner_mink arg (=10)    inner minimum k value
      --inner_step arg (=5)     inner increment of k-mer
      --prefix arg (=3)         prefix length used to build sub k-mer table
      --min_count arg (=2)      minimum multiplicity for filtering k-mer when building the graph
      --min_support arg (=1)    minimum supoort in each iteration
      --num_threads arg (=0)    number of threads
      --seed_kmer arg (=30)     seed kmer size for alignment
      --min_contig arg (=200)   minimum size of contig
      --similar arg (=0.95)     similarity for alignment
      --max_mismatch arg (=3)   max mismatch of error correction
      --min_pairs arg (=3)      minimum number of pairs
      --no_bubble               do not merge bubble
      --no_local                do not use local assembly
      --no_coverage             do not iterate on coverage
      --no_correct              do not do correction
      --pre_correction          perform pre-correction before assembly

th-of commented 5 years ago

You have to raise the value by 2^x, so try 256.

th-of commented 5 years ago

I made a mistake in my previous reply. Change the value to 256 (2^8). Does it work now?

mooreryan commented 5 years ago

Just checking: does the $idba_ud variable point to the newly compiled version that you compiled after changing the max kmer size in the header file, or does it still point to the original version of the idba_ud program?
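One way to check which executable a bare `idba_ud` command would resolve to is sketched below. It uses only the Python standard library. The function name `describe_binary` is made up for illustration, and the sketch only covers lookups through PATH. If the program is started through a shell variable or an absolute path, that path has to be checked instead.

```python
import os
import shutil
import time


def describe_binary(name="idba_ud", path=None):
    """Return the resolved location and modification time of an executable.

    Returns None if the name is not found on the search path.
    `path` defaults to the PATH environment variable.
    """
    resolved = shutil.which(name, path=path)
    if resolved is None:
        return None
    real = os.path.realpath(resolved)  # follow symlinks to the actual file
    return {
        "path": real,
        "modified": time.ctime(os.path.getmtime(real)),
    }
```

If the reported modification time is older than the recompilation, the old binary is probably still the one being picked up.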

Young331 commented 5 years ago

I'm sorry that I misunderstood 2^x. I tried again with 256. Unfortunately, it doesn't work. I'm sure that it is the newly compiled version, because I deleted all related files before I reinstalled and compiled it. The allowed options also show "fasta read file (<=256)".

I'm sorry I can't figure out the reason. I would be very thankful if you could run a test with my file. My fasta file is very large, so I split off a small part of it (DNA-18_merged_part.fa.gz). Thank you in advance. https://drive.google.com/drive/folders/1kradtYYqpFU1ARgzGclC8Dx7Us0vbati?usp=sharing

mooreryan commented 5 years ago

Ohh, I know what the problem is. The --maxk parameter is too large. The kMaxShortSequence variable doesn't control that. That is controlled by this line: https://github.com/loneknightpy/idba/blob/a1bafe6b012912cd9a76926ec75a98aee6213af6/src/basic/kmer.h#L200.

You should either reduce the --maxk parameter, or change that line to something else (8 would work) and then recompile the program. I'm not sure this is a good idea, though. I'm pretty sure it will only affect the amount of memory used, but I haven't gone through the code in depth, so you may want to just reduce --maxk instead.
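A back-of-the-envelope sketch of why 8 would be enough follows. Everything in it is an assumption that has not been checked against the IDBA source: that each base takes 2 bits, that the value on that line counts 64-bit words, that the default is 4 words, and that 8 bits are reserved for bookkeeping. These numbers were picked because they reproduce the 124 limit shown in the usage text. The function names are made up for illustration.

```python
BITS_PER_WORD = 64
BITS_PER_BASE = 2          # assumption: A/C/G/T packed into 2 bits each
ASSUMED_RESERVED_BITS = 8  # assumption: chosen so that 4 words gives 124


def assumed_max_k(num_uint64, reserved_bits=ASSUMED_RESERVED_BITS):
    """Largest k that fits in num_uint64 words under the assumptions above."""
    return (num_uint64 * BITS_PER_WORD - reserved_bits) // BITS_PER_BASE


def maxk_fits(maxk, num_uint64):
    """True if the requested --maxk is within the assumed limit."""
    return maxk <= assumed_max_k(num_uint64)
```

Under these assumptions, 4 words give a limit of 124, which matches the "(<=124)" in the usage text, and 8 words give 252, which would cover --maxk 149. The limit printed by the recompiled program's own usage text is the authoritative number.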

jvollme commented 5 years ago

Changing the kNumUint64 parameter to increase the maxk limit was also described here: https://groups.google.com/forum/#!topic/hku-idba/p8YpZL46dtI. Linking it here for completeness' sake.

Young331 commented 5 years ago

Got it! Thank you very much!

Thomieh73 commented 3 years ago

I forgot all about this issue. I consider it done. If needed you can close the issue now.