ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

Stuck at seed retrieval with v2.3.2 and v2.4 #10

Closed charlesfoster closed 7 years ago

charlesfoster commented 7 years ago

Dear Nicolas,

I can see that a similar issue with seed retrieval was raised before, but I thought I should raise a new issue since this is for a new version of the program.

I have used NOVOPlasty for 20 different sets of reads now. It works for 14 of them (although a nice full genome is only assembled for one or two), but for the other 6 it gets stuck at seed retrieval. Although you have noted that this is sometimes caused by unsupported read formats, I don't think this is the case here since all 20 sets of reads came from the same Illumina run, and all are in the same format. The settings I use are:

Project name = Pimelea_45
Insert size = 500
Insert size aut = yes
Read Length = 150
Type = chloro
Genome Range = 120000-200000
K-mer = 39
Insert Range = 1.5
Insert Range strict = 1.2
Single/Paired = PE
Coverage Cut off = 1000
Extended log = 1
Combined reads =
Forward reads = 45_1.fq
Reverse reads = 45_2.fq
Seed Input = ChloroSeed.fasta

Unfortunately I can't provide an error message since the program doesn't provide one, unless I'm missing something - the logfile remains empty since it's stuck at seed retrieval. Any suggestions?

Otherwise, thanks for creating a great program.

Cheers, Charles

ndierckx commented 7 years ago

Hi,

Thanks for your comment. The only other problems could be coverage that is too low or a seed that is too distant, but the latter is unlikely. So how high is your coverage?

If it's too low, I would recommend lowering the K-mer to 23 and maybe raising the insert range to 1.6 (sometimes the library's insert size is very variable). If you are convinced you have sufficient coverage, you could send me one of your datasets so I can look for the problem myself.

Greets

charlesfoster commented 7 years ago

Hi,

I've tested a few seeds and have the same issue, so it probably isn't that. For the attempts that got stuck on seed retrieval, the coverage ranges a fair bit: the lowest average coverage is 6.8X; the highest is 41X (as estimated with samtools). The most successful assembly where the seed retrieval was fine had an average coverage of just over 200X.

I tried changing the parameters as you suggested, but unfortunately I was still stuck on seed retrieval. I will email you the reads as you suggested - thanks!

Cheers, Charles

ndierckx commented 7 years ago

Hi,

The dataset you sent me works for me with 2.4; at least the assembly starts (no seed problem). It terminates at 1193 bp, but that's because of the extremely low coverage.

You only have 8534 reads, that's not nearly enough to assemble a chloroplast genome! And I am guessing it's capture DNA, while NOVOPlasty is designed for WGS data, because capture DNA often contains gaps, unless it's designed very well.
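As a rough illustration of why 8534 reads is so few, here is a minimal sketch of the usual back-of-the-envelope coverage estimate (reads × read length / genome size). It assumes every read maps to the chloroplast, so it is an upper bound; the function name is made up for this example, and the read length and genome sizes are taken from the settings posted above.

```python
def estimated_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Upper-bound average coverage, assuming every read maps to the genome."""
    return n_reads * read_length / genome_size


# 8534 reads of 150 bp, evaluated at both ends of the configured
# Genome Range (120000-200000 bp).
for genome_size in (120_000, 200_000):
    coverage = estimated_coverage(8534, 150, genome_size)
    print(f"{genome_size} bp genome: about {coverage:.1f}X")
```

This gives roughly 6X to 11X depending on the true genome size, and the real figure is lower wherever reads are unevenly distributed or off-target.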

charlesfoster commented 7 years ago

Thank you for testing that for me. I'm not sure why the seed retrieval didn't initially work for me. However, I have trialled it again now with yet another seed, and this time it ran for me - strange.

The file I sent was from my worst sample, because I thought that if that one could work then any others should. Out of interest, what is the minimum number of reads that you would recommend for a successful assembly with NOVOPlasty?

Apart from that question, that's all from me - so thanks again!

ndierckx commented 7 years ago

In theory, 6X coverage could be enough, but only if it is evenly distributed. This is often not the case, so I would recommend 30X, and higher is better still. You could try different runs with different seeds if you have only a few coverage gaps; if there are too many gaps, I would recommend a graph-based assembler.
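To make the point about evenness concrete, here is a small sketch that summarises a per-position depth table: it reports the mean depth together with the fraction of positions below a threshold. It assumes three tab-separated columns (reference name, position, depth), which is the layout `samtools depth -a` writes; the function name, the 6X threshold default, and the example data are made up for illustration.

```python
import io


def summarize_depth(lines, min_depth=6):
    """Return (mean depth, fraction of positions with depth below min_depth).

    Each non-empty line must hold three tab-separated fields:
    reference name, position, depth.
    """
    depths = [int(line.rstrip("\n").split("\t")[2]) for line in lines if line.strip()]
    if not depths:
        raise ValueError("no depth records found")
    mean = sum(depths) / len(depths)
    low_fraction = sum(1 for d in depths if d < min_depth) / len(depths)
    return mean, low_fraction


# Made-up four-position example: a healthy-looking mean hides two weak positions.
example = io.StringIO("chl\t1\t40\nchl\t2\t38\nchl\t3\t0\nchl\t4\t2\n")
mean, low_fraction = summarize_depth(example)
print(f"mean depth {mean:.1f}X, {low_fraction:.0%} of positions below 6X")
```

In this toy input the mean is 20X, yet half the positions fall below 6X, which is the situation where an average coverage figure alone is misleading.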

And since you have filtered data, you could try assembling each chloroplast from the complete pooled dataset; it should work if the different taxa aren't too closely related.

And do your previously failed datasets still have SEED problems?

charlesfoster commented 7 years ago

Great, thanks for the advice. I will try using a graph-based assembler for some of the worse data sets, as well as trying assemblers on the entire pooled data. The other previously failing datasets are now working after choosing an appropriate 150 bp read from within the fq file. Thanks.
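For anyone wanting to do the same, here is a minimal sketch of pulling one read out of a FASTQ file and saving it as a FASTA seed. The selection rule (first read of exactly 150 bases with no N) is only an assumption for illustration, not necessarily how the read above was chosen, and it only makes sense if the reads are already filtered to chloroplast sequence. The function names are hypothetical, and plain four-line FASTQ records are assumed.

```python
def first_clean_read(fastq_path, length=150):
    """Return (name, sequence) of the first read with exactly `length` bases
    and no N, or None if no such read exists. Assumes 4-line FASTQ records."""
    with open(fastq_path) as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                return None
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            if len(seq) == length and "N" not in seq.upper():
                return header[1:].strip(), seq


def write_seed(fastq_path, fasta_path, length=150):
    """Write the selected read as a single-record FASTA file.

    Returns True if a seed was written, False if no read qualified."""
    read = first_clean_read(fastq_path, length)
    if read is None:
        return False
    name, seq = read
    with open(fasta_path, "w") as out:
        out.write(f">{name}\n{seq}\n")
    return True
```

With the files from the settings above, a call might look like `write_seed("45_1.fq", "ChloroSeed.fasta")`, after which the Seed Input line can point at the resulting file.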

ndierckx commented 7 years ago

A graph-based assembler will just give you small contigs, so don't hope for a whole genome, but you will have something at least. I would try NOVOPlasty on the whole dataset if you have a unique seed for each taxon. Although if you had a good reference to filter the data, the results will not improve, I guess.