mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

returned non-zero exit status -7 #44

Closed HTaekOppa closed 6 years ago

HTaekOppa commented 6 years ago

Dear Flye,

Hope this message finds you well. While testing the program on a PacBio dataset (genome size 1.5Gb) in a PBSpro environment, I have constantly bumped into the same "returned non-zero exit status -7" error. FYI, please see the attached output file below.

Looking forward to your reply! Flye_026T_Output.txt

Regards,

Taek

mikolmogorov commented 6 years ago

Hi,

Looks like there were some issues with repeat graph construction. If you could send me the full log file (flye.log) I should be able to tell more. Also, how much memory does the machine have - could this be an issue? Also, what kind of genome is it - are there many repeats? What is the coverage?

HTaekOppa commented 6 years ago

Hi,

Thank you for the reply. I have tried again, but I got the same error. I have attached the log file as you requested.

Re memory, I requested 1.4Tb but it used only 600Gb; I am not sure why. Our university HPC can offer nodes with a few terabytes of memory. Re repeats, yes. Like other plant/crop species, the target species is highly repetitive (e.g. tri- and tetraploid, with a 2Gb~3.5Gb genome size). The PacBio depth was around 60x for this Flye test.

In my experience, while Flye works fine for small to mid-size genomes (less than 1Gb), it has been a problem for organisms with large genomes (ploidy issues aside).

Looking forward to your reply!

Cheers,

Taek flye_log.txt

mikolmogorov commented 6 years ago

Hi,

Thank you. It looks like the last log shows an out-of-memory issue. Interestingly, it happened after the point where all major memory allocations are made, so it would probably have worked with an extra 50-100G.

For a 30x human dataset, Flye required ~700G of RAM. We are aware of this bottleneck and are working on reducing the memory footprint, but it will require some significant changes to the algorithm. In the meantime, I would recommend downsampling the input reads to, say, 30-40x coverage so that the contig assembly stage has enough RAM. Afterwards, you can run the repeat resolution stage with the full set of reads using "--resume-from repeat". You might also double-check that the cluster actually gives you the amount of RAM you requested.
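To make that concrete, the workflow could look roughly like the following. This is only a sketch: seqtk is one possible tool for random subsampling, and the read file names, output directory, genome size, and 0.6 sampling fraction are placeholders to adjust for your data.

    # randomly keep ~60% of reads to go from ~55x down to ~30-35x coverage
    seqtk sample -s100 pacbio_reads.fasta 0.6 > pacbio_subsampled.fasta

    # initial contig assembly on the downsampled reads
    flye --pacbio-raw pacbio_subsampled.fasta --genome-size 1.5g \
         --out-dir flye_out --threads 32

    # repeat resolution onwards with the full read set, reusing the same output directory
    flye --pacbio-raw pacbio_reads.fasta --genome-size 1.5g \
         --out-dir flye_out --threads 32 --resume-from repeat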

HTaekOppa commented 6 years ago

Hi,

Unfortunately, no luck so far. Re 30-40x coverage, I have tested this with a smaller dataset and genome size and it worked fine. Re full coverage and the larger genome, it worked up to "1-consensus" but failed in "2-repeat". Re the RAM requested, while I requested enough RAM (more than 700Gb), the flye command stopped after using only 400~500Gb. After a chat with our university HPC team, they said: "Looks like his program is designed to work well on smaller assembly jobs, and he hasn't thought too hard about minimising the memory footprint. I don't think there is anything we can do to make it work other than what the developer says."

I think the fundamental problems need to be fixed in order to cope with large genomes and highly repetitive sequences.

Any idea?

mikolmogorov commented 6 years ago

Hi,

Do I understand correctly that you tried the 1.5Gb genome downsampled to 30-40x coverage and it worked, but the assembly of the original dataset at 55x failed during repeat analysis? Do you have the log for the last assembly?

HTaekOppa commented 6 years ago

Hi,

I have uploaded the log from the last assembly (--resume-from repeat). I might have deleted the log file for the down-sampled run, because I need to see the assembly outcome from the full original dataset (55x). Any suggestion would be greatly appreciated. flye_CqT_repeat.txt

mikolmogorov commented 6 years ago

Hi,

Based on the log, it seems that it ran out of memory during the realignment of reads back to the graph. Usually this is not a memory bottleneck, but in the case of repeat-rich genomes it apparently can be. We are working on making Flye more memory-efficient in general, but that will probably take some time.

I think the fastest solution for you right now would be to re-run the repeat module with a reduced number of threads (say, 7-8). This should reduce the memory usage and allow it to get through the read realignment stage. Once the repeat analysis is complete (and polishing begins), you can stop it and then resume (--resume-from polishing) with the desired number of threads (polishing should not require a lot of memory).
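Concretely, that two-step restart might look like this (same placeholder file names, output directory, and genome size as in the earlier sketch; the thread counts are just examples):

    # repeat analysis with fewer threads to keep peak memory down
    flye --pacbio-raw pacbio_reads.fasta --genome-size 1.5g \
         --out-dir flye_out --threads 8 --resume-from repeat

    # once polishing has started, stop the run and resume it with more threads
    flye --pacbio-raw pacbio_reads.fasta --genome-size 1.5g \
         --out-dir flye_out --threads 32 --resume-from polishing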

Sorry for this struggle - it looks like there are some unforeseen bottlenecks in the assembly of large and repeat-rich genomes. There are not many datasets of this kind currently available, so it is hard to predict how they will behave.

HTaekOppa commented 6 years ago

Hi,

After re-running with a reduced number of threads, the "2-repeat" stage was successful (I hope). It then continued into the "3-polishing" stage until the error message came up. I have attached the error and log files. FYI, I have not yet tried resuming with "--resume-from polishing".

Given the circumstances, do you think it would be OK to resume with "--resume-from polishing"?

Apart from this, how long would it take to fix the memory bottlenecks for large and repeat-rich genomes? There are two more genome datasets on my list (twice the genome size, with high repeat content). These have been failing at the first "0-assembly" stage after using 1.3Tb of memory.

Regards,

Taek Flye_026T_Polish_Stage.txt flye_log_Polish_Stage.zip

mikolmogorov commented 6 years ago

Almost there :)

Please try the latest version from the 'flye' branch - I recently pushed some fixes for the bubble generation stage, which will hopefully help. I don't think it is in anaconda yet, so you will need to install it from source (let me know if there are any issues). You can now use '--resume-from polishing' to restart (or simply --resume).
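Installing from the branch should be something along these lines (the clone URL and the setup.py install step are my reading of the repository layout; the read file, output directory, genome size, and thread count are the same placeholders as before):

    # get the 'flye' branch with the bubble generation fixes
    git clone https://github.com/mikolmogorov/Flye
    cd Flye
    git checkout flye
    python setup.py install    # or: python setup.py install --user

    # restart the existing run from the polishing stage
    flye --pacbio-raw pacbio_reads.fasta --genome-size 1.5g \
         --out-dir flye_out --threads 32 --resume-from polishing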

We are working on reducing the memory usage, but it is hard to predict how much time it will take, since some of the algorithms will be completely different. Maybe we will have some kind of beta version in a month or so.

HTaekOppa commented 6 years ago

Hi,

Just wondering whether you have resolved any issues relating to “returned non-zero exit status -7”. flye_log.txt Flye_SR849.txt

I have reinstalled from Bioconda (the 2.3.4 release) and tested it again with PacBio maize B73 data under the PBSpro system. The genome size is around ~2.5Gb. After running for 213 hours (please see the attached file), it stopped again for the same reason. While the requested memory was 1.6Tb (on a 6Tb memory node), Flye used only 1.4Tb before stopping (see the error message in lines 17-18 of the attached file). Just in case, I have also uploaded the log file.

While Flye takes some time to assemble the data, the overall contiguity and completeness (BUSCO) were really promising. So I do want to see the final assembly for this PacBio maize B73 dataset to understand whether Flye can handle big and complex genomes. I am expecting to run it on a 4Gb allotetraploid genome soon.

Cheers,

mikolmogorov commented 6 years ago

Hi,

First off, were you finally able to polish your previous run?

There were no updates regarding memory usage in the latest version. There is some ongoing work, but in the meantime I would recommend following the same strategy: assemble with the 30-40x longest reads, then run repeat resolution / polishing with the full set of reads. 100x seems like a lot for the initial stages. I will also try to add this functionality (taking the top X% of reads for assembly) in the next version; see the sketch below.
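As a sketch of the "longest 30-40x" idea: one option is to pre-filter the reads down to a target number of bases before the initial assembly, for example with Filtlong (a swapped-in third-party tool, not part of Flye; the ~90 Gb target below corresponds to roughly 35x of a ~2.5Gb genome, and file names and sizes are placeholders):

    # keep the longest/best reads up to roughly 35x of a ~2.5Gb genome
    filtlong --target_bases 90000000000 pacbio_reads.fastq > pacbio_longest_35x.fastq

    # initial assembly on the reduced set; afterwards resume repeat resolution
    # and polishing with the full read set, as described earlier in the thread
    flye --pacbio-raw pacbio_longest_35x.fastq --genome-size 2.5g \
         --out-dir flye_maize --threads 32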

HTaekOppa commented 6 years ago

Hi,

Re polishing, it won't be a problem (I haven't done it yet) because I am going to use Illumina data. I have tried this on another dataset and it worked fine.

Re the memory usage issue, this is a more pressing request. My real dataset (genome size 3~4Gb with 50~100x coverage) will be bigger than the current test one (genome size 2~2.5Gb with 50~100x) and more complicated (higher genome complexity due to polyploidy). While I can see the logic behind starting from a smaller dataset (an initial 30~40x depth/coverage), this does not look ideal to me because the left-over data cannot be used. So I am asking whether you could fix this issue (add this functionality) in Flye any time soon. Or, if you do not mind, could you please test it (initial 30~40x depth/coverage) on your end? You can download the data from https://www.ncbi.nlm.nih.gov/sra/?term=SRX1472849.

Re the next version, do you have any idea when it would be released? While I would love to test Flye on my real data, if the release of the new version takes longer than I expect, I might have to look at alternative options (other assemblers).

Regards,

mikolmogorov commented 6 years ago

I know we all have high expectations, but I can't give any estimates right now because it is not a question of implementation, but rather a new algorithmic problem we need to solve. It is always a good idea to try all available assemblers and find out which one works best for you.

mikolmogorov commented 6 years ago

FYI, we have just pushed an updated version 2.3.6b into the 'flye' branch (it is not in the releases yet). It now consumes ~30% less memory on large genomes, and you can decrease this further by setting the '--asm-coverage' argument, which controls how much sequence is used in the initial contig assembly (I recommend setting it to 30 for your high-coverage assemblies).
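For a high-coverage run, the full command would then look something like this (branch install as above; the read file, genome size, output directory, and thread count are placeholders):

    # use only ~30x worth of reads for the initial contig assembly,
    # while repeat resolution and polishing still use the full read set
    flye --pacbio-raw pacbio_reads.fasta --genome-size 2.5g \
         --out-dir flye_out --threads 32 --asm-coverage 30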

HTaekOppa commented 6 years ago

Thanks a lot!

Re 2.3.6b, is this the newest version that can cope with the "returned non-zero exit status -9" error? Re --asm-coverage, I can try it with 2.3.6b if this is the right version. However, is there any further info/link where I can see the full list of parameters and options? In particular, I would like to use a read cutoff-length option (e.g. 1,000 bp) if Flye has one.

Regards,