Unfortunately, the closest possible restart point would be the beginning of repeat graph analysis (--resume-from repeat). I definitely recommend using version 2.7.1 for that.
What was the bottleneck of your run that terminated? If you can post the log file, I might have some extra suggestions.
Best, Mikhail
Hi Mikhail, here's the log as it currently stands. FLYE 2.6 is still running, but the step from 80% to 90% took three weeks, to give you an idea of the performance we're seeing. The machine it is on has 72 cores (144 threads) and 6TB of RAM. The repeat stage has been running since December, so it is a bit of a blow if we have to go back to then unless the 2.7.1 version moves much faster. The 20-repeat directory has no files in it as yet.
Here's stdout:
[2019-12-22 00:01:29] INFO: Computing consensus
[2019-12-24 16:45:07] INFO: Alignment error rate: 0.104049
[2019-12-24 17:29:36] INFO: >>>STAGE: repeat
[2019-12-24 17:29:36] INFO: Building and resolving repeat graph
[2019-12-24 17:29:42] INFO: Reading sequences
[2019-12-24 18:52:54] INFO: Building repeat graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2020-02-09 19:25:19] INFO: Median overlap divergence: 0.0609317
[2020-02-12 09:41:37] INFO: Aligning reads to the graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
flye.log has this but hasn't been written to since February:
[2020-02-12 09:30:05] DEBUG: Total edges: 1393150
[2020-02-12 09:35:12] DEBUG: Building positional index
[2020-02-12 09:35:16] DEBUG: Total sequence: 25706452598 bp
[2020-02-12 09:41:37] INFO: Aligning reads to the graph
[2020-02-12 09:43:16] DEBUG: Hard threshold set to 1
[2020-02-12 09:43:16] DEBUG: Started k-mer counting
[2020-02-12 10:05:47] DEBUG: Repetitive k-mer frequency: 778
[2020-02-12 10:05:47] DEBUG: Filtered 1716743 repetitive k-mers (0.00105381)
[2020-02-12 10:06:07] DEBUG: Sampling rate: 2
[2020-02-12 10:06:07] DEBUG: Solid k-mers: 1627358628
[2020-02-12 10:06:07] DEBUG: K-mer index size: 8481063719
[2020-02-12 10:06:07] DEBUG: Mean k-mer frequency: 5.21155
[2020-02-12 10:37:50] DEBUG: Sorting k-mer index
I see. How big is the genome and what type of reads are you using?
Version 2.7 will eliminate one of the bottlenecks you most likely faced in your run, but the compute time for read alignment should be roughly the same, unfortunately.
One thing to check is how many threads to use. We have rarely used more than 50 on our hardware. It could be that at some point performance stops scaling as the thread count increases and can actually go down. I'd check that Flye is utilizing all the CPUs you have requested at 100%. If not, it's possible that reducing the number of threads might actually make the assembly faster. Also, if you see that processes are spending a lot of time in system space, rather than user space, this can also be an indicator that the number of threads needs to be reduced. Finally, extra threads use more memory - check that you are not hitting the system limits. If other memory-intensive software is running on the same machine, it might also slow Flye down significantly because of memory fragmentation.
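For reference, a quick way to sanity-check this on Linux might look like the lines below (hypothetical commands - the flye-modules process name and the availability of pidstat/sysstat are assumptions, so adapt to your system):

# per-thread CPU usage; the %us vs %sy split hints at user vs system time
top -H -p "$(pgrep -f flye-modules | head -n1)"
# %usr vs %system for the process, sampled every 5 seconds (needs sysstat)
pidstat -u -p "$(pgrep -f flye-modules | head -n1)" 5
# overall memory headroom in GB
free -g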
Best, Mikhail
Hi Mikhail, yeah, the genome is 25Gbp give or take. I had previously tried FLYE and you had baulked at the size and said it wouldn't work, but that was a while back. I had been using MaSuRCA as the assembler, but while that was running Aleksey swapped out CABOG for Flye as the assembler and provided instructions on using Flye with the intermediate file from MaSuRCA, which is what I did. Flye is certainly using all cores: I allocated 64 cores and I am seeing close to 6400% utilisation and 700GB of RAM usage (the machine has 6TB, so not even close). Here's the command that Aleksey recommended using to run Flye with the error-corrected data:
GS=`cat ESTIMATED_GENOME_SIZE.txt` &&
If we're to restart the assembly with Flye it would also be helpful if you could comment on what Aleksey is recommending here in case there are further optimisations that will help the job complete in a reasonable time and with a good assembly.
Thanks. I actually do have some good news for the corrected reads assembly - I've been working on integrating minimizer indexing for this kind of input, which should improve speed and memory usage substantially. It's not quite ready yet, but in a couple of weeks we should have a first beta version.
In the meantime, if you have the resources, I'd suggest running the latest release anyway. The command line looks good, but I suggest using Flye 2.7.1 (--resume-from repeat). Since the pipeline has changed a bit, you'd need to override its version in the params.json file (change the pipeline version from 2 to 3).
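Purely as an illustration (the exact key name inside params.json may differ between releases, and the reads file name and genome size below are placeholders), the override plus restart could look roughly like this:

# locate the pipeline version entry first, then bump it from 2 to 3
grep -i version flye_assembly/params.json
sed -i 's/"pipeline_version": 2/"pipeline_version": 3/' flye_assembly/params.json   # key name assumed
# re-run the original command line with the newer Flye, adding --resume-from
flye --pacbio-corr corrected_reads.fa --genome-size 25g \
     --out-dir flye_assembly --threads 64 --resume-from repeat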
Hi Mikhail, thanks, the update sounds worth waiting for. We have to close the machine down in the near term, so restarting will have to wait; hopefully your new release will be out by then. If it is still in beta when we move, I'll be happy to try it and give feedback.
We've now completed our system move and updated all machines so we're ready to start running our assembly again. At the moment 2.7.1 still appears to be the latest but I would be keen to try your beta of the new release if it is ready to go?
@shanesturrock I would try the latest code from the flye branch, as it contains many relevant updates. The code should be relatively stable, and I expect to make a release in the near future. You can use the --resume-from repeat option. Let me know how it goes.
Thanks Mikhail, I'll get that going today.
Hi Mikhail, I've restarted having built from source and changed the json file to pipeline 3. I also allocated 128 cores (double my last run) and so far it is moving along at a much better pace. The "Building repeat graph" step took 6 weeks on 64 cores last time; this time it took 2 days, so that's a substantial optimisation. The next step, computing gluepoints, took three days, but that only used a single core and I'm expecting it to take about the same time. Aligning reads to the graph took four months, so I'm hoping you've made similar optimisations to this step too. The code got stuck at resolving repeats last time - that's what it was doing, and had been for around a month, when we shut the machine down.
@shanesturrock thanks for the update - keep us posted!
One week in and it has progressed through 90% of aligning reads to the graph. This step started two days ago and looks likely to finish in the next day. With 2.6 it took four months on 64 cores. What I am noticing is that the number of repetitive k-mers filtered is substantially higher than in 2.6: that run filtered 1716743 repetitive k-mers (0.00105381), while this version filtered 2008094843 repetitive k-mers (0.288154). That's undoubtedly contributed greatly to the increase in speed, but it will be interesting to see how it affects the number of reads aligned, which I should know in a day or so.
It's now finished the aligning reads to the graph step which was very quick. I noticed that only about half as many reads aligned versus the last run but this may be intended as it certainly improves performance.
Stats for the old run:
[2020-06-06 10:46:41] INFO: Aligned read sequence: 269105407044 / 277711296371 (0.969011)
and for the new run:
[2020-07-20 14:24:12] INFO: Aligned read sequence: 146639839881 / 277711296371 (0.52803)
The resulting dot file from 2.6 is 14M in size and the new one is 1.7M. Not sure how this will affect the final assembly but at this point I consider an assembly better than no assembly and we could probably tweak the settings and try again once this completes. The current step shows Simplifying the graph and has been running for nearly a day on a single core.
@shanesturrock thanks for the update!
Happy to hear that it ran much faster this time, but the amount of unaligned sequence is certainly unintended. I haven't seen this on our datasets yet (but they are not super repetitive).
Wrt the k-mer filtering - I did change the filtering method slightly, but I also changed the way it's reported. Previously, Flye output the number of unique filtered k-mers, and now it outputs the number of filtered k-mer copies (or genomic positions). Since a few highly repetitive k-mers can contribute many more positions in the genome, the numbers are inflated.
Do you mind sending me the log file of the current run? If you happen to have the logs from the previous runs - this will be very helpful too.
Best, Mikhail
My data set is around 80% repeats so has always been a challenge for assembly. Here are the log files.
@shanesturrock thanks, but this is just terminal output? Could you send flye.log files from the output directories?
They're 274MB compressed into a zip so I can't attach them here. See if this works: https://we.tl/t-0RfLfVe9ID
@shanesturrock thank you!
It seems that the problem is not in repetitive k-mers being filtered, but rather in the error rate thresholds. The --pacbio-corr mode assumes 1-3% errors left in the reads, but based on the distributions from the logs the error rate varies from 1% to 10% (3.3% mean). Previous versions of Flye did not implement error rate cutoffs for aligning reads to the graph, hence the difference in the amount of aligned sequence.
MaSuRCA typically produces high quality reads, but I can see that it might be more challenging for highly repetitive genomes (fewer unique k-mers for initial mapping), and thus the quality of the reads deteriorates. Ideally, with these reads the --pacbio-raw mode should be used. It makes sense to wait until this run finishes though. So far it seems like the graph operations are quite slow - let's see if they will complete within a reasonable time...
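For completeness, a later re-run on the raw reads would be along these lines (the file name, output directory, genome size and thread count are illustrative placeholders, not a recommendation for this dataset):

flye --pacbio-raw raw_subreads.fasta --genome-size 25g \
     --out-dir flye_raw_assembly --threads 128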
OK, that makes sense. I've done some error correction using FMLRC with our Illumina data against the PacBio reads, and that doesn't seem to have error-corrected all that well either. The simplifying graph step is toiling, but I'm willing to let it continue. If I run with --pacbio-raw, I assume I'll need to start completely again? Also, I expect a more sensitive mode is going to take a performance hit? We're under pressure to get this finished, but the data just keeps fighting every step of the way.
If this is going to be the first assembly, it definitely makes sense to wait until the current run finishes. It might not be optimal in terms of contiguity, but it won't be "wrong" (e.g. contain major errors). Once you have a first iteration of the assembly, you can perhaps do another one with more optimized parameters.
Wrt the performance - it may or may not be slower with --pacbio-raw, as the new code includes some optimizations for the disjointig assembly step as well.
Wrt the current run - feel free to send me the updated log again if it's taking too long - I'll be happy to take a look at what's going on.
This will be the first completed with the PacBio data. We did an earlier assembly with just the Illumina reads using SOAPdenovo2 (the only program that could do it and not blow up due to the volume of reads) and that was highly fragmented (around 25 million contigs) and missing about 25% of the expected sequence. The RAM usage of the current simplifying-the-graph step is climbing past 2.3TB and still going. We have 6TB, though, so I'm hoping it will finish this step within our resources and move on.
One week into the "Simplifying the graph" step and it has hit 50% of our RAM (3TB out of 6TB). Is there any way to estimate progress and when it is likely to finish? It is only using a single CPU, but the memory usage is a concern.
@shanesturrock definitely unexpected performance - could you send me the latest log?
Last thing written to the log was on the 20th so I believe you've got everything already since I sent the logs on the 21st. Here's the tail end of the log:
[2020-07-20 20:18:03] DEBUG: Mult: 473524 115998 17 (1,0)
[2020-07-20 20:18:03] DEBUG: Starting -1822803 aln:21 minSpan:0
[2020-07-20 20:18:03] DEBUG: -764532 num:14 flank:11895 span:8
[2020-07-20 20:18:03] DEBUG: -764531 num:6 flank:2991 span:88
[2020-07-20 20:18:03] DEBUG: Mult: 1822803 144525 17 (1,0)
[2020-07-20 20:18:03] DEBUG: Starting 1791243 aln:20 minSpan:0
[2020-07-20 20:18:03] DEBUG: -1645018 num:7 flank:16455 span:9
[2020-07-20 20:18:03] DEBUG: -1645017 num:8 flank:9304 span:9
[2020-07-20 20:18:03] DEBUG: Mult: 1791243 148811 12 (0,1)
[2020-07-20 20:20:01] DEBUG: Writing Dot
and here's the tail end of the terminal output:
[2020-07-18 21:05:21] INFO: Aligning reads to the graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2020-07-20 14:24:12] INFO: Aligned read sequence: 146639839881 / 277711296371 (0.52803)
[2020-07-20 14:24:12] INFO: Median overlap divergence: 0.0335728
[2020-07-20 14:26:07] INFO: Mean edge coverage: 19
[2020-07-20 14:26:35] INFO: Simplifying the graph
I see - this seems strange. Is it still writing the dot file? Could you post the output of ls -l out_dir/20-repeat? Potentially, there might be an issue with the graph traversing function...
@shanesturrock and also do tail out_dir/20-repeat/graph_before_rr.gv?
The dot file was last written to on the 20th too:
$ ls -l
total 1656
-rw-rw-r-- 1 shane shane 1691716 Jul 20 20:20 graph_before_rr.gv
$ tail graph_before_rr.gv
"35464" [style = "filled", fillcolor = "grey"];
"35465" [style = "filled", fillcolor = "grey"];
"35466" [style = "filled", fillcolor = "grey"];
"35467" [style = "filled", fillcolor = "grey"];
"35468" [style = "filled", fillcolor = "grey"];
"35469" [style = "filled", fillcolor = "grey"];
"35470" [style = "filled", fillcolor = "grey"];
"35471" [style = "filled", fillcolor = "grey"];
"35472" [style = "filled", fillcolor = "grey"];
"35473" [style = "filled", fillcolor = "grey"];
RAM use is at 3.2TB now.
I see. Looks like it is indeed stuck in the function that generates the Graphviz output. Usually it takes seconds, but it seems it might be extremely inefficient for very large graphs.
I've created a version with graphviz output disabled - the code is in the iss267 branch. It might make sense to run it in parallel and see if it can catch up. The updated code should not use more than a few hundred Gb of RAM (theoretically).
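For anyone following along, one way to fetch and build that branch from source (assuming a standard make-based build of the Flye repository):

git clone https://github.com/fenderglass/Flye.git
cd Flye
git checkout iss267
make
# confirm the expected branch is checked out before launching
git rev-parse --abbrev-ref HEAD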
OK, I'll kill off the current job because it is getting to a massive amount of RAM now (pushing on for 4TB and growing at an alarming rate) and grab the modified branch so I can restart with that. Thanks for looking at this, I'll keep you posted on how it goes.
I don't know if this helps but the previous build finally died with the following:
[2020-07-30 15:39:58] ERROR: Caught unhandled exception: std::bad_alloc
[2020-07-30 15:39:58] ERROR: flye-modules(_Z16exceptionHandlerv+0xd0) [0x449700]
[2020-07-30 15:39:58] ERROR: /lib64/libstdc++.so.6(+0x5e746) [0x7fee0c04d746]
[2020-07-30 15:39:58] ERROR: /lib64/libstdc++.so.6(+0x5e773) [0x7fee0c04d773]
[2020-07-30 15:39:58] ERROR: /lib64/libstdc++.so.6(+0x5e993) [0x7fee0c04d993]
[2020-07-30 15:39:58] ERROR: /lib64/libstdc++.so.6(_Znwm+0x7d) [0x7fee0c04df2d]
[2020-07-30 15:39:58] ERROR: flye-modules(_ZNSt6vectorIP9GraphEdgeSaIS1_EE19_M_emplace_back_auxIJRKS1EEEvDpOT+0x34) [0x4a9b14]
[2020-07-30 15:39:58] ERROR: flye-modules(_ZN15OutputGenerator9outputDotERKSt6vectorI15UnbranchingPathSaIS1_EERKSs+0xd3a) [0x4f850a]
[2020-07-30 15:39:58] ERROR: flye-modules(_Z11repeat_mainiPPc+0x1533) [0x501f73]
[2020-07-30 15:39:58] ERROR: flye-modules(main+0x59) [0x443b69]
[2020-07-30 15:39:58] ERROR: /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fee0b50f555]
[2020-07-30 15:42:29] root: ERROR: Command '['flye-modules', 'repeat', '--disjointigs', '/raid/projects/active/nzgl01066/Flye-devel-assembly/Flye-2.6_Assembly-restart/flye_assembly/10-consensus/consensus.fasta', '--reads', 'mr.41.15.17.0.029.1.fa', '--out-dir', '/raid/projects/active/nzgl01066/Flye-devel-assembly/Flye-2.6_Assembly-restart/flye_assembly/20-repeat', '--config', '/raid/projects/active/nzgl01066/Flye-devel-assembly/Flye/flye/config/bin_cfg/asm_corrected_reads.cfg', '--log', '/raid/projects/active/nzgl01066/Flye-devel-assembly/Flye-2.6_Assembly-restart/flye_assembly/flye.log', '--threads', '128', '--min-ovlp', '2000', '--kmer', '17']' died with <Signals.SIGABRT: 6>.
[2020-07-30 15:42:29] root: ERROR: Pipeline aborted
I'm about to restart with the iss267 branch.
I made a mistake when checking out the special branch (I assumed the page I was looking at would give me a download link to it, but I actually just got master again), so I've lost a week. I've checked out the right version and we're now past the point where it broke last time, and it hasn't written the dot file. It's now reached [SIMPL] Split nodes and has been sitting there for three days running on a single core.
@shanesturrock I see, so there might be more bottlenecks remaining. If it's not progressing, could you send me / post the latest log?
It's currently using 1.4TB of RAM and 100% of a single CPU core. Here are the logs: https://we.tl/t-VOZsYSy8H1
Thanks again for this, very important that we get an assembly from this data and it is proving to be quite the challenge.
Shane
It's moved on so it isn't stuck. Here's the tail end of the log now:
[2020-08-21 06:32:54] DEBUG: Connection 1698173 -1689957 16 1
[2020-08-21 06:32:54] DEBUG: Connection -1426768 -1426764 32 1
[2020-08-21 06:32:54] DEBUG: Connection 754479 -1746757 8 1
[2020-08-21 06:32:54] DEBUG: Connection -754211 1812954 3 1
[2020-08-21 06:32:54] DEBUG: Connection 1118841 -646364 6 0.857143
[2020-08-21 06:32:54] DEBUG: Connection 551858 -1190347 9 0.947368
[2020-08-21 06:32:54] DEBUG: Connection -338 -337 10 1
[2020-08-21 06:38:05] DEBUG: [SIMPL] Resolved repeats: 216827
[2020-08-21 06:38:05] DEBUG: RR links: 4038352
[2020-08-21 06:38:05] DEBUG: Unresolved: 221
Still running single threaded, but still running so that's a plus!
Yay! It's a good sign, because now the pipeline will iterate a few simplification steps that you already passed - and every time they should be faster computationally (at least, in theory).
Is it still using a lot of RAM?
Based on the logs, the current major slowdown is in the tip clipping function - I am planning to look into this in the near future.
Cool. Still sitting on 1.4TB of RAM which isn't too bad for our machine given we have 6TB available. A few days on a single core isn't a disaster but more speed is still welcome!
It's taken over three weeks now and it has iterated 9 times but SIMPL is still running. At iteration 9 I got this though:
[2020-09-10 16:45:42] DEBUG: [SIMPL] Resolved repeats: 0
[2020-09-10 16:45:42] DEBUG: RR links: 6054
[2020-09-10 16:45:42] DEBUG: Unresolved: 15
[2020-09-10 18:13:51] DEBUG: Removed 0 simple and 0 double chimeric junctions
[2020-09-10 18:20:00] DEBUG: [SIMPL] Collapsed 23300 haplotypes
The haplotypes message is new this time so I'm wondering how much more it is likely to do before it finishes the 'Simplifying the graph' step and moves on to the next step?
@shanesturrock that's promising! It means that the simplification iterations are over. There are a couple extra steps left, but they should be much faster than each iteration. So - fingers crossed - the repeat stage should finish soon.
Another little step forward but it seems to be going slowly.
[2020-09-10 18:20:00] DEBUG: [SIMPL] Collapsed 23300 haplotypes
[2020-09-14 00:57:35] DEBUG: Connection 1520075 -978212 1
That was about 8 hours ago and nothing more since, so it took four days to write that line.
Super slow indeed, but getting closer...
Would you expect it to write many more of these? The previous iteration had this:
[2020-09-10 16:45:05] DEBUG: Total unique edges: 669748
[2020-09-10 16:45:42] DEBUG: Connection 839639 800811 1 0.285714
[2020-09-10 16:45:42] DEBUG: Connection -787855 825944 3 0.428571
[2020-09-10 16:45:42] DEBUG: Connection 1787334 799158 1 0.5
[2020-09-10 16:45:42] DEBUG: Connection 1015377 43307 1 0.5
[2020-09-10 16:45:42] DEBUG: Connection 697723 1220999 2 0.5
[2020-09-10 16:45:42] DEBUG: Connection -1306174 968825 3 0.2
[2020-09-10 16:45:42] DEBUG: Connection -1335750 1060515 2 0.444444
[2020-09-10 16:45:42] DEBUG: Connection -780788 -257524 2 0.444444
[2020-09-10 16:45:42] DEBUG: Connection 103393 1019098 3 0.5
[2020-09-10 16:45:42] DEBUG: Connection -989847 -485123 2 0.4
[2020-09-10 16:45:42] DEBUG: Connection 922825 1147223 3 0.5
[2020-09-10 16:45:42] DEBUG: Connection -989260 -611808 3 0.352941
[2020-09-10 16:45:42] DEBUG: Connection -1148035 -536241 2 0.285714
[2020-09-10 16:45:42] DEBUG: Connection -1050063 944480 1 0.5
[2020-09-10 16:45:42] DEBUG: Connection -461201 -392922 1 0.5
[2020-09-10 16:45:42] DEBUG: [SIMPL] Resolved repeats: 0
[2020-09-10 16:45:42] DEBUG: RR links: 6054
[2020-09-10 16:45:42] DEBUG: Unresolved: 15
[2020-09-10 18:13:51] DEBUG: Removed 0 simple and 0 double chimeric junctions
Hoping it isn't going to write that many connections taking four days each.....
No, these are two somewhat different repeat resolution algorithms, so the connections should be different. Not sure why it's taking so long though..
Huge genome with lots of repeats. I'll let it run and will update you as it does more.
Thanks! Do you mind uploading the latest log again? I'll try to fix the bottleneck in the graph simplification iterations in case you need to restart the stage (hopefully not though).
Here you go: https://wsi.li/dl/NjiTrnQmgHCXFna3G/
Two more:
[2020-09-15 19:10:02] DEBUG: Connection -834547 61354 77
[2020-09-15 19:10:02] DEBUG: Connection -1099575 1099576 7
Still running.
@shanesturrock thanks for sending the log files!
Looks like there is an unexpected slowdown in the tip clipping function. E.g., Flye took about a week to build the graph and align reads, but the rest of the ~3.5 weeks of graph simplification was spent inside the tip clipping procedure!
I've just made some optimizations, and I expect the graph simplification to proceed much faster for you now. E.g., the graph simplification should theoretically run within a day with this update (instead of weeks). I've put the updates into the iss267 branch that you are using, and also disabled the extra repeat resolution function that is currently causing trouble.
If you have the resources, I would recommend running the updated version in parallel. It should take about a week to reach the graph simplification phase, but then go much faster.
Thanks for doing this, really appreciated. I'm making a duplicate of the current working directory and I'll run this latest version in there. Plenty of CPU and RAM to spare so it shouldn't be a problem.
OK, so both runs have progressed. The updated version is working through the SIMPL steps again, but the previous version has written the contigs.fasta file, which is exciting. However, when I check using bbtools stats.sh, there only appears to be about 8Gbp of total contig length, down from 22Gbp prior to repeat resolution, and the genome is expected to be around 25Gbp. It looks like the assembly has collapsed due to the repeats. Given that the input data came from MaSuRCA, and we have already discussed running the job again using the raw PacBio reads, I wanted to get your advice on this; if helpful I can again provide the logs. I'll let the jobs continue for now.
Another update and a question. I've looked at the stats for the consensus.fasta file in 10-consensus and that's showing around 22Gbp, but graph_before_rr.fasta in 20-repeat is only 9Gbp and repeat_graph_edges.fasta is 22Gbp. In 30-contigger, contigs.fasta is 8Gbp and it has written graph_final.gv and scaffolds_links.txt, but it is now stuck writing the dot file (I assume this is the gfa, where it is somewhat expected to fail).
So I was hoping that the 9Gbp that went into repeat resolution and then into the contigs isn't all the sequence that will end up in the final assembly, and that the missing sequence will be pulled from the consensus. Also, if it is stuck at the gfa writing stage, can I kill it and restart from the polishing step? I'm somewhat stressed that the assembly appears to be lacking so much of the expected genome sequence after the consensus looked so promising.
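For reference, these size checks were presumably done with BBTools' stats.sh as mentioned earlier; roughly like this (the output directory name is illustrative):

stats.sh in=flye_assembly/10-consensus/consensus.fasta
stats.sh in=flye_assembly/20-repeat/graph_before_rr.fasta
stats.sh in=flye_assembly/30-contigger/contigs.fasta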
We're needing to restart the machine I've been running FLYE 2.6 on for a few months, and it is just finishing up the "Aligning reads to the graph" step. I'm also aware there's a bug with large repetitive genomes which is fixed in 2.7.1, so would I be able to stop and resume using the newer version to avoid the problem with the slow writing of the gfa file?