runtime estimation - Githubissues

dcopetti commented 5 years ago

Hi, I am running HG-CoLoR on a local machine, and it has now been running for about a week: is that normal?

HG-CoLoR --bestn 15 --kmcmem 90 --nproc 18 --longreads ../ONT_181026.fa --shortreads PE4702_200bp_interlaced.fq --out ONT_HC-CoLoR_PE470.fasta --tmpdir /data/dario/ONT_data/HC-CoLor_correction
[Thu Dec 20 15:47:17 CET 2018] Correcting the short reads
[Thu Dec 20 19:25:49 CET 2018] Removing short reads containing weak k-mers
[Thu Dec 20 23:41:06 CET 2018] Building the graph

It is using about 80 GB of RAM and 2 cores. The input files are 200 Gb (interlaced.fq) and 100 GB of ONT data. The genome is about 5.2 Gb, from a heterozygous plant. Is this long running time expected, or is something wrong and the computation stuck? Thanks

dcopetti commented 5 years ago

Update: After 15 days of running, it moved on to the short-read alignment:

[Thu Dec 20 23:41:06 CET 2018] Building the graph
/home/copettid/miniconda3/bin/HG-CoLoR: line 217:  9841 Killed                  PgSAgen_hgcolor $tmpdir/"$k-mers-$SR" $tmpdir/"$k-mers-$SR" >> HG-CoLoR.stdout 2>> HG-CoLoR.stderr
[Fri Jan 4 16:40:59 CET 2019] Aligning the short reads on the long reads

but the timing coincides closely with the start of another job on the same server that was using a lot of memory. Is this still OK, or might the graph-building step be incomplete? Thank you for any feedback!

dcopetti commented 5 years ago

Update #2: The run completed, but no output was written:

[Fri Jan 4 16:40:59 CET 2019] Aligning the short reads on the long reads
[Fri Jan 4 17:10:23 CET 2019] Removing short alignments
Traceback (most recent call last):
  File "/home/copettid/miniconda3/bin/filterOutShortAlignments.py", line 36, in <module>
    out.write(finalString)
NameError: name 'out' is not defined
[Fri Jan 4 17:10:23 CET 2019] Generating the corrected long reads
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --citation'.

[Fri Jan 4 17:10:23 CET 2019] Removing temporary files
[Fri Jan 4 17:10:28 CET 2019] Exiting

I wonder if the problem is due to the killed step above. Also, I will set the temp folder to a different location than the output folder; that should help :-)

morispi commented 5 years ago

Hi,

As I just mentioned in a previous issue, it is a known issue that building the SR graph with PgSA is a blocking step for experiments on large datasets. This is mainly because PgSA does not support parallel construction of the index. Replacing PgSA with a proper FM-index that allows parallel construction is on my TODO list, but as I am currently writing my thesis, I can't promise when it will be done.

Another known issue is that BLASR does not support reference files (in this case, the LR file) larger than 4 GB. You will thus have to split your LR file into multiple files of at most 4 GB each and run a separate HG-CoLoR instance on each of them. This will not affect the results, as each LR is processed independently. However, I know that processing the data this way is highly impractical, and looking into a better aligner is also on my TODO list.
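The splitting step could look roughly like the sketch below. It is only an illustration, not part of HG-CoLoR: the function name `split_fasta`, the `.partNNN.fa` naming scheme, and the default limit of 4 * 10**9 bytes are all assumptions (the exact BLASR threshold may differ, so a smaller limit is the safer choice). Whole records are kept together, so no long read is cut across two files.

```python
def split_fasta(in_path, out_prefix, max_bytes=4 * 10**9):
    """Split a FASTA file into parts of at most max_bytes each.

    Records are never split across parts; a single record larger than
    max_bytes ends up alone in its own (oversized) part.
    Returns the list of part paths that were written.
    """
    parts = []
    out = None
    written = 0  # bytes written to the current part so far

    def flush(record):
        nonlocal out, written
        if not record:
            return
        size = sum(len(line) for line in record)
        # Open a new part when the current one would overflow.
        if out is None or (written > 0 and written + size > max_bytes):
            if out is not None:
                out.close()
            path = f"{out_prefix}.part{len(parts) + 1:03d}.fa"
            out = open(path, "wb")
            parts.append(path)
            written = 0
        out.writelines(record)
        written += size

    record = []
    with open(in_path, "rb") as fin:
        for line in fin:
            # A header line starts a new record; emit the previous one first.
            if line.startswith(b">"):
                flush(record)
                record = []
            record.append(line)
    flush(record)
    if out is not None:
        out.close()
    return parts
```

Each resulting part would then be passed as `--longreads` to its own HG-CoLoR run, and the corrected outputs concatenated afterwards.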

Moreover, you seem to be using a fairly old version of HG-CoLoR (one that still uses GNU parallel). I do not maintain the conda versions myself, so I would recommend cloning and compiling the git repo directly if you can, in order to have the latest version.

Best, Pierre

morispi / HG-CoLoR

runtime estimation #10