huangs001 / AlignGraph2

Similar genome assisted reassembly pipeline for PacBio long reads

Program stuck running pagraph #3

Closed elcortegano closed 1 year ago

elcortegano commented 1 year ago

I'm running AlignGraph2 on a mouse genome (~2.7 Gb), with assembly and input files close to 1 GB in size and high read coverage (90 GB of data). I'm running the tool on two different clusters, and on both of them the last file modified is one of:

./working_dir/pagraph/X/DONE

with a different X in each run. The size of the working directory is also very different between the two. However, both runs have now been stuck for several days. The command top reveals that AlignGraph2 is running pagraph. This is what ps aux shows for the running process:

pagraph -t 64 -r dummy -k outputdir/working_dir/solid_kmer_set.bin -c outputdir/working_dir/input/s/7/ctg.fasta -R outputdir/working_dir/input/s/7/ref.fasta -p outputdir/working_dir/input/s/7 -a outputdir/working_dir/input/s/7/aln -o outputdir/working_dir/pagraph/7 -r 50 --epsilon 10 -v 2

For this run, the last lines in the log file are:

Loading read to ref from 7.ref.ref
Done! aln number=216971
Mem=14259748
Pre Process
Process Mem=44661556
[PositionProcessor] Running read to contig...
[======================================================================]100.00% [Mem=84421556KB]
    merge edge = 815603723
    total pos = 1100917157
    merge pos = 818809321

[======================================================================]100.00% [Mem=87108532KB]
    merge edge = 557587036
    total pos = 840140230

This has been the last line printed to the log for four days. Is it normal for pagraph / AlignGraph2 to take this long to run?

Could it be stuck? If so, is it safe to kill the process and restart it? Would it restart from the beginning, or does AlignGraph2 skip remaking files already present in the working directory and resume execution from where it left off?

huangs001 commented 1 year ago

AlignGraph2 skips any completed step whose result is already present in the working directory, so restarting the program will resume from the last completed point. The long processing time may be caused by an extremely large number of k-mer positions, which is a part of the program that still needs to be optimized. You can try running from the beginning with subsampled reads, using 2/3 or half of the total reads.
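For anyone wanting to try the read subsampling suggested above, here is a minimal sketch in Python. It is not part of AlignGraph2; it assumes the reads are in FASTA format, and the function names and file names are hypothetical. It streams the file and keeps each read independently with the given probability, so the kept fraction is approximate rather than exact.

```python
import random


def iter_fasta(handle):
    """Yield (header_line, sequence_lines) for each record in an open FASTA file."""
    header, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def subsample_fasta(in_path, out_path, fraction, seed=0):
    """Keep each read with probability `fraction`; return (kept, total)."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = total = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, lines in iter_fasta(src):
            total += 1
            if rng.random() < fraction:
                dst.write(header)
                dst.writelines(lines)
                kept += 1
    return kept, total
```

For example, `subsample_fasta("reads.fasta", "reads.half.fasta", 0.5)` would write roughly half of the reads to a new file (both file names are placeholders), which could then be passed to a fresh AlignGraph2 run.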

elcortegano commented 1 year ago

Thank you @huangs001, that was very helpful.

I now have some output files. I imagine that the extended genome file is final.fasta, but I can't find anywhere in the documentation what the other generated files are (e.g. remainder.fasta). What do these files refer to? I'm a bit confused, since the final.fasta file differs from my input only in that a few contigs have been removed, but no contig has been extended.

huangs001 commented 1 year ago

I'm sorry, I've been busy recently and didn't receive the issue notification. final.fasta is equivalent to merging add.fasta and remainder.fasta: add.fasta contains the extended contigs, and remainder.fasta contains the contigs that were not extended. Each extended contig is formed by joining several original contigs, and connect_info.txt shows how they were joined.
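The relationship between the three files can be checked with a small script. This is only a sketch, not an AlignGraph2 utility: it assumes that "merging" means final.fasta contains exactly the records of add.fasta plus those of remainder.fasta, and it compares sequences only, since header names and record order might differ. The function names are hypothetical.

```python
from collections import Counter


def read_fasta_sequences(path):
    """Return a Counter of sequences (uppercased, line breaks removed) in a FASTA file."""
    seqs, chunks, in_record = Counter(), [], False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if in_record:
                    seqs["".join(chunks).upper()] += 1
                chunks, in_record = [], True
            elif in_record and line:
                chunks.append(line)
    if in_record:
        seqs["".join(chunks).upper()] += 1
    return seqs


def final_matches_parts(final_path, add_path, remainder_path):
    """True if final holds exactly the sequences of add plus remainder."""
    parts = read_fasta_sequences(add_path) + read_fasta_sequences(remainder_path)
    return read_fasta_sequences(final_path) == parts
```

Calling `final_matches_parts("final.fasta", "add.fasta", "remainder.fasta")` in the output directory should return True if the assumption holds. An empty add.fasta would be consistent with a run in which no contig was extended.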

elcortegano commented 1 year ago

Thank you for the clarifications!