uio-cels / NucDiff

In-depth characterization and annotation of differences between two sets of DNA sequences
Mozilla Public License 2.0

Run time #31

Open tdlong opened 1 year ago

tdlong commented 1 year ago

I tried to run with two fly genomes, and after two days this software had not produced results. I am not sure how this program scales with genome size (or SNP density??). The multiple threads switch seems to not apply to the mummer calls.

It would perhaps be awesome if the nucmer/delta-filter steps could be done outside of the program, and then a simple python script processed the delta-filter output. I did not see how to easily modify the code to do this.

kseniakh commented 1 year ago

Hello tdlong,

Actually, it is possible to run NucDiff with an already generated delta-filter file. You can read here for more detailed instructions.

Anyway, it shouldn't take so much time. I would really advise you to run nucmer first. If the problem is not connected to nucmer, then you can look at NucDiff runtime messages for more information.

tdlong commented 1 year ago

Thank you for your rapid response!

I apologize, my post was not detailed enough. I was concerned this project wasn't active given the time stamps. I think tools like this are increasingly important as it is easier and easier to assemble genomes using PacBio HiFi (compared to even 2 years ago). I suspect people will wish to use these tools for larger and larger genomes.

I did run nucmer first (with my guess at the switches). Nucmer was very fast, a few hours:

nucmer --threads=16 --maxmatch --noextend --prefix=DELTA ref.fa query.scaffold.fa

Then I ran nucdiff (after copying the delta file to OUT), more or less following the directions in the link you suggested.

nucdiff --proc 16 --ref_name_full yes --query_name_full yes --delta_file OUT/DELTA.delta ref.fa query.scaffold.fa OUT DELTA

It ran for 48 hours on a fairly high performance node before I killed it. Maybe it would have finished in another hour, maybe it would have run forever. It is difficult to gauge progress.

The two genomes being compared are about 120 Mb apiece (Drosophila). The delta file is 57 million lines long. For many species ... and I wonder if this is the issue ... there are SNPs distinguishing two strains every 50-100 bp. My impression from trying to read the source code is that even with a delta file, the program is calling mummer to run delta-filter. For all I know the hang-up is here, since the python script does not leverage the multiple processors for mummer.
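To put a number on how much alignment data a delta file of this size holds, the file can be summarised before handing it to NucDiff. The following is only a minimal sketch and is not part of NucDiff: `summarize_delta` is a hypothetical helper name, and the code assumes the standard MUMmer delta layout (two header lines, then `>ref query ref_len query_len` records, each followed by 7-field alignment lines and single-integer indel offsets).

```python
from collections import Counter


def summarize_delta(path):
    """Count alignment records per (reference, query) pair in a delta file.

    Assumes the standard MUMmer delta layout; this helper is illustrative
    and is not taken from the NucDiff source.
    """
    counts = Counter()
    pair = None
    with open(path) as handle:
        next(handle, None)  # line 1: paths of the two input FASTA files
        next(handle, None)  # line 2: data type, e.g. NUCMER
        for line in handle:
            if line.startswith(">"):
                # Record header: >ref_id query_id ref_len query_len
                fields = line[1:].split()
                pair = (fields[0], fields[1])
            elif pair is not None and len(line.split()) == 7:
                # Alignment header: 4 coordinates plus 3 error counts.
                # Single-integer lines (indel offsets, 0 terminator) are skipped.
                counts[pair] += 1
    return counts


# Example use on the file from this thread:
#   counts = summarize_delta("OUT/DELTA.delta")
#   print(sum(counts.values()), "alignments;", counts.most_common(5))
```

A very large alignment count concentrated on a few sequence pairs would suggest the run time is driven by the alignment set itself rather than by genome size.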

Perhaps there is a verbose output mode I am missing. Nothing was written to my STDOUT.
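One way to be sure that no runtime message is missed is to send both stdout and stderr to a log file. The sketch below does that from Python; `run_logged` is a hypothetical helper name, and whether NucDiff prints anything useful there is an assumption, not something confirmed in this thread.

```python
import subprocess


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr written to log_path.

    Returns (exit_code, last_non_empty_log_line). Hypothetical helper,
    not part of NucDiff.
    """
    with open(log_path, "w") as log:
        # stderr is merged into stdout so both streams end up in one file
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    last = ""
    with open(log_path) as log:
        for line in log:
            if line.strip():
                last = line.rstrip("\n")
    return result.returncode, last


# Example with the command from this thread:
#   run_logged(["nucdiff", "--proc", "16", "--ref_name_full", "yes",
#               "--query_name_full", "yes", "--delta_file", "OUT/DELTA.delta",
#               "ref.fa", "query.scaffold.fa", "OUT", "DELTA"], "nucdiff.log")
```

The function only returns once the command exits, but the log file can be inspected from another terminal while a long job is still running.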

kseniakh commented 1 year ago

You can always check whether it is nucmer or NucDiff that is hanging by checking whether the nucmer files were generated. Without seeing the output messages I cannot tell you anything. Run the command directly, or redirect stdout to a file, and tell me what the last output message is. For testing purposes you may use one or two big scaffolds instead of the whole genome, or even some small genomes.