marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Verkko+Hi-C performance problems on complex graphs (datasets without ultralong ONT) #287

Open tbenavi1 opened 1 month ago

tbenavi1 commented 1 month ago

Hello,

I have a couple of samples where we ran verkko 2.1 assemblies with no problem, but with verkko 2.2 they have errors on the hicPhasing step. Specifically, when I check the output of 8-hicPipeline/hic_phasing.err, it says that the job was killed. So, I ran with --fhc-run 8 512 24 --shc-run 8 512 24 to increase the memory. After rerunning, the assembly still failed to complete, and I am having difficulty figuring out what went wrong. There is a 26GB scaffolding.log file which I can upload if you think it will be useful. Thanks for any suggestions.

Dmitry-Antipov commented 1 month ago

Hi, can you share verkko's console output? If it was not saved to a file, the two snakemake log files will do: .snakemake/log/*.snakemake.log and 8-hicPipeline/final_contigs/.snakemake/log/*.snakemake.log
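A minimal sketch for locating the newest log in each of the two places named above, assuming it is run from the assembly output directory. The function name newest_snakemake_log is hypothetical and not part of verkko or snakemake.

```shell
#!/usr/bin/env bash
# Print the most recently modified Snakemake log below a given directory.
# newest_snakemake_log is a hypothetical helper, not a verkko command.
newest_snakemake_log() {
    local dir="$1"
    # ls -t sorts by modification time, newest first. Errors are silenced
    # so that a missing directory simply produces no output.
    ls -t "$dir"/.snakemake/log/*.snakemake.log 2>/dev/null | head -n 1
}

# Top-level run log, then the log of the nested final_contigs workflow
# (the second prints nothing if that folder does not exist yet).
newest_snakemake_log .
newest_snakemake_log 8-hicPipeline/final_contigs
```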

tbenavi1 commented 1 month ago

Attached: 2024-09-13T054140.633180.snakemake.log. And there is no final_contigs folder in 8-hicPipeline.

Dmitry-Antipov commented 1 month ago

According to that log, it looks like scaffolding is still running (so the missing final_contigs folder is expected). If the corresponding job 12607296 has crashed or finished, can you send us scaffolding.log with the DEBUG lines excluded (grep -v DEBUG scaffolding.log)? It shouldn't be huge.

If it is still running, let's just wait.
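The filtering step can be sketched as follows, assuming scaffolding.log sits in the current directory. The output name scaffolding.nodebug.log and the helper name strip_debug are only illustrative choices.

```shell
#!/usr/bin/env bash
# Print a log file without its DEBUG lines.
# strip_debug is a hypothetical helper name.
strip_debug() {
    # -v inverts the match: keep every line that does not contain DEBUG.
    grep -v DEBUG "$1"
}

# Only run when the log is actually present in the current directory.
if [ -f scaffolding.log ]; then
    strip_debug scaffolding.log > scaffolding.nodebug.log
    # Show the size of the reduced file before uploading it.
    ls -lh scaffolding.nodebug.log
fi
```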

tbenavi1 commented 1 month ago

I see. It must have gotten canceled because it ran past the time limit. Here is the log with DEBUG lines excluded: scaffolding.nodebug.log

Dmitry-Antipov commented 1 month ago

It seems that your assembly graph has far more nodes than we are used to: "ScaffoldGraph - Total nodes 160544". In human-sized assemblies we usually see just a couple of thousand nodes. Which species are you assembling? And do you use ONT reads, or is it a HiFi + Hi-C assembly?
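As a rough way to check graph size before a long scaffolding run, one can look for the "Total nodes" line in the log and count segment records in a GFA file. This is a sketch under assumptions: gfa_segment_count is a hypothetical helper, and the number of segments in 8-hicPipeline/unitigs.hpc.noseq.gfa is not guaranteed to equal the node count that the scaffolder reports.

```shell
#!/usr/bin/env bash
# Count segment (node) records in a GFA file.
# gfa_segment_count is a hypothetical helper name.
gfa_segment_count() {
    # In GFA, segment records have "S" in the first tab-separated column.
    awk -F'\t' '$1 == "S" { n++ } END { print n + 0 }' "$1"
}

# Node count as reported by the scaffolder, if the log is present.
if [ -f scaffolding.log ]; then
    grep -m 1 'Total nodes' scaffolding.log || echo "no node count line found"
fi

# Segment count of the unitig graph, if it is present.
if [ -f 8-hicPipeline/unitigs.hpc.noseq.gfa ]; then
    gfa_segment_count 8-hicPipeline/unitigs.hpc.noseq.gfa
fi
```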

I'll either add some optimizations or add a "no scaffolding" option (in v2.1 we only had phasing with Hi-C, not scaffolding), but the complexity of the graph already suggests that you are unlikely to get a decent assembly.

For now you can run verkko with the phasing-generated paths, excluding scaffolding: ./bin/verkko --paths <previous assembly>/8-hicPipeline/prescaf_rukki.paths.gaf --assembly <previous assembly> -d <new output dir> --hifi <> --ont <>

tbenavi1 commented 1 month ago

That makes sense. For this sample, the data is quite poor. It is a human assembly. We have HiFi (30x) + Hi-C + ONT (only 7x, and the N50 read length is 7kbp).
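For reference, a read-length N50 like the 7 kbp figure above can be computed along these lines. This is a sketch, assuming an uncompressed FASTQ with exactly four lines per record; the helper names and the file name ont_reads.fastq are hypothetical.

```shell
#!/usr/bin/env bash
# Print one read length per line from a four-line-per-record FASTQ file.
fastq_lengths() {
    # The sequence is the second line of every four-line record.
    awk 'NR % 4 == 2 { print length($0) }' "$1"
}

# Read lengths on stdin (one per line) and print the N50: the length L such
# that reads of length >= L together cover at least half of all bases.
read_n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= half) { print len[i]; exit }
            }
        }'
}

# ont_reads.fastq is a placeholder file name.
if [ -f ont_reads.fastq ]; then
    fastq_lengths ont_reads.fastq | read_n50
fi
```

For example, lengths 10, 20, 30 and 40 sum to 100, and the two longest reads (40 + 30) already cover half of that, so the N50 is 30.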

Dmitry-Antipov commented 1 month ago

Well, once you have decent HiFi data, short ONT reads should be of little use. So although the recipe with phasing-generated paths that I suggested above should work, I do not expect it to produce a decent assembly. For now I'd actually try hifiasm first for the HiFi + Hi-C combination; we haven't spent much time optimizing verkko on such datasets.

But anyway, thank you for pointing out the performance problem!

tbenavi1 commented 1 month ago

Thank you!

Dmitry-Antipov commented 1 week ago

For the recent 2.2.1 release we've improved scaffolding performance on fragmented assembly graphs (HiFi + Hi-C or HiFi + Hi-C + short ONT, as in this case). However, I'd still not consider verkko a first-choice tool for assemblies without ONT data because of the lack of testing.

tbenavi1 commented 1 week ago

Thank you. I'll test it out and let you know how it goes.

tbenavi1 commented 1 week ago

Hello, with version 2.2.1, the assembly was finally able to finish with no problems. However, the assembly is much worse than when assembling the data with version 2.1. In particular, a lot more sequence is in assembly.unassigned.fasta than previously.
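One way to put a number on "a lot more sequence" is to total the bases in each output FASTA and compare the two runs. The sketch below assumes uncompressed FASTA files; fasta_bases is a hypothetical helper, and the two haplotype file names are assumed by analogy with assembly.unassigned.fasta rather than taken from this thread.

```shell
#!/usr/bin/env bash
# Print the total number of sequence characters in a FASTA file.
# fasta_bases is a hypothetical helper name.
fasta_bases() {
    # Header lines start with ">"; every other line is sequence.
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Report totals for whichever of these outputs exist in the current directory.
# The haplotype file names are assumptions.
for f in assembly.haplotype1.fasta assembly.haplotype2.fasta assembly.unassigned.fasta; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(fasta_bases "$f")"
    fi
done
```

Running the same loop in the output directories of both versions gives directly comparable totals.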

Dmitry-Antipov commented 1 week ago

Thank you for checking, that's not what I would expect.

Is the assembly also less contiguous (by N50-like metrics), or are more contigs simply unassigned? Can you share 8-hicPipeline/prescaf_rukki.paths.tsv, 8-hicPipeline/scaff_rukki.paths.tsv, 8-hicPipeline/hicverkko.colors.tsv, and 8-hicPipeline/unitigs.hpc.noseq.gfa from both assemblies? Those files should be relatively small.
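Collecting the four requested files into a single archive per run could look like the sketch below. The function name bundle_hic_debug_files and the archive name are hypothetical; the file names are the ones listed above.

```shell
#!/usr/bin/env bash
# Pack the requested 8-hicPipeline files of one verkko run into a tarball.
# bundle_hic_debug_files is a hypothetical helper name.
# Usage: bundle_hic_debug_files <run directory> <absolute output path>
bundle_hic_debug_files() {
    local run_dir="$1" out="$2" f
    local files=()
    for f in prescaf_rukki.paths.tsv scaff_rukki.paths.tsv \
             hicverkko.colors.tsv unitigs.hpc.noseq.gfa; do
        if [ -f "$run_dir/8-hicPipeline/$f" ]; then
            files+=("8-hicPipeline/$f")
        else
            # Report absent files instead of failing the whole bundle.
            echo "missing: $run_dir/8-hicPipeline/$f" >&2
        fi
    done
    # Nothing to archive: signal failure to the caller.
    [ "${#files[@]}" -gt 0 ] || return 1
    # -C makes the stored paths relative to the run directory.
    tar -czf "$out" -C "$run_dir" "${files[@]}"
}

# Bundle the run in the current directory, if it has a Hi-C pipeline folder.
if [ -d 8-hicPipeline ]; then
    bundle_hic_debug_files . "$PWD/verkko_hic_debug.tar.gz"
fi
```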

tbenavi1 commented 1 week ago

Hello,

The assembly is less contiguous. Here is the previous Nchart: [image: BLPt0001 verkko Nchart]

Here is the current Nchart: [image: BLPt0001 verkko Nchart]

I unfortunately don't have the files above for the old assembly (though I can regenerate them if that would be helpful). Our cluster deletes files from scratch every 30 days. I have uploaded the new versions of those files here https://drive.google.com/drive/folders/1KFKzCtvmdFreMHiN2rEJZp3Ze9is9KOO?usp=sharing

tbenavi1 commented 1 week ago

Oh, and I should also note that the Ncharts above are only from the haplotype1 and haplotype2 reads. Any other reads are not included.

Dmitry-Antipov commented 1 week ago

Is there anything remaining from the older run, such as assembly.homopolymer-compressed.noseq.gfa, assembly.colors.csv, or assembly.paths.tsv from the assembly output root directory? If not, rerunning would really help us compare the results and debug.

Anyway, it seems that the phasing results do not look good in 2.2.1. Can you also share 8-hicPipeline/hic_phasing.err, 8-hicPipeline/hicverkko.log, 8-hicPipeline/phasing.log, and 8-hicPipeline/hic.byread.compressed?

tbenavi1 commented 1 week ago

I don't have anything remaining from the older run. I'll kick it off with verkko 2.1 and send you the files when I have them. I added the files you requested from the current run.

tbenavi1 commented 1 day ago

Hello, I am running this sample with verkko 2.1. However, I received the following error in 8-hicPipeline/run_mashmap.err:

---Running MashMap
/tgen_labs/barthel/software/miniforge3/envs/verkko2.1/bin/mashmap: error while loading shared libraries: libblis.so.4: cannot open shared object file: No such file or directory

I installed verkko with

conda create -n verkko2.1 -c conda-forge -c bioconda -c defaults verkko=2.1

I was able to resolve the issue with

conda install conda-forge::blis

I just wanted to let you know in case it was something that needed to be fixed. I'll finish the assembly and upload the files soon.
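A quick check for this class of error is to ask the dynamic linker which libraries a binary cannot resolve. This is a sketch for Linux systems that provide ldd; missing_shared_libs is a hypothetical helper name, and it should be run inside the activated conda environment so that the right mashmap is on PATH.

```shell
#!/usr/bin/env bash
# Print the shared libraries that a binary needs but the linker cannot find.
# missing_shared_libs is a hypothetical helper name.
missing_shared_libs() {
    # ldd lists required libraries; unresolved ones are marked "not found".
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Check mashmap if it is on PATH in the current environment.
if command -v mashmap >/dev/null 2>&1; then
    missing_shared_libs "$(command -v mashmap)" || echo "no missing libraries reported"
fi
```

For the failure above, a line mentioning libblis.so.4 would be expected until the blis package is installed.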

tbenavi1 commented 1 day ago

The verkko 2.1 assembly finished, but the results look similar to the verkko 2.2 assembly and are not the same as my previous verkko 2.1 assembly. However, one difference is that in the latest assemblies I gave the HERRO-corrected ONT reads as hifi. So, to summarize:

Previous verkko 2.1 assembly = didn't give HERRO-corrected ONT reads as hifi reads, less sequence in unassigned.fasta.

Current verkko 2.1 assembly = gave HERRO-corrected ONT reads as hifi reads, more sequence in unassigned.fasta.

Current verkko 2.2 assembly = gave HERRO-corrected ONT reads as hifi reads, more sequence in unassigned.fasta.

Let me know if you want any more files from me. Thanks.