shilpagarg / DipAsm

MIT License

memory: core dumped at haplotype_scaffold #20

Closed lukemn closed 3 years ago

lukemn commented 3 years ago

Hi Shilpa,

I'm running hifiasm/pstools as in #16 on a ~100 Mb genome that is expected to be mostly haploid. I'm assuming this shouldn't be a major issue? I don't really trust the base-level results of short-read Hi-C assembly/scaffolding on HiFi tigs, and I'm hoping DipAsm will do a better job of it.

I get through with some minor (I assume) complaints (a few ERROR: key not in position table during hic_mapping_haplo, and various rm errors during resolve_haplotypes), but then a core dump during the haplotype_scaffold stage. There are 56 utgs for each of hap1 and hap2 in pred_haplotypes.fa, each ~250Mb. Any thoughts?

Here's that log:

start main
All above 5M: 13
All above 1.5M: 44
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 2.
Left edges: 376.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 4.
Left edges: 184.
Update best buddy score.
Update best buddy score.
Get potential connections 4.
Insert connections.
Save graphs and scores.
Nodes in graph: 5.
Left edges: 304.
Update best buddy score.
Update best buddy score.
Finish get first scaffolds.
free(): invalid pointer

shilpagarg commented 3 years ago

Thanks for pointing this out. Yes, I have seen the invalid pointer error in non-human assemblies. I am working on this and will provide an update soon.

shilpagarg commented 3 years ago

Please try https://pstools.s3.us-east-2.amazonaws.com/pstools_1.

lukemn commented 3 years ago

Works, thanks!

I get 242 Mb of hap1 and 32 Mb of hap2, and 62 Mb in broken_nodes. hap1 is much bigger than expected, so there may be some bacterial contamination in there. I have genetic map-based pseudochromosomes from other assemblies, so I'll go through these files to see what looks sensible.
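For anyone wanting to reproduce size totals like these, here is a minimal sketch that sums sequence lengths from FASTA-formatted lines. It is not part of pstools; the function names `fasta_lengths` and `total_mb` are hypothetical, and it assumes plain, uncompressed FASTA.

```python
from typing import Dict, Iterable


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_name: length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its bases to the current record.
            lengths[name] += len(line)
    return lengths


def total_mb(lengths: Dict[str, int]) -> float:
    """Total size in megabases (1 Mb = 1e6 bases)."""
    return sum(lengths.values()) / 1e6
```

An open file handle can be passed directly, for example `total_mb(fasta_lengths(open("hap1.fa")))`, where `hap1.fa` is a placeholder file name.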

Also, I guess you plan to get to this eventually, but can you say something about what pstools is doing relative to the previous Docker pipeline?

Is there good reason not to use the primary hifiasm contigs (or other assemblies), rather than the raw unitigs?

shilpagarg commented 3 years ago

Good to know. The pstools method is purely graph-based without any haplotype collapses and enables routine production of phased sequences. I will be happy to help further if you could send me an email. As I mentioned, I only tested for humans, but it will be interesting to see for other genomes.

Working on unitigs rather than contigs avoids any random cross-chromosome or long-range chromosome connections. Instead, Hi-C information can be used to disentangle such cases in the graph.

shilpagarg commented 3 years ago

Yes, I agree that it depends on the characteristics of the genome. Specifically, Hi-C is helpful for genomes with complex centromeres, such as the human genome. For small genomes with no centromeres, I understand HiFi would be good enough. Another aspect is cost-effectiveness. IMO there is no generalized method that is best for every genome.

zhoudreames commented 3 years ago

I ran my project again using pstools_1, but I got an erroneous result: the length of scaffold_0l_hap1 is ~1.5 Gb, longer than the largest chromosome (~300 Mb). Why is this?
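A quick way to list every scaffold that exceeds an expected maximum chromosome size is sketched below. It does not diagnose the cause; it only flags the affected sequences. The sketch assumes a `samtools faidx` index (`.fai`, tab-separated, with the name in column 1 and the length in column 2) has been generated for the scaffold FASTA. The function name `oversized_sequences` is hypothetical, and the 300 Mb threshold is taken from the figure quoted above.

```python
from typing import Iterable, List, Tuple


def oversized_sequences(
    fai_lines: Iterable[str], max_expected_bp: int
) -> List[Tuple[str, int]]:
    """Return (name, length) pairs for sequences longer than max_expected_bp.

    Expects lines in .fai format: name<TAB>length<TAB>offset<TAB>...
    """
    flagged: List[Tuple[str, int]] = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length > max_expected_bp:
            flagged.append((name, length))
    # Report the largest offenders first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged
```

For example, `oversized_sequences(open("scaffolds.fa.fai"), 300_000_000)` would list every scaffold above 300 Mb, where `scaffolds.fa.fai` is a placeholder file name.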