vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg autoindex: Correct input type not found while loading handlegraph::HandleGraph #3710

Open Flavia95 opened 2 years ago

Flavia95 commented 2 years ago

Hi, I ran the command below and got the error shown at the end of the log. How can I fix this?

vg1.40 autoindex -g mouse.parentals.fa.gz.9934a13.417fcdf.53439a3.smooth.final.gfa -t 20 -p mouse.parentals.fa.gz.9934a13.417fcdf.53439a3.smooth.final.autoindex
[IndexRegistry]: Checking for haplotype lines in GFA.           
[vg autoindex] Executing command: vg1.40 autoindex -g mouse.parentals.fa.gz.9934a13.417fcdf.53439a3.smooth.final.gfa -t 20 -p mouse.parentals.fa.gz.9934a13.417fcdf.53439a3.smooth.final.autoindex
[IndexRegistry]: Constructing VG graph from GFA input.            
[IndexRegistry]: Constructing XG graph from VG graph.    
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.                                                                                                                      
error[VPKG::load_one]: Correct input type not found while loading handlegraph::HandleGraph

Thank you, Flavia

jeizenga commented 2 years ago

Hmm. I'm not sure why that would happen. Are you able to share the GFA? You can email it to me at joeizeng@gmail.com.

jeizenga commented 2 years ago

I have a follow-up question. Do you know how much disk space and memory you had available when you ran into this problem?

Flavia95 commented 2 years ago

Yes, this is the situation.

Size  Used  Avail  Use
63G   4.2G  56G

jeizenga commented 2 years ago

I suspect the problem is that you ran out of temporary storage space on your disk while constructing the GCSA2 index. The final GCSA2 index is typically under 20 GB, but building it can use considerably more than that in temporary storage. I'm not sure why vg autoindex tried to proceed without a finished index, though.
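As a rough pre-flight check for the temporary-storage problem described above, here is a minimal Python sketch that reports free space on the filesystem where temporary files would be written. The helper names `free_gb` and `enough_temp_space` are hypothetical, and the threshold is an assumption: the thread only says the final index is typically under 20 GB and that construction needs more, so the right number for a given graph is not known in advance.

```python
import shutil
import tempfile


def free_gb(path):
    """Return free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def enough_temp_space(path, required_gb):
    """True if at least `required_gb` GB are free at `path`.

    `required_gb` is a user-chosen guess, not a value published by vg.
    """
    return free_gb(path) >= required_gb


# Assumption: temporary files go to the system temp directory.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_gb(tmp_dir):.1f} GB free")
```

Calling `enough_temp_space(tmp_dir, 200)` before starting a long indexing run would flag a disk like the 56 GB one reported above, although passing the check does not guarantee that indexing will fit.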

In this particular case, you may also have hit some limitations in the way we select pruning parameters (pruning is a simplification step that precedes GCSA2 indexing). When I ran the vg autoindex pipeline with the same inputs on a very large machine, autoindex eventually reached the software-defined 2 TB limit on temporary disk usage and aborted. Reaching this limit typically means that the graph was insufficiently pruned, and that we've run into GCSA2's worst-case exponential space usage.
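To give a feel for where exponential growth can come from, here is a toy Python sketch. It is not GCSA2's algorithm; it only counts distinct walks through a made-up chain of two-allele bubbles, which is the kind of combinatorial growth that pruning is meant to cut down. The graph layout and the function names `build_bubble_chain` and `count_paths` are assumptions for illustration.

```python
def build_bubble_chain(n_sites):
    """Toy graph: n_sites two-allele bubbles in a row, joined by anchor nodes.

    Returns an adjacency dict mapping each node to its successor nodes.
    """
    edges = {}
    for i in range(n_sites):
        anchor, ref, alt, nxt = f"a{i}", f"ref{i}", f"alt{i}", f"a{i + 1}"
        edges[anchor] = [ref, alt]  # branch: take either allele
        edges[ref] = [nxt]          # both alleles rejoin at the next anchor
        edges[alt] = [nxt]
    edges[f"a{n_sites}"] = []
    return edges


def count_paths(edges, start, end):
    """Count distinct walks from start to end in a DAG (memoized DFS)."""
    memo = {}

    def visit(node):
        if node == end:
            return 1
        if node not in memo:
            memo[node] = sum(visit(nxt) for nxt in edges[node])
        return memo[node]

    return visit(start)


for n in (5, 10, 20):
    graph = build_bubble_chain(n)
    print(n, "sites:", count_paths(graph, "a0", f"a{n}"), "walks")
```

Each extra bubble doubles the count, so 20 sites already give over a million walks. Real graphs are messier than this chain, but the sketch shows why a densely branching region that survives pruning can inflate the work a path-based index has to do.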

I do not currently have a good way to select the pruning parameters automatically, although I'm trying to start a discussion on how we might do so. In the meantime, I'm afraid you'll probably need to use the more laborious manual indexing pipeline.
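For orientation, the manual route can be sketched as a short sequence of vg commands driven from Python. This is a sketch under assumptions, not the maintainers' recommended recipe: the subcommands and flags (`vg convert -g ... -p`, `vg index -x`, `vg prune -k ... -e ...`, `vg index -g ... -b ...`) should be checked against `vg <subcommand> --help` for your vg version, the output file names are hypothetical, and the pruning values are placeholders to tune, not recommendations.

```python
import subprocess


def manual_index_steps(gfa, prefix, kmer_length=24, edge_max=3, tmp_dir=None):
    """Build (argv, stdout_path) pairs for a manual indexing run.

    stdout_path is None when the command writes its own output file.
    Flags are assumptions to verify against your vg version's --help.
    """
    vg_graph = f"{prefix}.vg"
    pruned = f"{prefix}.pruned.vg"
    gcsa_cmd = ["vg", "index", "-g", f"{prefix}.gcsa"]
    if tmp_dir is not None:
        gcsa_cmd += ["-b", tmp_dir]  # put temporary files on a large disk
    gcsa_cmd.append(pruned)
    return [
        (["vg", "convert", "-g", gfa, "-p"], vg_graph),             # GFA -> vg graph
        (["vg", "index", "-x", f"{prefix}.xg", vg_graph], None),    # XG index
        (["vg", "prune", "-k", str(kmer_length), "-e", str(edge_max), vg_graph], pruned),
        (gcsa_cmd, None),                                           # GCSA/LCP from pruned graph
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure (check=True)."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Calling `run_steps(manual_index_steps("graph.gfa", "graph", tmp_dir="/path/to/scratch"))` would execute the steps. Because every step uses `check=True`, a failed GCSA build raises an error immediately, and the run does not continue with a missing or partial index.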