Make Giraffe Fully Simple

adamnovak commented 3 years ago

@benedictpaten wants Giraffe to be "fully simple". This has two elements, as I understand it:

We should have standards and specs for input formats we want from HPRC, to run on
We should have a pipeline suitable for a BME230 (2nd year graduate course) student to run Giraffe on an input (extended) GFA, without them complaining that they need a giant-memory machine, or that it takes forever to go through all the connected components, or that they can't understand how to build the indexes from the tool help.

So far, we have identified three things we need to do to set this up:

Support some form or forms of GFA-with-haplotypes as a single-file input format for indexing for Giraffe. We should be able to go straight from this (blunt but otherwise unprocessed file, possibly with large nodes or string/0 node names) to the indexes Giraffe needs. We want to take this as input for a new tool-oriented indexing approach (maybe vg index --giraffe), with help that points to to that and doesn't overwhelm the user with 15 different possible index formats they could make. This would require a bit of workflow work to make it so the one command can decide the right number of connected components to build indexes for in memory at a time, without running out of memory or taking forever in serial.
Combine the GBWT and GBWTGraph into one file. This would reduce the number of files that Giraffe needs to three (GBWT/graph, distance index, minimizer index). @jltsiren thinks this will be straightforward.
Abstract away node chopping and node boundaries, with some kind of system and specified format for translating between GFA coordinates and chopped-graph, numbered-node vg coordinates. The new grad students and other new users keep complaining that vg has chopped up their graph and that the node IDs are "wrong". If we have a way to tell them the coordinate translation we are using (or to maybe even output mappings in the original GFA coordinates?), they will hopefully be mollified. Eventually, we might like a way for the HandleGraph API to let you ignore where node boundaries fall.

@ekg What do you think? We also probably would need vg index --map and vg index --mpmap, right? And as for GFA-with-haplotypes, do we have a consensus on what style or styles we should accept for paths with haplotype semantics?

jeizenga commented 3 years ago

In https://github.com/vgteam/GetBlunted Ryan and I output a text-based translation table that identifies each node with the original sequence(s) that contributed to them. The tables look something like this:

#blunted_seq_name   original_subseqs
s1  read1[0:4]+,read2[3:7]+
s2  read1[4:7]+,read2[0:3]+

I think something similar might be useful for managing the chopping, along with some helpful stderr output alerting the user that the node IDs have been changed, why they have been changed, and that the translations can be found in the table.

jltsiren commented 3 years ago

Do we want a general graph translation format that can handle alignments, different levels of detail, and so on? Or do we just need a way to state that segment s1 in the original GFA corresponds to node sequence 1, 2, 3 in our graph? For the latter, we could use GFA P-lines in a separate file:

P   s1  1+,2+,3+
P   s2  4+,5+

Or we could create a new line type to avoid confusion:

T   s1  1,2,3
T   s2  4,5

We need some changes to libvgio to combine GBWT and GBWTGraph into a single file. We want to be able to say something like:

vg::io::VPKG::save(gbwt_index, filename, vg::io::VPKG::OVERWRITE_FILE);
vg::io::VPKG::save(gbwt_graph, filename, vg::io::VPKG::APPEND_TO_FILE);

GBWT/GBWTGraph construction from any GFA-like format is straightforward once we have figured out what to do with segments and path names. The difficult part is parsing GBWT metadata from path names and tags. I assume we are going to need many options and presets for the metadata.

adamnovak commented 3 years ago

I think if we're using the VPKG encapsulation on the GBWT and GBWTGraph we can just open a stream and VPKG save both of them to the same stream. I was picturing having something in GBWTGraph that serialized and loaded the two together, automatically linking them up, but we could also do it via VPKG with manual link-up.

ASLeonard commented 2 years ago

This looks to be resolved now (potentially needing a comment from #3356 regarding the --request XG). It took a fair bit of issue trawling to find this, so might be useful to update the giraffe wiki if people end up needing to relate mapping node IDs to gfa node IDs. If I get this working in the end, I'd be happy to set up a pull request with the commands used.

adamnovak commented 2 years ago

@ASLeonard We'd be delighted to have you update the wiki or make a documentation PR with your findings.

We abandoned the VPKG GBWT/GBWTGraph idea, but now we have GBZ which represents that combined data type.

We do still need the XG for surject, which is hard to get autoindex to make, and could be easier. But GBZ just this past week can hold paths, so we will eventually want to build out support for that in vg.

briannadon commented 2 years ago

Just want to chime in since my issue #3356 was mentioned:

After changing jobs and coming back to VG and graphs after a several months and a few VG updates, I tried again to use vg autoindex with vg giraffe and some other things to investigate alignments to graphs generated from PGGB. It seems the same issue of the giraffe GAM alignments not agreeing with the original GFA graph due to the reference-vs-haplotype issue remains. This was verified with vg validate -a, showing the graph was valid but the GAM was not. Again, adding --request XG to the vg autoindex command indeed does still entirely fix the issue. Perhaps this could be added as a default behavior for vg autoindex -w giraffe?

Separately, I would like to know if I'm losing some kind of valuable information by doing it this way. I'm working with a system wherein I don't want there to be one authoritative reference, and want every haplotype treated equally, if that makes sense. If this is a purely academic distinction I suppose it doesn't matter.

By the way, giraffe's runtime with the GBZ instead of separate GBWT and gg files is incredibly fast now, so kudos for that.

adamnovak commented 2 years ago

We're working on a way to store the original GFA's node names and boundaries alongside the re-chopped vg node IDs, so soon you will be able to get GAM or GAF output in the original GFA's coordinates.

Right now, even if you make the XG and use that for downstream analysis of the alignments, you still lose the GFA's coordinate space, so you either have to always do your downstream work in terms of vg's assigned node IDs, or use embedded paths (which could be an authoritative reference, but could also be any GFA P lines) to translate your results into coordinates someone else can understand.

briannadon commented 2 years ago

Can you expand a bit on what exactly "embedded paths" are? I've seen it mentioned and I'm not sure what that is. Sorry if that's available somewhere in the wiki, couldn't find it.

Also, what is the specific difference between the graph structure output by vg autoindex [...] --request XG and just chopping the graph yourself with vg mod -X 1000, besides lacking paths? Is there any reason I couldn't just chop the graph myself and the nodes would just match up?

jltsiren commented 2 years ago

There are two kinds of paths in VG: embedded paths (or just paths) that are stored in the graph itself, and lightweight paths (threads) stored in a GBWT index. Embedded paths are typically used for representing reference sequences and variants, while threads are better suited for storing a large collection of haplotypes.

When vg autoindex builds GBWT from GFA, it uses the chopping algorithm from GBWTGraph. That algorithm processes the segments in the order they appear in the GFA file, chops them into <= 1024 bp nodes, and assigns node ids starting from 1.

vg mod uses the chopping algorithm from libhandlegraph. It tries to process the nodes in the order defined by their original ids, and it tries to assign the chopped nodes new ids starting from 1, but neither is guaranteed. The exact behavior depends on graph implementation.

vgteam / vg

Make Giraffe Fully Simple #3126