vgteam / libhandlegraph

Library for the Handle Graph abstraction
MIT License
20 stars 3 forks source link

Haplotype interface #75

Closed glennhickey closed 2 years ago

glennhickey commented 2 years ago

I'd like to add an handlegraph interface to get at the GBWTmetadata, like sample and haplotype. This would be useful for reading gbz/gbwt, but also for going back and forth between other formats. It would look something like

virtual std::string get_sample_name(const path_handle_t& path_handle) = 0;

virtual int64_t get_haplotype(const path_handle_t& path_handle) = 0;

/// I think this could be an alias to get_path_name
virtual std::string get_contig_name(const path_handle_t& path_handle) = 0;

/// I'd also like to add subpath information while we're at it
virtual std::pair<int64_t, int64_t> get_path_offsets(const path_handle_t& path_handle) = 0;

For GBWT, this information can all be extracted from the metadata. For any other type of graph, we could just use the naming convention from PGGB

I think we could ensure pretty decent backwards compatibility. A path named "chr1" would return "chr1" as the name/contig, 0 as the haplotype, an empty string (or default value) for sample, and -1,-1 a the range.

I think it'd be easiest to add this to the existing path handle graph interface, but it could always be a new interface as well. If this were implemented, it becomes trivial to convert back and forth from gbz and other formats, and avoids a lot of the hacky nonsense like I put in deconstruct to support both haplotype from metadata as well as from naming conventions.

Also hope this will allow graphs and tools to stay interoperable even in light of the whole gfa w-line fiasco!

Curious to hear your thoughts @jltsiren @ekg @adamnovak @jeizenga

jltsiren commented 2 years ago

This is in conflict with how I specified reference paths in GBZ and how we made GBWTGraph a PathHandleGraph.

GBZ stores reference paths as haplotypes with sample name _gbwt_ref and the path name as a contig name. A path_handle_t is an offset in the cache that stores some information about reference paths. When we access the same path as a haplotype using a GBWT path identifier, its name becomes { "_gbwt_ref", "contig", 0, 0 }, where the contig name often encodes both reference name and contig name.

get_path_offsets cannot be supported using GBWT metadata. The final field in a path name is often used as a fragment identifier that makes otherwise identical path names unique. In such situations, the field gets consecutive values starting from 0 for each underlying path. Sometimes the field is used for storing arbitrary integers parsed from GFA path names. Sometimes it stores the starting offset from a GFA W-line. Even in that case, the end cannot be determined except by extracting the entire path from the GBWT and querying the length of each node in the graph, which takes a few seconds with long paths.

I notice that GFA W-lines have changed since I last looked at them. My code assumes that the sequence start position is a mandatory field, and the GFA it outputs claims to be version 1.0. It's too late to change any of that before I submit the GBZ paper, unless I delay the submission to mid-April or beyond.

glennhickey commented 2 years ago

Thanks for the feedback. I was under the impression I could get a path_handle_t for any haplotype in the GBZ. But if the path interface is only available for a subset of reference paths (that's what you're saying, right?), then this won't really work as transparently as I'd hoped.

jltsiren commented 2 years ago

GBWTGraph as a PathHandleGraph is based on the model, where the XG index stores paths and the GBWT index stores threads, or where reference sequences are stored as P-lines and other sequences as W-lines. Now the paths are also accessible from the GBWTGraph.

Some parts of the PathHandleGraph implementation are inefficient, because they require operations the GBWT does not support well. Some other parts are only efficient because the implementation caches the information when the graph is built or loaded. Caching it for all paths in the GBWT would be very slow.

glennhickey commented 2 years ago

OK, makes sense, thanks. Not sure where I got the idea that all paths were accessible.

I guess an interface like this, or some other centralized api for applying the naming conventions to reference paths, could still be useful for going back and forth between haplotypes and reference paths. Perhaps this already exists somehwere? Ex: I want to promote a haplotype thread (maybe extracted with vg paths) to a reference path for the sake of modifying something, then write it back as a thread (ie going back and forth between W and P lines)

adamnovak commented 2 years ago

It would be useful to have some way to store things in e.g. PackedGraph that are semantically GBWT threads or GFA walks, because Glenn needs to do this when he is building the HPRC graphs.

I also noticed @jeizenga and @jonassibbesen (ab)using the GBWT to store genes and haplotype-specific transcripts, without a really strong way of remembering that that's what's going on. And also we've never elevated our subpath bounds to a first-class part of the API.

So here's a sketch of what we would maybe want the most general metadata API to look like:

/**
 * Metadata features for paths and threads.
 *
 * Provides default implementations for everything on top of string names, and
 * using path handles as the handles.
 *
 * Ought to be inherited by (or rolled into?) PathHandleGraph.
 *
 * Our model is that a path-or-thread has some of these things associated with
 * it:
 *
 * - A "sense", which describes how it is meant to be interpreted. Is this
 *   meant to be a positional path defining an assembly coordinate space, or a
 *   description of a phased inferred genome for a sample, or a path describing
 *   a transcript, or something else?
 *
 * - Sample or assembly name.
 *
 * - Contig or gene name ("locus") that the path-or-thread either represents in
 *   its assembly or is an allele of in its sample.
 *
 * - Haplotype (0 or 1 for diploid) number within that sample.
 *
 * - Phase block number within a haplotype. Must be unique within a smaple,
 *   locus, and haplotype.
 *
 * - Bounds, for when a path-or-thread as stored gives only a sub-range of a
 *   conceptually longer path-or-thread. Multiple items can be stored with
 *   identical metadata in the other fields if their bounds are
 *   non-overlapping.
 */
class PathMetadataGraph {

public:

    /// These senses can be used as bit flags for querying collections of paths-or-threads.
    /// Each item always has just one sense.
    enum Sense {
        SENSE_NONE = 0, // Nothing can actually have this sense.
        SENSE_UNSPECIFIED = 1,
        SENSE_REFERENCE = 2,
        SENSE_HAPLOTYPE = 4,
        SENSE_TRANSCRIPT_REFERENCE = 8,
        SENSE_TRANSCRIPT_HAPLOTYPE = 16,
        // We could move from our name format for detecting these to using a sense, locus (hash), and haplotype (alt number)
        SENSE_ALT_ALLELE = 32,
        // More to be defined later
        SENSE_ANY = 255
    };

    using path_or_thread_handle_t = path_handle_t;

    /// What is the given path or thread meant to be representing?
    virtual Sense get_sense(const path_or_thread_handle_t& handle) const;

    /// Get the name of the sample or assembly asociated with the
    /// path-or-thread, or NO_SAMPLE_NAME if it does not belong to one.
    virtual std::string get_sample_name(const path_or_thread_handle_t& handle) const;
    static const std::string NO_SAMPLE_NAME = "";

    /// Get the name of the contig or gene asociated with the path-or-thread,
    /// or NO_LOCUS_NAME if it does not belong to one.
    virtual std::string get_locus_name(const path_or_thread_handle_t& handle) const;
    static const std::string NO_LOCUS_NAME = "";

    /// Get the haplotype number (0 or 1, for diploid) of the path-or-thread,
    /// or NO_HAPLOTYPE if it does not belong to one.
    virtual int64_t get_haplotype(const path_or_thread_handle_t& handle) const;
    static const int64_t NO_HAPLOTYPE = -1;

    /// Get the phase block number (contiguously phased region of a sample,
    /// contig, and haplotype) of the path-or-thread, or NO_PHASE_BLOCK if it
    /// does not belong to one.
    virtual int64_t get_phase_block(const path_or_thread_handle_t& handle) const;
    static const int64_t NO_PHASE_BLOCK = -1;

    /// Get the bounds of the path-or-thread that are actually represented
    /// here. Should be NO_SUBRANGE if the entirety is represented here, and
    /// 0-based inclusive start and exclusive end positions of the stored 
    /// region on the full path-or-thread if a subregion is stored.
    virtual std::pair<int64_t, int64_t> get_subrange(const path_or_thread_handle_t& handle) const;
    static const std::pair<int64_t, int64_t> NO_SUBRANGE{-1, -1};

    /// Loop through all the paths or threads with any of the senses ORed together in senses.
    template<typename Iteratee>
    bool for_each_path_or_thread(const Sense& senses, const Iteratee& iteratee) const;

    // TODO: Put this in a mutable interface

    /// Add a path or thread with the given metadata. Any item can be the
    /// corresponding unset sentinel (NO_LOCUS_NAME, NO_PHASE_BLOCK, etc.).
    ///
    /// Implementations may refuse to store paths-or-threads of certain senses
    /// when relevant fields are unset.
    virtual path_or_thread_handle_t create_path_or_thread(const Sense& sense,
                                                          const std::string& sample,
                                                          const std::string& locus,
                                                          const int64_t& haplotype,
                                                          const int64_t& phase_block,
                                                          const std::pair<int64_t, int64_t>& subrange);

};

I want to make path_or_thread_handle_t just be path_handle_t, and all the xxx_path_or_thread_xxx methods just xxx_path_xxx, if we can let GBWTGraph issue path handles for all its haplotypes, and tolerate step operations sometimes being slow on SENSE_HAPLOTYPE paths.

@jeizenga and @jonassibbesen, would it make sense to tag transcripts and their haplotypes as such? And @jltsiren would it be feasible to hide that information somewhere in the GBWT? Maybe by naming or marking some contigs as genes? (We might save building this part out for later; we can always add more SENSE_* values.)

This is a little more data than wants to be put into the PGGB naming convention, because we want to be able to store the senses and ranges, and the phase block numbers that the GBWT uses. The copy-into-GBWT and copy-into-GFA code might get a bit fiddly, since neither can store all the senses, and GFA can't really store arbitrary phase block numbers.

I think we do need the sense information, because we need to be able to e.g. do a positional overlay on a subset of the paths in the graph, and the default right subset is going to be the ones with SENSE_REFERENCE. (If we do change the GBWTGraph to have some notion of multiple references, we'd want to set sample and locus/contig to e.g. "GRCh38" and "chr1", and use something other than the magic sample name to denote the reference sense.)

We could implement this as a slightly extended mini-format over the existing path names (compatible with whatever @glennhickey is doing for subranges right now), and push down special-purpose storage of the metadata into the graph implementations when convenient for better compression.

adamnovak commented 2 years ago

For a name mini-format, I think we can do <sample>#<phase>[.<phase block if not 0>]#<locus>[\[<range start>:<range end>\]]. We might also want to tolerate just <locus>, <sample>#<locus> or <sample>.<locus> (while somehow not keying on GL<whatever>.<version>), along with ranges on those, because we have existing graphs that use those name structures.

For determining/storing sense in names, we can maybe use the format that stores phase for haplotype sense, and one of the formats that doesn't for reference sense by default, and then when we do get a reference sense path with a phase we tag the name somehow to denote sense. Maybe tack on another #-delimited field?

The name mini-format should really be internal to the graph implementation (with a default implementation). Implementations like the GBWTGraph would not use it at all but instead would answer metadata questions from their own structured metadata. Copying paths with metadata intact would need to use the metadata API, not just copy the path names. Serializing to GFA would use GFA-specific conventions for representing things, and we might have to lose e.g. haplotype numbers assigned to reference-sense paths, or do something about fragmented phase blocks which W lines can't quite spell.

jltsiren commented 2 years ago

The way I see it, there are now two competing data models:

I can support both in GBZ, even in the same graph. If we want to add a third model, we should define how the new model should be represented in GFA and how do we deal with GFA files where some paths are in this model and some paths are not.

SENSE_UNSPECIFIED and SENSE_HAPLOTYPE can be used cleanly, because they match the semantics of P-lines and W-lines. SENSE_REFERENCE could be added if a convention/standard emerges for specifying reference paths as P-lines and/or W-lines. (It looks like the latter is going to be the preferred approach in the future.) I don't like the idea of hardcoding other senses in the interface if the underlying data does not support them.

Then there is the issue of sensible default behavior. Assume that the user has a 1000GP graph with 5000+ haplotypes in GBZ format. If they use that graph somewhere that expects a PathHandleGraph, should it expose no paths, reference paths only, or all paths? If they convert the GBZ graph to HashGraph, should the new graph contain no paths, reference paths only, or all paths? And if they then use the HashGraph as a PathHandleGraph, should it expose the same paths as the GBZ graph?

Also, W-line specification says that haplotype index is 0 for haploid samples and starts from 1 for other samples.

glennhickey commented 2 years ago

@adamnovak I like this a lot. I'm also very happy to replace the current python-style subranges I've hacked into vg to just use #'s if it's more consistent.

@jltsiren I think we can merge the two models for most intents and purposes by using defaults that are treated consistently with the current behaviour (though it would be nice to get path_handle_t's for every haplotype one day):

Going from W-line/GBWT -> P-line/path_handle_t: SAMPLE#HAPLOTYPE#CONTIG#OFFSET(straightforward, just cat all the fields with # separator)

Going the other way, from P->W, we just need to come up with some vg-convention defaults, ex: chr1 -> . -1 chr1 * * (we avoid serializing this to gfa, these defaults are not part of the spec but vg specific) GRCh38#0#chr1 -> GRCh38 0 chr1 * * GRCh38#chr1 -> . -1 GRCh38#chr1 * * (only do parse with 2 or 3 #'s)

In mutable graphs, this lets us treat everything the same. In GBZ, by default, we only give path_handle_t's if the haplotype index is >= 0.

To serialize the defaults to valid GFA, we can't have a -1 haplotype so I think we need a vg convention (maybe 255 for the haplotype index? or an optional tag flagging W-lines as a reference). I also like the idea of letting the user override this via command line option where possible.

jltsiren commented 2 years ago

I would avoid using negative values. GBWT path name components are stored as uint32_t, but the interface uses uint64_t, which is the default integer type in SDSL. If you set a component to -1, the value you get if you read it later may not be -1.

glennhickey commented 2 years ago

OK, makes sense.

This seems like the main sticking point (to me at least): How to flag a subset of W-lines as reference paths when going from GFA -> GBWT/GBZ? I agree that special values in the existing fields are hacky.

As a user, I'd be happy with just specifying it via command line option upon GBWT creation, but it'd be great to have a way of flagging them in the file. I think we're left with just optional tags for that?

adamnovak commented 2 years ago

We could find/define some tags to mark W lines as wanting to be interpreted as references (or P lines as wanting to be interpreted as haplotypes?). Or we could make the user specify sample names that are really references, maybe with a default list if we want "GRCh38" to Just Work out of the box.

Maybe we should start with a SENSE_HAPLOTYPE (which uses the GBWT/GFA W-line model), a SENSE_UNSPECIFIED (which we by default read and write as P lines), and maybe a SENSE_REFERENCE which we just assign to anything that isn't a haplotype but that has a 2-part name.

I don't think we need to work out how to spell an unspecified-semantics path or a reference contig as a W line, and if we decide to parse a W line to a reference contig we can throw out everything but sample and contig.

Instead of letting the user fill in all the different kinds of fields for any path, we can say that SENSE_UNSPECIFIED has exactly locus, SENSE_HAPLOTYPE has exactly sample, locus, haplotype, and phase block, and SENSE_REFERENCE has exactly sample and locus. And then anything can have bounds, or, as Glenn suggested, one bound.

Then we never really have to store the no-value sentinels, so we don't need to worry about signed/unsigned storage.

I really want these all to be path_handle_ts. We'd have to adjust our graph copying to do the right thing to do with the GBZ haplotypes; sometimes you're going to want the graph with the reference paths, and sometimes you really are going to want to dump one or a few of the haplotypes to another graph format.

jltsiren commented 2 years ago

When we are dealing with W-lines, the fourth component of a GBWT path name is the start of the subrange. In other GBWTGraphs, it may also be a phase block / fragment index. The second interpretation is inconvenient, because we can't convert such paths to valid W-lines. We can't convert phase block indexes to subrange starts automatically, because there may be overlaps and gaps between the fragments.

I don't like making Sense an enum, because it's really a bit mask and a path may have multiple senses. Either make it a bit mask or a proper type.

It would make sense to specify that get_subrange may return (start, -1) if the end is not stored explicitly. The caller can then compute it manually if they need it.

As for the name components, there are three cases:

I could use a reference tag in the GBWTGraph to specify the name of the sample used as the preferred reference, with _gbwt_ref as the default if the tag is missing. The reference would then consist of either all named paths or a clean subset of haplotype paths. We just need a way of storing this information in a GFA file. Maybe we could finally start using GFA headers for something?

Do we want to expose all GBWT paths through the PathHandleGraph interface, particularly for_each_path_handle and for_each_step_on_handle? It seems to me that many applications using a PathHandleGraph expect seeing only reference paths. A huge number of overlapping paths could confuse them and/or make them unacceptably slow. I kind of like the idea that some path handles are only accessible using for_each_path_or_thread, and maybe a PathMetadataGraph version of for_each_step_on_handle.

adamnovak commented 2 years ago

I had the sense being bit-mask-able so you could OR it for queries, to ask for multiple categories at once. I don't think we want a single path to be in multiple categories, so maybe it wants to be a normal enum so that that's not possible to ask for.

I think Glenn wants the GBWT to expose all the threads in for_each_path_handle and for_each_step_on_handle. We'd have to go to each call site and work out whether we really want all the paths, or just those of a particular sense. It looks like we have something like 200 calls to each of these in our codebase, which might be impractical to address all at once. So maybe we do need to hide the haplotype-sense paths by default from the existing interface, even though it's awkward.

I'm not sure a single slot for a custom reference name in the GBWT will quite do what we want. What would we do for the CHM13-based HPRC graphs, which have CHM13 and GRCh38 reference paths?

It is annoying that we can't really convert our VCF-based graph haplotypes to proper W lines, because the W line pieces need to be in a scaffold and not overlap. Maybe we take the reference gap approach and just arbitrarily slap 10kb gaps between the pieces, even when we think they overlap?

jltsiren commented 2 years ago

I was thinking about references as default values. If an application requires a reference and the user does not provide one, the one specified in the graph will be used. That won't work if there are multiple references. But because the user may choose to use another reference, any (sample, haplotype) should have the same properties as the designated reference. From that perspective, named paths and haplotype paths are data types, while reference paths are created using annotations that don't change the underlying type.

Similarly, if the name of a path can be parsed in a specific way, this should be marked with annotations that don't change the data type. The GFA file is the ground truth as long as we accept graphs from external tools, and all paths must be ultimately written as P-lines or W-lines.

I think the typical overlaps in VCF-based haplotypes are tens of bp, while the typical gap is 1 bp. I don't know what would be the least confusing way to handle the situation.

glennhickey commented 2 years ago

I think Glenn wants the GBWT to expose all the threads in for_each_path_handle and for_each_step_on_handle

While this would be nice, I understand it's not practical and don't want to get too sidetracked on it.