iqbal-lab-org / jVCF-spec


Graph restrictions #3

Open bricoletc opened 4 years ago

bricoletc commented 4 years ago

I'm trying to lay out the restrictions that jVCF imposes on the graph. I'll list the criteria and what may happen if they are not met.

Possibilities:

On that basis I'd say jVCF can represent variation on directed, acyclic graphs.

bricoletc commented 4 years ago

I think the gramtools paper's definition of a site requires a DAG:

> is $\mathrm{outdegree}(v)$ for a node $v \in V$, which is the number of edges $(v, w) \in E$ where
> $w \in V$. The second is an ordering of nodes in paths: for all paths $(v_i, \ldots, v_j, \ldots, v_n)$
> in $G$ and $j \in [i, n)$, node $v_j \le v_{j+1}$. We define a site as a set of nodes $s(v_1, v_2) =
> \{v \in V \mid v_1 \le v \le v_2\}$ for which $\mathrm{outdegree}(v_1) > 1$ and $v_2$ is the first node where all
> paths from $v_1$ end.
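
To make the quoted definition concrete, here is a minimal Python sketch, not taken from gramtools. It assumes that nodes are integers numbered in a topological order (so every edge goes from a smaller to a larger number), that the graph has a single sink, and that `find_sites` and the toy graph are hypothetical names and data made up for illustration. The closing node $v_2$ is computed as the first node that lies on every path leaving $v_1$.

```python
def find_sites(succ):
    """Return {(v1, v2): [nodes]} for every site in a DAG.

    Assumptions (not from the paper): `succ` maps every node to the list of
    its successors, node numbers follow a topological order, and the graph
    has a single sink.
    """
    # on_all_paths[v]: nodes that every path starting at v must go through.
    on_all_paths = {}
    for v in sorted(succ, reverse=True):  # successors are processed first
        if succ[v]:
            common = set.intersection(*(on_all_paths[w] for w in succ[v]))
        else:
            common = set()
        on_all_paths[v] = {v} | common

    sites = {}
    for v1 in sorted(succ):
        if len(succ[v1]) > 1:  # outdegree(v1) > 1 opens a site
            # First node where all paths from v1 meet again.
            v2 = min(on_all_paths[v1] - {v1})
            # Literal reading of the definition: all v with v1 <= v <= v2.
            sites[(v1, v2)] = [v for v in sorted(succ) if v1 <= v <= v2]
    return sites


# Toy graph: an outer site (0, 6) with a nested site (1, 4) on one branch.
toy = {0: [1, 5], 1: [2, 3], 2: [4], 3: [4], 4: [6], 5: [6], 6: [7], 7: []}
sites = find_sites(toy)
print(sites)
```

On a graph with a cycle there is no such node ordering, which is why this definition only works on a DAG.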
bricoletc commented 4 years ago

On the other hand, we could say that jVCF itself doesn't care what the graph is; it could be bidirected and have cycles; all that matters is that a set of variant sites is described, and the relationship between them is recorded.

leoisl commented 4 years ago

PS: after writing this, I think all of this is mainly what you wrote in your last comment, but will still leave it here...

Hmm, that is interesting... I am starting to think that jVCF doesn't even care whether it represents a graph. It might represent a data structure that is more general than a graph. For example, all we need to describe each site is mostly the alleles in the site, genotyping, haplogroup, etc:

image

And child map just represents an element inclusion property: image

A PRG is much more specific than this. It represents all the genomic allele sequences (some of the nodes; jVCF does not represent all of these, it might hide some), all the variations (other nodes), and how the sequences relate to each other and how sites include these sequences (edges; jVCF does not care how sequences relate to each other, it just needs the alleles, and it does not care about the structure of the site, it just needs the alleles of the site).

I think the spec description should be kept WRT NCDAGs, as it is a lot easier to reason about. But at the end of the spec, you might want to add a paragraph saying that jVCF can actually represent more general structures than NCDAGs.

bricoletc commented 4 years ago

Or we could do the opposite: say what it represents, and how (as you say) that's not necessarily tied to what the sequence graph looks like or how the sequence graph is represented, and then say we developed jVCF in the context of NCDAGs?

leoisl commented 4 years ago

yeah, that would be nice! But would this trigger too many changes in your spec?

bricoletc commented 3 years ago

@leoisl I propose in branches/dev to change the graph restrictions section to a requirements section, like so:

## Requirements
jVCF assumes:

* Variant sites have been defined on the genome graph
* Variant sites can be contained in other variant sites. It is known which sites
  are contained in which others.

For the latter point, a site contained in another occurs in a given sequence background.
Each sequence background must be labeled with a unique positive integer, called its **haplogroup**.
See this [toy graph](#example) for an example.

There's a comment in the markdown saying that jVCF was developed in the context of gramtools, for nested DAGs/NCDAGs, but I think this is enough?
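
The two requirements in the proposed section can be sketched as a small check in Python. This is only an illustration under assumptions: sites are numbered `0..num_sites-1`, containment is given as a nested dict `{parent_site: {haplogroup: [child_sites]}}`, and the names `check_child_map` and `outermost_sites` are hypothetical, not part of the spec.

```python
def check_child_map(num_sites, child_map):
    """Return a list of problems; an empty list means the checks passed.

    Assumed layout (hypothetical): {parent_site: {haplogroup: [child_sites]}}
    with integer site indices in range(num_sites).
    """
    problems = []
    for parent, by_haplogroup in child_map.items():
        if not 0 <= parent < num_sites:
            problems.append(f"parent site {parent} is not defined")
        for haplogroup, children in by_haplogroup.items():
            # Each sequence background is labeled with a positive integer.
            if not isinstance(haplogroup, int) or haplogroup < 1:
                problems.append(
                    f"site {parent}: haplogroup {haplogroup!r} is not a positive integer"
                )
            for child in children:
                if not 0 <= child < num_sites:
                    problems.append(f"child site {child} is not defined")
    return problems


def outermost_sites(num_sites, child_map):
    """Sites that are not contained in any other site."""
    contained = {
        child
        for by_haplogroup in child_map.values()
        for children in by_haplogroup.values()
        for child in children
    }
    return [site for site in range(num_sites) if site not in contained]


# Two sites; site 1 sits inside site 0, on the background labeled haplogroup 1.
child_map = {0: {1: [1]}}
print(check_child_map(2, child_map))
print(outermost_sites(2, child_map))
```

Nothing here looks at the graph itself, only at the sites and the recorded containment, which is the point made earlier in the thread.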