Open bricoletc opened 4 years ago
I think gramtools paper definition of a site requires DAG:
is outdegree(v) for a node v ∈ V , which is the number of edges (v, w) ∈ E where
w ∈ V . The second is an ordering of nodes in paths: for all paths (v i , .., v j , .., v n )
in G and j ∈ [i, n), node v j <= v j+1 . We define a site as a set of nodes s(v 1 , v 2 ) =
{v ∈ V |v 1 6 v 6 v 2 } for which outdegree(v 1 ) > 1 and v 2 is the first node where all
paths from v 1 end.
On the other hand, we could say that jVCF itself doesn't care what the graph is; it could be bidirected and have cycles; all that matters is that a set of variant sites is described, and the relationship between them is recorded.
PS: after writing this, I think all of this is mainly what you wrote in your last comment, but will still leave it here...
Ahm, that is interesting... I am starting to think that jVCF don't even care if it represents a graph. It might represent a data structure that is more general than a graph. For example, all that we care to describe each site is mostly the alleles in the site, genotyping, haplogroup, etc:
And child map just represents an element inclusion property:
A PRG is much more specific than this. It represents all the genomic allele sequences (some of the nodes - jVCF do not represent all of these, it might hide some), all the variations (other nodes), and how each sequence relates with each other and how sites include these sequences (edges - jVCF does not care how sequences relate to each other, it just needs the alleles; and it does not care about the structure of the site, it just needs the alleles of the site).
I think the spec description should be kept WRT NCDAGs, as it is a lot easier to reason about. But at the end of the spec, you might want to add a paragraph saying that jVCF can actually represent more general structures than NCDAGs
Or we could do the opposite, say what it represents, and how (as you say) that's not necessarily tied to what the sequence graph looks like, or how the sequence graph is represented, and then say we developed jVCF in context of NCDAGs?
yeah, that would be nice! But would this trigger too many changes in your spec?
@leoisl i propose in branches/dev to change the graph restrictions
section to requirements
section, as so:
## Requirements
jVCF assumes:
* Variant sites have been defined on the genome graph
* Variant sites can be contained in other variant sites. It is known which sites
are contained in which others.
For the latter point, a site contained in another occurs in a given sequence background.
Each sequence background must be labeled with a unique positive integer, called its **haplogroup**.
See this [toy graph](#example) for an example.
THere's a comment in the markdown about in the context of gramtools, it being developed for nested DAGs/NCDAGs, but I think this is enough?
I'm trying to layout what the restrictions on graph imposed by jVCF are. I'll list criteria and what may happen if they are not met
Possibilities:
Graph is directed If graph is not directed, what is the POS of the following SNP, 2 or 3? adirected_graph.pdf
Graph is acyclic In following cyclic graph, we have one site with alleles C vs G. But then also an infinite number of sites with arbitrarily long alleles? cyclic_graph.pdf
Graph is non-crossing By non-crossing, i mean two variant sites can only share a node if one is contained in the other In following graph, this does not hold: a
G
node is part of several sites, (one os allelesCA
,GA
,GT
, the other is allelesAT
,GT
,GA
) crossing_graph.pdf Two things here:gramtools
requires sites to be 'non-crossing'. The reason is that variation is flanked with number IDs. It would not be possible to represent theG
node in the two variant sites in one place only:A 5 T 7 C 8 **G** 8 A 6 A 9 **G** 10 A 10 6 T
. I know this is horrible on the eyes but G node has to be duplicated.make_prg
produces non-crossing, directed acyclic graphs.Child_Map
would have two different sites contain the A/T SNP as a child.On that basis I'd say jVCF can represent variation on directed, acyclic graphs.