nmdp-bioinformatics / gfe-db

Graph database representing IPD-IMGT/HLA sequence data as GFE
https://gfe-db.readthedocs.io
GNU General Public License v3.0

Consider changing the new schema (incremental_load) #69

Open mmaiers-nmdp opened 1 year ago

mmaiers-nmdp commented 1 year ago

In the new schema image there are HAS_IPD_ALLELE edges both from GFE to IPD_Allele and from IPD_Accession to IPD_Allele.

The first issue is that both edges should be vectors (currently the one from IPD_Accession to IPD_Allele is scalar).

But there is a more fundamental issue: this model will not capture situations where the sequence changes but the IPD_Accession and the IPD_Allele stay the same. These are exactly the kind of inconsistencies that this graph database is well suited to discover and catalog. With that in mind, I propose that we update the schema so that IPD_Accession is connected directly to the GFE. More precisely, add a "HAS_IPD_ACCESSION" edge from the GFE node to the IPD_Accession node, symmetrical to the "HAS_IPD_ALLELE" edge from the GFE node to the IPD_Allele node.

The new HAS_IPD_ACCESSION edge should have an array of versions as an attribute.
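
To make the proposal concrete, here is a minimal Python sketch of how a GFE-to-accession edge with a versions array would expose a sequence change under an unchanged accession. It is an in-memory stand-in, not the gfe-db loader: the helper names, the GFE labels (`GFE_A`, `GFE_B`) and the release numbers are all placeholders I made up for illustration.

```python
from collections import defaultdict


def add_edge(edges, gfe, accession, release):
    """Record a hypothetical HAS_IPD_ACCESSION edge (gfe -> accession).

    The edge carries a list of release versions, mirroring the proposed
    "array of versions" attribute.
    """
    versions = edges[(gfe, accession)]
    if release not in versions:
        versions.append(release)


def accessions_with_changed_gfe(edges):
    """Return accessions that are linked to more than one GFE."""
    by_accession = defaultdict(set)
    for gfe, accession in edges:
        by_accession[accession].add(gfe)
    return {
        acc: sorted(gfes)
        for acc, gfes in by_accession.items()
        if len(gfes) > 1
    }


# Placeholder data: the same accession maps to a different GFE in a later release.
edges = defaultdict(list)
add_edge(edges, "GFE_A", "HLA00012.1", "release_1")
add_edge(edges, "GFE_A", "HLA00012.1", "release_2")
add_edge(edges, "GFE_B", "HLA00012.1", "release_3")
add_edge(edges, "GFE_C", "HLA00789.7", "release_3")

print(accessions_with_changed_gfe(edges))
```

With edges only from IPD_Accession to IPD_Allele, the third `add_edge` call above would have nothing to attach to, which is the gap described in the previous paragraph.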

Also, the minor version should be included, so the full accession number is stored (e.g. HLA00012.1, not HLA00012), with the major portion "HLA00012" kept as a separate attribute. That way, queries that join on the major portion do not have to parse the name in Cypher to get only the part left of the ".".
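
As a rough sketch of that split at load time (assuming it happens in Python before the node is written; the function name and the property keys `accession`, `major`, and `minor` are hypothetical, not existing gfe-db fields):

```python
def split_accession(full_accession):
    """Split a versioned accession such as 'HLA00012.1' into its parts.

    Returns a dict that could be stored as node properties, so queries can
    match on 'major' without string parsing.
    """
    major, sep, minor = full_accession.partition(".")
    if not sep or not major or not minor.isdigit():
        raise ValueError(
            f"expected an accession like 'HLA00012.1', got {full_accession!r}"
        )
    return {"accession": full_accession, "major": major, "minor": int(minor)}


print(split_accession("HLA00012.1"))
```

An accession without a minor version raises an error here rather than being stored silently; whether that is the right behavior for older data is an open question.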

[image: schema diagram]

mmaiers-nmdp commented 1 year ago
3016411 ID   HLA00789; SV 7; standard; DNA; HUM; 13220 BP.
3016412 XX
3016413 AC   HLA00789;
3016414 XX
3016415 SV   HLA00789.7
3016416 XX

The sub-version appears in both the SV element and the ID element.
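
A minimal Python sketch of pulling the sub-version from both places in the record above, so the two can be cross-checked. The regular expressions are my assumption based only on these few lines, not a full parser for the flat-file format, and the leading numbers from the excerpt are left out of the sample input.

```python
import re

# Assumed line shapes, inferred from the excerpt only.
ID_RE = re.compile(r"^ID\s+(?P<acc>[^;\s]+);\s+SV\s+(?P<sv>\d+);")
SV_RE = re.compile(r"^SV\s+(?P<acc>[^.\s]+)\.(?P<sv>\d+)\s*$")


def parse_versions(lines):
    """Collect (accession, sub-version) pairs from the ID and SV lines."""
    found = {}
    for line in lines:
        id_match = ID_RE.match(line)
        if id_match:
            found["id"] = (id_match["acc"], int(id_match["sv"]))
        sv_match = SV_RE.match(line)
        if sv_match:
            found["sv"] = (sv_match["acc"], int(sv_match["sv"]))
    return found


RECORD = [
    "ID   HLA00789; SV 7; standard; DNA; HUM; 13220 BP.",
    "XX",
    "AC   HLA00789;",
    "XX",
    "SV   HLA00789.7",
    "XX",
]

versions = parse_versions(RECORD)
print(versions)
# The two sources should agree; a mismatch would be worth flagging.
print(versions["id"] == versions["sv"])
```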