nmdp-bioinformatics / gfe-db

Graph database representing IPD-IMGT/HLA sequence data as GFE
https://gfe-db.readthedocs.io
GNU General Public License v3.0

GFE not unique? #77

Closed mmaiers-nmdp closed 7 months ago

mmaiers-nmdp commented 1 year ago

There are two GFEs with the same name. This shouldn't happen in this database (e.g. we need a uniqueness constraint on g.name).

match (g:GFE)-[:HAS_SEQUENCE]-(s:Sequence) where g.name="HLA-Bw0-0-0-2410-0-5586-0-0-0-0-0-0-0-0-0"
    return g


Also, Sequence should be 1-to-1 with GFE, but these two GFEs, which have the same name string, have 7 sequences associated with them:

match (g:GFE)-[:HAS_SEQUENCE]-(s:Sequence) where g.name="HLA-Bw0-0-0-2410-0-5586-0-0-0-0-0-0-0-0-0"
    return g,s


Of these 7 sequences, 5 are unique, and they occur in GFEDB multiple times each:

match (g:GFE)-[:HAS_SEQUENCE]-(s:Sequence) where g.name="HLA-Bw0-0-0-2410-0-5586-0-0-0-0-0-0-0-0-0"
    return s.sequence, count(*)
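The queries above inspect a single GFE name. To see whether other names are affected, the same checks can be run across the whole graph. This is a minimal sketch that assumes only the labels, relationship type, and properties already used in this issue (`GFE`, `Sequence`, `HAS_SEQUENCE`, `g.name`, `s.sequence`); the aliases `name`, `nodes`, and `distinct_sequences` are illustrative.

```cypher
// List every GFE name that is carried by more than one GFE node.
MATCH (g:GFE)
WITH g.name AS name, count(g) AS nodes
WHERE nodes > 1
RETURN name, nodes
ORDER BY nodes DESC;

// List every GFE name that is linked to more than one distinct sequence string.
MATCH (g:GFE)-[:HAS_SEQUENCE]-(s:Sequence)
WITH g.name AS name, count(DISTINCT s.sequence) AS distinct_sequences
WHERE distinct_sequences > 1
RETURN name, distinct_sequences
ORDER BY distinct_sequences DESC;
```

If both queries return no rows, names are unique and each name maps to a single sequence, which is the state a uniqueness constraint would enforce.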

Goal:

mmaiers-nmdp commented 1 year ago

@pbashyal-nmdp found the uniqueness constraints, and the one for GFE.name needs to be updated (it currently says GFE.GFE_name).
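A corrected constraint might look like the sketch below. It assumes the newer `FOR ... REQUIRE` constraint syntax (Neo4j 4.4 and later); older releases use `CREATE CONSTRAINT ON (g:GFE) ASSERT g.name IS UNIQUE` instead. The constraint name `gfe_name_unique` is hypothetical and not taken from the gfe-db build scripts.

```cypher
// Inspect the constraints currently defined, to find the one on GFE.GFE_name.
SHOW CONSTRAINTS;

// Enforce uniqueness on the property the GFE nodes actually carry.
// The name gfe_name_unique is an illustrative choice.
CREATE CONSTRAINT gfe_name_unique IF NOT EXISTS
FOR (g:GFE) REQUIRE g.name IS UNIQUE;
```

Note that Neo4j refuses to create a uniqueness constraint while duplicate values exist, so the duplicate GFE nodes would have to be merged or removed first.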

After discussion, we agreed to update the model to have a single GFE node with the sequence as a property, and then add a uniqueness constraint on GFE.sequence.

The queries will be simpler and faster, and the HAS_SEQUENCE edge can be removed.
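Under the proposed model, the lookup from the original report would no longer need a relationship traversal. The sketch below assumes the sequence is stored as a `sequence` property directly on the `GFE` node, as proposed above, and uses the Neo4j 4.4+ constraint syntax; the constraint name `gfe_sequence_unique` is hypothetical.

```cypher
// One GFE node per sequence: enforce it at the schema level.
// The name gfe_sequence_unique is an illustrative choice.
CREATE CONSTRAINT gfe_sequence_unique IF NOT EXISTS
FOR (g:GFE) REQUIRE g.sequence IS UNIQUE;

// The lookup becomes a single-node match with no HAS_SEQUENCE hop.
MATCH (g:GFE {name: "HLA-Bw0-0-0-2410-0-5586-0-0-0-0-0-0-0-0-0"})
RETURN g.name, g.sequence;
```

One thing worth checking before adopting this: uniqueness constraints are backed by an index, and very long sequence strings may run into index key size limits depending on the Neo4j version.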

mmaiers-nmdp commented 1 year ago

Are these actually not unique in IMGT?

mmaiers-nmdp commented 7 months ago

This is fixed in the current database on Docker Hub, "chrisammon3000/gfe-db:latest" (accessed 2024-02-12).