Open mbdebian opened 1 year ago
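Some comments you might want to consider:
- Rename `id` to `diseaseId`. This will allow us to establish relationships between datasets more easily.
- Rename `code` to `url`.
- Rename `dbXrefs` to `crossReferences`.
- Remove `indirectLocationIds` and `directLocationIds`.
- Evaluate the potential redundancy between the fields `parents`/`ancestors` and `children`/`descendants`.
- Agree with the renaming of the `synonyms` fields.
- Agree with the renaming of `leaf` to `isLeaf`.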
Thanks a lot for the comments, @ireneisdoomed! I’d like to add some inline comments.
Some comments you might want to consider:
- Rename `id` to `diseaseId`. This will allow us to establish relationships between datasets more easily.
That would be really helpful when it comes to using this dataset 👍🏻
- Rename `code` to `url`.
  - Do we even need this? Could we be resolving the ID on identifiers.org?
If we have Compact Identifiers, we can use identifiers.org resolution services, that’s correct. As EFO is an ontology, it gets special treatment on identifiers.org, and its URL could be https://identifiers.org/EFO:0000255
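To illustrate the point about resolving IDs instead of storing a URL, here is a minimal Python sketch. The function names are hypothetical, and the underscore-to-colon conversion assumes that disease IDs are stored in the `EFO_0000255` style, which this thread does not confirm.

```python
IDENTIFIERS_ORG = "https://identifiers.org"


def to_compact_identifier(disease_id: str) -> str:
    """Turn an ID like 'EFO_0000255' into a compact identifier 'EFO:0000255'.

    Assumption: stored IDs use '<PREFIX>_<ACCESSION>'; adjust if they do not.
    """
    prefix, sep, accession = disease_id.partition("_")
    if not sep or not prefix or not accession:
        raise ValueError(f"unexpected disease ID format: {disease_id!r}")
    return f"{prefix}:{accession}"


def compact_identifier_url(compact_id: str) -> str:
    """Build the identifiers.org URL for a compact identifier such as 'EFO:0000255'."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return f"{IDENTIFIERS_ORG}/{prefix}:{accession}"
```

If the URL can always be derived this way, the `code`/`url` field would carry no extra information and could be dropped from the schema.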
- Rename `dbXrefs` to `crossReferences`
👍🏻
- Remove `indirectLocationIds` and `directLocationIds`.
  - The content of these fields is not clear to me. They are null for all records except 34. I don't think we use them anywhere.
- Evaluate the potential redundancy between the fields `parents`/`ancestors` and `children`/`descendants`.
If they don’t bring in data value, the lighter the schema the better
- To clarify, parents and children refer to the immediate nodes directly above or below a specific ID, respectively. In contrast, ancestors and descendants capture all nodes when propagating upwards or downwards from a particular ID.
- I understand the distinction, but parents/children are essentially a subset of the ancestors/descendants that we could perhaps model differently. Do we use these anywhere? Perhaps when propagating the evidence.
Then, it looks like ancestors and descendants are just computational sugar, and we need to make sure that rebuilding that information from parents and children across the dataset is actually coherent with the stored values, with no loops, etc.
If traversing the whole tree upwards or downwards as a client of this dataset is not a use case, I would remove this information.
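The coherence check mentioned above could look roughly like the following Python sketch. It rebuilds ancestors from the `parents` lists alone, fails on loops, and lists the records whose stored `ancestors` disagree. The function names are hypothetical, and it assumes that a node is not listed among its own ancestors and that records are plain dictionaries.

```python
def ancestors_from_parents(parents: dict[str, list[str]]) -> dict[str, set[str]]:
    """For every node, collect all nodes reachable by following parent links."""
    result: dict[str, set[str]] = {}
    for start in parents:
        seen: set[str] = set()
        stack = list(parents.get(start, []))
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, []))
        # A node that reaches itself through parent links sits on a loop.
        if start in seen:
            raise ValueError(f"cycle detected through {start}")
        result[start] = seen
    return result


def incoherent_ids(records: list[dict]) -> list[str]:
    """Return the IDs whose stored ancestors differ from the rebuilt ones."""
    parents = {r["id"]: r.get("parents") or [] for r in records}
    derived = ancestors_from_parents(parents)
    return [
        r["id"]
        for r in records
        if set(r.get("ancestors") or []) != derived[r["id"]]
    ]


# Hypothetical three-node chain A <- B <- C, purely for illustration.
records = [
    {"id": "A", "parents": [], "ancestors": []},
    {"id": "B", "parents": ["A"], "ancestors": ["A"]},
    {"id": "C", "parents": ["B"], "ancestors": ["B", "A"]},
]
print(incoherent_ids(records))
```

The same approach applies to `children` and `descendants` by swapping the field names. If the check passes on the whole dataset, the transitive fields are derivable and keeping them becomes purely a convenience decision.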
- Agree with the renaming of the `synonyms` fields.
👍🏻
- Agree with the renaming of `leaf` to `isLeaf`
👍🏻
Background
The ETL Disease step persists its analysis output according to this configuration block:
This issue is related to diseases output.
Diseases output data model
Within a data (pre)/releases bucket for a particular release, this output can be found at the relative path
The schema looks like this 👇🏻
And a content sample is shown below 👇🏻
Data model refactoring proposal
Data content refactoring proposal
Across the output data we have a mix of '[]' and null values representing empty lists (arrays). It would be great if we could unify these into a default '[]' value for arrays wherever they have no content. Attributes of primitive types, such as strings, would also benefit from having a meaningful default value, where possible and applicable.
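As a rough illustration of the array unification, the sketch below normalizes a single record represented as a Python dictionary. The list of array fields is an assumption based on the field names discussed in this issue, not the actual schema, and the function name is hypothetical.

```python
# Assumption: these are the array-typed fields; replace with the real schema.
ARRAY_FIELDS = ("parents", "children", "ancestors", "descendants", "dbXrefs")


def normalize_empty_arrays(record: dict, array_fields=ARRAY_FIELDS) -> dict:
    """Return a copy of the record where null or missing array fields become []."""
    cleaned = dict(record)  # shallow copy, the input record is left untouched
    for field in array_fields:
        if cleaned.get(field) is None:
            cleaned[field] = []
    return cleaned
```

Note that this version also fills in array fields that are missing altogether. If a missing field should stay missing, the condition would need to check `field in cleaned` first.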