inab / benchmarking-data-model

OpenEBench Benchmarking Data Model repository
Creative Commons Attribution Share Alike 4.0 International

Add label attribute to some schemas #72

Closed javi-gv94 closed 5 years ago

javi-gv94 commented 5 years ago

Following the changes in issue #66, we have completely removed meaning from the identifiers. However, by doing so we have lost some information in several schemas (especially Challenges, Datasets, Metrics and TestActions). I would suggest adding a 'label / short_id / abbreviation' attribute to those schemas: a kind of acronym that indicates what is found in that entry (like an internal id). For now, this only exists in the Community.

If we want to go one step further, we could even think of some kind of hierarchy so that, for instance, in a dataset we can identify which challenge and tool it belongs to without having to call the API twice.

Some examples:

| Schema | Quest for Orthologs | Cancer Genome Atlas |
| --- | --- | --- |
| Community | QfO | TCGA |
| Challenge | STD (Species Tree Discordance Test) | BRCA (Breast Invasive Carcinoma) |
| Metrics | RF-dist (Robinson-Foulds distance) | TPR (True Positive Rate) |
| TestAction | STD:PhylomeDB:testEvent | BRCA:e-Driver:metricsEvent |
| Dataset | STD:PhylomeDB:participant | BRCA:e-Driver:assessment |

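
The hierarchical labels in the TestAction and Dataset examples above could be composed and split with a couple of small helpers. This is only a sketch: the names `make_label`, `parse_label` and `HierarchicalLabel`, and the three-part `challenge:tool:kind` layout, are assumptions read off those examples and are not part of the data model.

```python
from typing import NamedTuple

SEPARATOR = ":"  # assumed from the examples, e.g. STD:PhylomeDB:testEvent


class HierarchicalLabel(NamedTuple):
    """Hypothetical three-part label: challenge, tool, and kind of entry."""
    challenge: str  # e.g. "STD"
    tool: str       # e.g. "PhylomeDB"
    kind: str       # e.g. "testEvent" or "participant"


def make_label(challenge: str, tool: str, kind: str) -> str:
    """Join the three parts into one label string."""
    parts = (challenge, tool, kind)
    for part in parts:
        # Reject empty parts and parts that would break the round trip.
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid label part: {part!r}")
    return SEPARATOR.join(parts)


def parse_label(label: str) -> HierarchicalLabel:
    """Split a label back into its parts without any API call."""
    parts = label.split(SEPARATOR)
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts, got {len(parts)}: {label!r}")
    return HierarchicalLabel(*parts)
```

With these helpers, `parse_label("BRCA:e-Driver:assessment").tool` gives the tool name directly from the label, which is the "no second API call" benefit described above. The cost is that the meaning is packed into a single string.
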
jlgelpi commented 5 years ago

I would use this for identifiers that come from the community. If these are for recovering the old schema, I would prefer to complete the schema with the fields that may be missing (much easier to browse as fields than in a single string). Otherwise this would end up bringing back the mess that we were trying to avoid :-)

javi-gv94 commented 5 years ago

Then we could keep a label attribute in the Community, Challenge and Metrics, as those acronyms usually come from the community. Regarding the TestActions and Datasets, maybe we could include that label next to the pointer to a Challenge/Tool/Community every time it is used. I guess that wouldn't be that messy. See example file:

example_participant_dataset.txt

jlgelpi commented 5 years ago

In general, when any piece of data has an id or label that comes from the Community, it should be stored in the corresponding document. There is no need to repeat labels in every reference: they are stored in the main entry and can be obtained easily from there. For example, Datasets come from Challenges, so we would have the challenge id (OPBC....) and keep everything that is needed in the Challenge document. Repeating labels in each reference is a likely source of inconsistencies if a label is changed or updated.
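
A minimal sketch of this lookup-by-reference approach, using plain dictionaries as stand-ins for the stored documents. The identifiers (`challenge-1`, `dataset-1`), the field names (`_id`, `label`, `challenge_id`) and both function names are hypothetical and chosen only for illustration. They are not taken from the actual schemas.

```python
# In-memory stand-ins for stored documents.
# All ids and field names here are hypothetical.
challenges = {
    "challenge-1": {"_id": "challenge-1", "label": "STD"},
}
datasets = {
    # The dataset keeps only a pointer to its challenge, not a copy of the label.
    "dataset-1": {
        "_id": "dataset-1",
        "type": "participant",
        "challenge_id": "challenge-1",
    },
}


def challenge_label_for(dataset_id: str) -> str:
    """Resolve the community label through the reference to the Challenge."""
    dataset = datasets[dataset_id]
    return challenges[dataset["challenge_id"]]["label"]


def rename_challenge(challenge_id: str, new_label: str) -> None:
    """Update the label in the single place where it is stored."""
    challenges[challenge_id]["label"] = new_label
```

Because the label lives only in the Challenge document, calling `rename_challenge("challenge-1", ...)` is immediately reflected by `challenge_label_for("dataset-1")`, with no copies in other documents left to update.
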

JuergenHaasSIB commented 5 years ago

These are non-binding properties. The choice is between auto-generating labels and storing them (which aids data filtering/retrieval) versus auto-generating them on the fly (annotation for human readability).

jmfernandez commented 5 years ago

Fixed in commit 07469a78ced081ca3eeb9d59c5af079f87f3b37d, where the attribute orig_id is added to several concepts.