Closed UlrikeS91 closed 2 years ago
@UlrikeS91 & @Majpuc thanks for this issue. Here is my feedback:
Rest looks good to me :)
one thing: we could change "preferredOntologyIdentifier" to just "ontologyIdentifier" and allow multiple entries
- I think "species" should not be part of this schema since this schema will not stand alone but always be connected to a subject which has already a species attached. We can also keep it, but that would mean another graph validation point.
I'll see what Maja says, but I think you are right.
- "background" why not "breedingBackground" ? (only background I found confusing)
Yeah, like I said in the comment, I was pretty sure that this is a bad name for it :D I'll see if Maja has a better suggestion, but "breedingBackground" would already improve it, I think.
- "geneticBackground" why not "geneticStrainType" ? (I do not have a strong opinion here)
Same as above.
- "phenotypicDescription" why not "phenotype" ? and leave it as free text (for now; note: we can remove "phenotype" from specimen, correct?)
If we were to keep "phenotype" as a controlledTerm schema and property for specimen (called "phenotype"), we would need to remain consistent about this. ONLY if we remove phenotype altogether (schema and link on specimen) can we use "phenotype" with free text as the expected value.
one thing: we could change "preferredOntologyIdentifier" to just "ontologyIdentifier" and allow multiple entries
I like that suggestion, but we have used "ontologyIdentifier" (expecting an IRI) in many of the SANDS schemas, allowing only 0 - 1. Maybe we should stick to that, or we would need to adjust the SANDS schemas. Either would work for me.
I guess if we just ask for "ontologyIdentifier" we should always allow multiple and only when we ask for "preferredOntologyIdentifier" we ask for 1
For phenotype: is there any reason to keep the property on the subject if that is covered in the strain schema? Or, asked differently: is the phenotype something independent of a strain definition? Then it should be kept on the subject and removed from the strain schema (the combination would still be in place). If the phenotype depends solely on a strain definition, it should be kept in the strain schema and removed from the subject.
Following a very fruitful meeting with @Majpuc and the feedback/suggestions in this thread, I adjusted the draft:
Property | Count | Expected value | Notes/Thoughts/Comments |
---|---|---|---|
Name | 1 – 1 | free text | unchanged |
Synonyms | 0 – N | free text | unchanged |
Description | 0 – 1 | free text | unchanged |
Identifier | 0 – N | free text OR combination of "core/stringParameter" & other identifier-schemas (e.g. RRID) | the thought behind the stringParameter is that it would make explicit what kind of identifier it is, since we would expect MGI, MGD, RGD and more here; many are unique and rather specific, but I'm unsure if this applies to all |
ontologyIdentifier | 0 – N | IRI | as suggested by @lzehl (and liked by @majpuc) |
backgroundStrain | 0 - 2 | core/Strain | suggested by @majpuc, each strain displays a specific set of traits and when strains were crossed, the main traits may come from one or the other or share 50/50, but when e.g. backcrossing to go back to the main trait coming only from one strain, this would be the backgroundStrain (so the strain that causes the majority of the prominent trait) |
breedingType | 0 - 1 | controlledTerms/(strain)BreedingType | this contains what I envisioned in the previous "background", so e.g. inbred/outbred/etc. |
geneticStrainType | 1 – 1 | controlledTerms/GeneticStrainType | only property name changed following suggestion from both @lzehl and @majpuc (independently even) |
phenotype | 0 – 1 | free text | property name changed (both @lzehl & @majpuc commented on it), @majpuc and I landed on free text being better than a controlledTerm, could be discussed again |
diseaseModel | 0 – 1 | controlledTerms/diseaseModel | unchanged |
stockNumber | 0 – 1 | core/stockNumber | @majpuc pointed out that a stock number alone is too ambiguous, it should always come with the vendor information (see suggestion for stockNumber-schema below) |
LaboratoryCode | 0 – 1 | regex (first letter uppercase, followed by all lowercase) | kept as is BUT my definition kind of changed because @majpuc commented on it and is right about that; what I envisioned was that this goes hand-in-hand with stockNumber, but not all vendors are necessarily registered in the ILAR and it seems more intuitive that this refers to the person/organisation that has actually CREATED the strain (which is also what it is mostly used for, rather than to express who has the strain in stock); property could also be renamed to something like "Creator" or similar |

Suggestion for the stockNumber schema:

Property | Count | Expected value | Notes/Thoughts/Comments |
---|---|---|---|
Identifier | 1 – 1 | free text | I don't believe it is possible to add a regex here (vendors handle them too differently) |
Vendor | 1 - 1 | core/organization | I think organization should cover it, but feel free to suggest changes :D |
About the phenotype on specimen: Theoretically, every subject has a phenotype, but I don't know how useful this would be. For humans, this is never stated. How is it for e.g. monkeys? If I'm not mistaken, they would typically also not state a strain, but what about a phenotype? I have only ever come into contact with phenotypes on a subject level when a strain was relevant in the first place...
We now landed on the following:
Property | Count | Expected value | controlledTerms examples |
---|---|---|---|
name | 1 – 1 | free text | |
synonym | 0 – N | free text | |
description | 0 – 1 | free text | |
identifier | 0 – N | free text | (expects e.g. MGI, MGD, RGD) |
RRID | 0 – 1 | core/RRID | |
ontologyIdentifier | 0 – N | IRI | |
backgroundStrain | 0 - 2 | core/Strain | |
breedingType | 0 - 1 | controlledTerms/breedingType | inbred, outbred, hybrid, etc. |
geneticStrainType | 1 – 1 | controlledTerms/GeneticStrainType | wildtype, transgenic, knock-out, knock-in, etc. |
phenotype | 0 – 1 | free text | |
diseaseModel | 0 – 1 | controlledTerms/diseaseModel | see existing instances |
stockNumber | 0 – 1 | core/stockNumber (embedded) | |
laboratoryCode | 0 – 1 | regex (first letter uppercase, followed by all lowercase) | |

stockNumber (embedded):

Property | Count | Expected value |
---|---|---|
identifier | 1 – 1 | free text |
vendor | 1 - 1 | core/organization |
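
For anyone who wants to try the two tables above against real examples, here is a minimal Python sketch of them. It is not the openMINDS template format, just an illustration: the class names, the snake_case attribute names, the `validate` helper, and the plain-string stand-ins for the controlled terms and for core/organization are all assumptions. The regex also assumes that "first letter uppercase, followed by all lowercase" means ASCII letters only.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: one uppercase ASCII letter followed by zero or more lowercase ones.
LAB_CODE_PATTERN = re.compile(r"[A-Z][a-z]*")


@dataclass
class StockNumber:
    identifier: str  # 1 - 1, free text
    vendor: str      # 1 - 1, stands in for a link to core/organization


@dataclass
class Strain:
    name: str                 # 1 - 1, free text
    genetic_strain_type: str  # 1 - 1, stands in for controlledTerms/GeneticStrainType
    synonym: List[str] = field(default_factory=list)              # 0 - N
    description: Optional[str] = None                             # 0 - 1
    identifier: List[str] = field(default_factory=list)           # 0 - N
    rrid: Optional[str] = None                                    # 0 - 1
    ontology_identifier: List[str] = field(default_factory=list)  # 0 - N, IRIs
    background_strain: List["Strain"] = field(default_factory=list)  # 0 - 2
    breeding_type: Optional[str] = None                           # 0 - 1
    phenotype: Optional[str] = None                               # 0 - 1, free text
    disease_model: Optional[str] = None                           # 0 - 1
    stock_number: Optional[StockNumber] = None                    # 0 - 1, embedded
    laboratory_code: Optional[str] = None                         # 0 - 1, regex

    def validate(self) -> List[str]:
        """Return human-readable violations of the counts and the regex."""
        errors = []
        if not self.name:
            errors.append("name is required (1 - 1)")
        if not self.genetic_strain_type:
            errors.append("geneticStrainType is required (1 - 1)")
        if len(self.background_strain) > 2:
            errors.append("backgroundStrain allows at most 2 entries (0 - 2)")
        if self.laboratory_code is not None and not LAB_CODE_PATTERN.fullmatch(
            self.laboratory_code
        ):
            errors.append("laboratoryCode does not match the expected pattern")
        return errors
```

Calling `validate()` on an instance returns an empty list when the counts and the regex are satisfied, which makes it easy to run over a batch of candidate strain instances.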
In addition, I will:
sorry to chime in rather late, but I would like to argue for restoring "species" as a required field. When building interfaces, it would be nice to use this schema to present a list of strain choices to the user. To narrow down the list of strains we need to be able to filter by species.
No worries and it's not too late yet ;)
We kicked it out because specimen have properties for species and strain. If the strain includes a species as well, we introduce yet another graph validation point. But since species is a required field for specimen, it means that every specimen that has a strain also has to have a species. So, it would be possible to narrow down the list of strains based on species. Or would you say that this is not enough?
We kicked it out because specimen have properties for species and strain. If the strain includes a species as well, we introduce yet another graph validation point.
I understood the reasoning. I don't understand how big a problem it is to have another graph validation point.
But since species is a required field for specimen, it means that every specimen that has a strain also has to have a species. So, it would be possible to narrow down the list of strains based on species. Or would you say that this is not enough?
These schemas will be used independently of Specimen, for example in the case of models, or in building user interfaces.
@apdavison & @UlrikeS91 I think Andrew raises a valid point if a strain with a specific stock number can be used by two different labs (@UlrikeS91 or @Majpuc is this true?). And we have enough graph validation points that we anyway have to deal with at one point. So from my side, feel free to reintroduce "species"
Hi, yes, it is conceivable that several labs use the same strain from one specific vendor and thereby have the same stock number. I must admit I don't fully grasp the implications of reintroducing "species" or not, and I leave that to you to decide.
@apdavison & @UlrikeS91 I think Andrew raises a valid point if a strain with a specific stock number can be used by two different labs (@UlrikeS91 or @Majpuc is this true?).
Yes, very much possible, and for wildtype strains I would even call it more likely than the other way around.
Edit: I don't see why this is relevant, though, because strain is not an embedded type for specimen(sets), so it could be "recycled" for several specimen(sets) anyway. The only validation point that already exists this way would be if dataset A has a subject with mus musculus as species and links to C57BL, but dataset B has a subject with rattus norvegicus and also links to C57BL...
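
To make that validation point concrete, here is a rough Python sketch of the check, assuming the graph has already been flattened into (subject, species, strain) triples. The function name `find_species_conflicts` and the triple layout are hypothetical and only serve the illustration; this is not how any existing validation is implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_species_conflicts(
    links: Iterable[Tuple[str, str, str]]
) -> Dict[str, Set[str]]:
    """Return the strains that are linked together with more than one species.

    Each item in `links` is a (subject, species, strain) triple.
    """
    species_by_strain: Dict[str, Set[str]] = defaultdict(set)
    for _subject, species, strain in links:
        species_by_strain[strain].add(species)
    # Keep only strains that show up with two or more different species.
    return {
        strain: species
        for strain, species in species_by_strain.items()
        if len(species) > 1
    }


links = [
    ("datasetA/subject", "mus musculus", "C57BL"),
    ("datasetB/subject", "rattus norvegicus", "C57BL"),
]
conflicts = find_species_conflicts(links)
```

With the two subjects from the example, `conflicts` contains C57BL together with both species, which is exactly the inconsistency a graph validation would have to flag.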
And we have enough graph validation points that we anyway have to deal with at one point. So from my side, feel free to reintroduce "species"
Maybe this is an alternative that solves the validation issue and the more direct linkage of strain and species:
- Single specimens would have 1 - 1, either stating only a species (e.g. homo sapiens) or a strain that has the species (e.g. C57BL with species: mus musculus).
- Specimen sets would have 1 - N, where a mix can be used (e.g. a comparative dataset that has both human and mouse data; the subject group would then have homo sapiens (species) + C57BL/mus musculus (strain)).
and how would you call the corresponding property?
I would just have it twice then, on specimen(set) and on strain. It makes respective queries easier (query for strains only by species, query for subjects by species). And the plan is that we introduce a graph validation
and how would you call the corresponding property?
Found this online: "[...] strain is a sub-type of a genetic variant of biological species."
So, sticking to "species" as a property name could work
@olinux and @apdavison what do you think about @UlrikeS91's suggestion?
I like it. The disadvantage is that it makes it a bit more complex to retrieve the species from a specimen, since you have to handle both possibilities (specimen->species and specimen->strain->species). The advantage is that it avoids the redundancy (and source of possible conflicts) of having the species in two places in the graph.
On balance I think the advantages outweigh the disadvantages.
Thank you, @apdavison.
@olinux, do you have any arguments that would tip the scale to the other side?
It sounds to me as if it would be similar to other concepts already existing in the graph where multiple paths have to be aggregated to arrive at the full truth. I guess from a model perspective it makes sense and I do agree with @apdavison that the advantage of not introducing/materializing potentially conflicting redundancies is preferable. We can think about the resolution to be a "consumer" issue.
Not specific to this topic but a general comment: We should try to find a way to formalize these multi-paths inside the model in a machine readable way soon since their number increases and it would be important for consumers to understand what paths need to be taken into account.
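
As an illustration of what the "consumer" side of this would look like, here is a minimal Python sketch that resolves the species of a specimen(set) over both paths (specimen->species and specimen->strain->species). The class names and the `resolve_species` helper are assumptions made for this example and do not reflect the actual openMINDS schemas or any existing API.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Species:
    name: str


@dataclass(frozen=True)
class Strain:
    name: str
    species: Species  # the strain carries its own species


@dataclass
class Specimen:
    name: str
    # 1 - 1 for a single specimen, 1 - N for a specimen set;
    # each entry is either a species or a strain.
    species: List[Union[Species, Strain]]


def resolve_species(specimen: Specimen) -> Set[str]:
    """Collect species names over both possible paths."""
    resolved: Set[str] = set()
    for entry in specimen.species:
        if isinstance(entry, Strain):
            resolved.add(entry.species.name)  # specimen -> strain -> species
        else:
            resolved.add(entry.name)          # specimen -> species
    return resolved


human = Species("homo sapiens")
mouse = Species("mus musculus")
c57bl = Strain("C57BL", mouse)
comparative_group = Specimen("comparativeGroup", [human, c57bl])
```

For the comparative subject group from the example, `resolve_species(comparative_group)` yields both homo sapiens and mus musculus, even though only one of them is stated directly on the group.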
Thanks @olinux for the feedback. It seems like everybody is leaning towards adjusting "species" for specimen(sets) to allow both core/strain and/or controlledTerm/species. So, I will update the schemas accordingly.
I also started testing the schema on some use cases and ran into some problems.
The "geneticStrainType" can be used to indicate whether a strain is e.g. a "wildtype", "knock-out", "knock-in", "transgenic", "floxed", etc.
Especially for terms like "floxed" this will not be enough. Example: A strain could have been modified so that a specific gene portion can be conditionally knocked-out. The geneticStrainType could then be "floxed". When knocked-out both the geno- and phenotype of the animal change, but when it's not knocked-out, the phenotype of the animal may be close to the wildtype phenotype of the background strain.
Especially in such cases, these animals are often used as both the control group (not knocked-out) and the test group (with the gene portion knocked-out). This is, in fact, more important than knowing that this is a "conditional ready"-strain.
Since the schema is collecting things like stock numbers, it would seem wrong to just adjust for this by changing the "geneticStrainType" to include this information. Then two strain instances could look like this:

name | geneticStrainType (or different name then) | stockNumber |
---|---|---|
conditionalStrainName | conditionally knocked-out | 000123 |
conditionalStrainName | conditional ready or control | 000123 |
So, they could (and should) have the same name and stockNumber, but because one of them has been conditionally knocked-out, they need to be captured separately. In principle, it wouldn't be wrong because, following the conditional KO, they are indeed genetically different (and fit within the general definition of what a strain is, as far as I can see). But since the conditionally knocked-out animal cannot be bought but needs to be produced by the researcher in the lab, it doesn't seem right to solve it like this. We could remove the stockNumber, but since several experts in the field think that this is THE most reliable way of capturing strains, it wouldn't be the smart choice.
An alternative could be to add a property to specimen(set) that captures this additional procedural step, called e.g. "geneticState". Then we could have two subjects linking the same strain, but the geneticState tells whether the KO was performed or not.

subject name | strain | geneticState |
---|---|---|
sub01_conditionalControl | conditionalStrainName | control |
sub02_conditionalKO | conditionalStrainName | KO |
(not exactly following subject schema here...)
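
The same idea as a small Python sketch, in case it helps the discussion: one shared strain instance and two subjects that differ only in the hypothetical "geneticState" property. The class layout and attribute names are assumptions for illustration and do not follow the real subject schema either.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strain:
    name: str
    genetic_strain_type: str
    stock_number: str


@dataclass
class Subject:
    name: str
    strain: Strain
    # Hypothetical property proposed above, e.g. "control" or "KO".
    genetic_state: Optional[str] = None


# One strain instance, as it can be bought from the vendor.
conditional = Strain("conditionalStrainName", "floxed", "000123")

# Two subjects link the same strain; only the genetic state differs.
control = Subject("sub01_conditionalControl", conditional, "control")
knockout = Subject("sub02_conditionalKO", conditional, "KO")
```

The strain keeps a single name and stock number, while the information about whether the conditional KO was performed lives on the subject.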
We could also go in the direction of what we have done with specimen + specimenState but reverse the connection. The strain schema collects the attributes that will remain the same at all times, and an additional schema "strainState" would give the temporarily true information, but the strainState links to strain so that the strain can be shared by several states. And the states would be the ones linked on the specimen(set). This seems a bit overkill, though...
Any suggestions, ideas, comments: @lzehl, @Majpuc, @apdavison?
To summarize our offline discussion @UlrikeS91 :
We hope you like this solution @Majpuc & @apdavison ?
@lydiang any thoughts on this relative to the NWB genotype schema?
@tgbugs thanks for bringing this up. It would be great if we could directly map to NWB. @lydiang I'm looking forward to your comments :) (the strain schema can be found here: https://github.com/HumanBrainProject/openMINDS_core/blob/v4/schemas/research/strain.schema.tpl.json)
I'll close this issue for now, because we went with the schema https://github.com/HumanBrainProject/openMINDS_core/blob/v4/schemas/research/strain.schema.tpl.json for now.
@lydiang and @tgbugs feel free to raise a new issue for discussion
Trying to use and populate the controlled terms for strains has been a challenge. Based on feedback/concerns from e.g. @Majpuc and following a discussion between @lzehl and me, we are leaning strongly towards removing the strains from the controlled terms and instead making them a schema in the core repository (potentially with controlled instances that follow that schema).
I worked on a rough first draft for such a schema:
Some resources I found useful: https://www.ncbi.nlm.nih.gov/books/NBK224550/ http://www.informatics.jax.org/mgihome/nomen/strains.shtml https://www.nap.edu/labcode/search_lc_nodep.php
Any feedback is very welcome! Maybe: @tgbugs, @lzehl, @Majpuc, who else?