openMetadataInitiative / openMINDS_controlledTerms

Metadata model for the consistent registration of well-defined terms as well as a corresponding library of terminologies (including links to ontological terms where applicable).
MIT License

Reconsidering strains #58

Closed UlrikeS91 closed 2 years ago

UlrikeS91 commented 2 years ago

Trying to use and populate the controlled terms for strains has been a challenge. Based on feedback/concerns from e.g. @Majpuc and following a discussion between @lzehl and me, we are leaning strongly towards removing the strains from the controlled terms and instead making them a schema in the core repository (potentially with controlled instances that follow that schema).

I worked on a rough first draft for such a schema:

| Property | Count | Expected value | Notes/Thoughts/Comments |
|---|---|---|---|
| Name | 1 – 1 | free text | this needs to be required |
| Synonyms | 0 – N | free text | could be nice to have, not crucial though |
| Description | 0 – 1 | free text | could be nice to have; I envisioned a description of what the strain is, which may or may not contain some of the fields below |
| Species | 1 – 1 | controlledTerms/Species | unsure if this is needed, but if we keep it/find it important, we need to see how it works together with the species for the specimen (e.g. rather the other way around, as embedded type for species schema) |
| Identifier | 0 – N | free text | such as MGI, MGD, RGD, RRID; here we need to see which ones, and RRID may need to be split out as its own property instead if we allow free text but introduce an RRID schema (see https://github.com/HumanBrainProject/openMINDS_core/pull/242) |
| preferredOntologyIdentifier | 0 – 1 | IRI | |
| background | 1 – 1 | controlledTerms/BackgroundStrainType | such as inbred, outbred, hybrid, mixed inbred, segregating inbred, recombinant inbred, collaborative cross/multiparental recombinant inbred, recombinant congenic, congenic, advanced intercross lines, etc.; the property name is probably really bad, but I couldn't think of anything better |
| GeneticBackground | 1 – 1 | controlledTerms/GeneticStrainType | such as wildtype, spontaneous mutation, induced mutation, KO, KI, floxed, transgenic, etc.; same comment as above, property name could use an update |
| phenotypicDescription | 0 – 1 | free text or controlledTerms/phenotype | I'm leaning towards free text, but that would also mean that we need to reconsider controlledTerms/phenotype |
| diseaseModel | 0 – 1 | controlledTerms/diseaseModel | this would probably be described under phenotype to some degree but might be nice to have since we do actually have the controlledTerm for it |
| stockNumber | 0 – 1 | free text or regex | I think these stock numbers differ depending on the producer/holding site, so creating a regex is probably not possible; I also don't think that they are unambiguous without the producer/holding site (i.e. lab code below, but maybe we need to combine them into their own schema?) |
| LaboratoryCode | 0 – 1 | regex (first letter uppercase, followed by all lowercase) | from ILAR lab code registry; also see comment above |

Some resources I found useful:

  • https://www.ncbi.nlm.nih.gov/books/NBK224550/
  • http://www.informatics.jax.org/mgihome/nomen/strains.shtml
  • https://www.nap.edu/labcode/search_lc_nodep.php

Any feedback is very welcome! Maybe: @tgbugs, @lzehl, @Majpuc, who else?

lzehl commented 2 years ago

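@UlrikeS91 & @Majpuc thanks for this issue. Here is my feedback:

  • I think "species" should not be part of this schema since this schema will not stand alone but always be connected to a subject which has already a species attached. We can also keep it, but that would mean another graph validation point.
  • "background" why not "breedingBackground" ? (only background I found confusing)
  • "geneticBackground" why not "geneticStrainType" ? (I do not have a strong opinion here)
  • "phenotypicDescription" why not "phenotype" ? and leave it as free text (for now; note: we can remove "phenotype" from specimen, correct?)

Rest looks good to me :)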

lzehl commented 2 years ago

one thing: we could change "preferredOntologyIdentifier" to just "ontologyIdentifier" and allow multiple entries

UlrikeS91 commented 2 years ago

> • I think "species" should not be part of this schema since this schema will not stand alone but always be connected to a subject which has already a species attached. We can also keep it, but that would mean another graph validation point.

I'll see what Maja says, but I think you are right.

> • "background" why not "breedingBackground" ? (only background I found confusing)

Yeah, like I said in the comment, I was pretty sure that this is a bad name for it :D I'll see if Maja has a better suggestion, but "breedingBackground" would already improve it, I think.

> • "geneticBackground" why not "geneticStrainType" ? (I do not have a strong opinion here)

Same as above.

> • "phenotypicDescription" why not "phenotype" ? and leave it as free text (for now; note: we can remove "phenotype" from specimen, correct?)

If we were to keep "phenotype" as a controlledTerm schema and property for specimen (called "phenotype"), we would need to remain consistent about this. ONLY if we remove phenotype altogether (schema and link on specimen) can we use "phenotype" with free text as the expected value.

> one thing: we could change "preferredOntologyIdentifier" to just "ontologyIdentifier" and allow multiple entries

I like that suggestion, but we have used "ontologyIdentifier" (expecting IRI) for many of the SANDS schemas, only allowing 0 - 1. Maybe we should stick to that, or we would need to adjust the SANDS schemas. Either would work for me.

lzehl commented 2 years ago

I guess if we just ask for "ontologyIdentifier" we should always allow multiple and only when we ask for "preferredOntologyIdentifier" we ask for 1

For phenotype: is there any reason to keep the property on the subject if it is covered in the strain schema? Or, asked differently: is the phenotype something independent of a strain definition? If so, it should be kept on the subject and removed from the strain schema (the combination would still be in place). If the phenotype depends solely on the strain definition, it should be kept in the strain schema and removed from the subject.

UlrikeS91 commented 2 years ago

Following a very fruitful meeting with @Majpuc and the feedback/suggestions in this thread, I adjusted the draft:

Strain

| Property | Count | Expected value | Notes/Thoughts/Comments |
|---|---|---|---|
| Name | 1 – 1 | free text | unchanged |
| Synonyms | 0 – N | free text | unchanged |
| Description | 0 – 1 | free text | unchanged |
| Identifier | 0 – N | free text OR combination of "core/stringParameter" & other identifier-schemas (e.g. RRID) | the thought with the stringParameter is that it would make it more specific what the identifier is, since we would expect MGI, MGD, RGD and more here; many are unique and rather specific, but I'm unsure if this applies to all |
| ontologyIdentifier | 0 – N | IRI | as suggested by @lzehl (and liked by @majpuc) |
| backgroundStrain | 0 – 2 | core/Strain | suggested by @majpuc; each strain displays a specific set of traits, and when strains are crossed, the main traits may come from one or the other or be shared 50/50, but when e.g. backcrossing so that the main trait comes from only one strain, this would be the backgroundStrain (so the strain that causes the majority of the prominent trait) |
| breedingType | 0 – 1 | controlledTerms/(strain)BreedingType | this contains what I envisioned in the previous "background", so e.g. inbred/outbred/etc. |
| geneticStrainType | 1 – 1 | controlledTerms/GeneticStrainType | only the property name changed, following suggestions from both @lzehl and @majpuc (independently even) |
| phenotype | 0 – 1 | free text | property name changed (both @lzehl & @majpuc commented on it); @majpuc and I landed on free text being better than a controlledTerm, could be discussed again |
| diseaseModel | 0 – 1 | controlledTerms/diseaseModel | unchanged |
| stockNumber | 0 – 1 | core/stockNumber | @majpuc pointed out that a stock number alone is too ambiguous; it should always come with the vendor information (see suggestion for stockNumber-schema below) |
| LaboratoryCode | 0 – 1 | regex (first letter uppercase, followed by all lowercase) | kept as is BUT my definition kind of changed because @majpuc commented on it and is right about that; what I envisioned was that this goes hand-in-hand with stockNumber, but not all vendors are necessarily registered with ILAR, and it seems more intuitive that this refers to the person/organisation that has actually CREATED the strain (which is also what it is mostly used for, and not to express who has the strain in stock); the property could also be renamed to something like "Creator" or similar |

StockNumber

| Property | Count | Expected value | Notes/Thoughts/Comments |
|---|---|---|---|
| Identifier | 1 – 1 | free text | I don't believe it is possible to add a regex here (vendors handle them too differently) |
| Vendor | 1 – 1 | core/organization | I think organization should cover it, but feel free to suggest changes :D |

About the phenotype on specimen: Theoretically, every subject has a phenotype, but I don't know how useful this would be. For humans, this is never stated. How is it for e.g. monkeys? If I'm not mistaken, they would typically also not state a strain, but what about a phenotype? I have only ever come into contact with phenotypes on a subject level when a strain was relevant in the first place...

UlrikeS91 commented 2 years ago

We now landed on the following:

Strain

| Property | Count | Expected value | controlledTerms examples |
|---|---|---|---|
| name | 1 – 1 | free text | |
| synonym | 0 – N | free text | |
| description | 0 – 1 | free text | |
| identifier | 0 – N | free text (expects e.g. MGI, MGD, RGD) | |
| RRID | 0 – 1 | core/RRID | |
| ontologyIdentifier | 0 – N | IRI | |
| backgroundStrain | 0 – 2 | core/Strain | |
| breedingType | 0 – 1 | controlledTerms/breedingType | inbred, outbred, hybrid, etc. |
| geneticStrainType | 1 – 1 | controlledTerms/GeneticStrainType | wildtype, transgenic, knock-out, knock-in, etc. |
| phenotype | 0 – 1 | free text | |
| diseaseModel | 0 – 1 | controlledTerms/diseaseModel | see existing instances |
| stockNumber | 0 – 1 | core/stockNumber (embedded) | |
| laboratoryCode | 0 – 1 | regex (first letter uppercase, followed by all lowercase) | |

StockNumber

| Property | Count | Expected value |
|---|---|---|
| identifier | 1 – 1 | free text |
| vendor | 1 – 1 | core/organization |
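To make the counts and the lab code regex a bit more tangible, here is a rough Python sketch of the two schemas. It is only an illustration, not the actual openMINDS template: the class layout, the `problems` helper, and the pattern `[A-Z][a-z]*` are my assumptions (the pattern is just one reading of "first letter uppercase, followed by all lowercase"), only a subset of the properties is shown, and linked types such as core/organization are reduced to plain strings.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed reading of the regex note: one uppercase letter, then only lowercase letters.
LAB_CODE_PATTERN = re.compile(r"[A-Z][a-z]*")


@dataclass
class StockNumber:
    identifier: str  # 1 - 1, free text
    vendor: str      # 1 - 1, stands in for a link to core/organization


@dataclass
class Strain:
    name: str                                                         # 1 - 1
    genetic_strain_type: str                                          # 1 - 1
    synonym: List[str] = field(default_factory=list)                  # 0 - N
    background_strain: List["Strain"] = field(default_factory=list)   # 0 - 2
    breeding_type: Optional[str] = None                               # 0 - 1
    stock_number: Optional[StockNumber] = None                        # 0 - 1, embedded
    laboratory_code: Optional[str] = None                             # 0 - 1

    def problems(self) -> List[str]:
        """Collect violations of the counts and of the lab code pattern."""
        issues = []
        if not self.name:
            issues.append("name is required")
        if not self.genetic_strain_type:
            issues.append("geneticStrainType is required")
        if len(self.background_strain) > 2:
            issues.append("backgroundStrain allows at most 2 entries")
        if (self.laboratory_code is not None
                and not LAB_CODE_PATTERN.fullmatch(self.laboratory_code)):
            issues.append("laboratoryCode does not match the pattern")
        return issues


# Illustrative values only; the stock number and vendor are placeholders.
example = Strain(
    name="conditionalStrainName",
    genetic_strain_type="floxed",
    stock_number=StockNumber(identifier="000123", vendor="Example Vendor"),
    laboratory_code="Crl",
)
print(example.problems())
```

Embedding `StockNumber` inside `Strain` mirrors the "(embedded)" note in the table: the stock number never exists without its vendor.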

In addition, I will:

  1. remove "phenotype" from specimen (at least in the EBRAINS context it was never used there, and we could not come up with any relevant examples otherwise either)
  2. remove controlledTerms/phenotype (it will be removed from specimen and here it is free text, so this isn't necessary anymore)
  3. allow for "ontologyIdentifier" in other schemas also 0 - N to be consistent across all schemas

apdavison commented 2 years ago

sorry to chime in rather late, but I would like to argue for restoring "species" as a required field. When building interfaces, it would be nice to use this schema to present a list of strain choices to the user. To narrow down the list of strains we need to be able to filter by species.

UlrikeS91 commented 2 years ago

> sorry to chime in rather late, but I would like to argue for restoring "species" as a required field. When building interfaces, it would be nice to use this schema to present a list of strain choices to the user. To narrow down the list of strains we need to be able to filter by species.

No worries and it's not too late yet ;)

We kicked it out because specimen have properties for species and strain. If the strain includes a species as well, we introduce yet another graph validation point. But since species is a required field for specimen, it means that every specimen that has a strain also has to have a species. So, it would be possible to narrow down the list of strains based on species. Or would you say that this is not enough?

apdavison commented 2 years ago

> We kicked it out because specimen have properties for species and strain. If the strain includes a species as well, we introduce yet another graph validation point.

I understood the reasoning. I don't understand how big a problem it is to have another graph validation point.

> But since species is a required field for specimen, it means that every specimen that has a strain also has to have a species. So, it would be possible to narrow down the list of strains based on species. Or would you say that this is not enough?

These schemas will be used independently of Specimen, for example in the case of models, or in building user interfaces.

lzehl commented 2 years ago

@apdavison & @UlrikeS91 I think Andrew raises a valid point if a strain with a specific stock number can be used by two different labs (@UlrikeS91 or @Majpuc is this true?). And we have enough graph validation points that we have to deal with at some point anyway. So from my side, feel free to reintroduce "species"

Majpuc commented 2 years ago

Hi, yes, it is conceivable that several labs use the same strain from one specific vendor and thereby have the same stock number. I must admit I don't fully grasp the implications of reintroducing "species" or not, and leave that to you to decide.

UlrikeS91 commented 2 years ago

> @apdavison & @UlrikeS91 I think Andrew raises a valid point if a strain with a specific stock number can be used by two different labs (@UlrikeS91 or @Majpuc is this true?).

Yes, very much possible, and for wildtype strains I would even call it more likely than the other way around.

Edit: I don't see why this is relevant though, because strain is not an embedded type for specimen(sets), so it could be "recycled" for several specimen(sets) anyway. The only validation point that already exists this way would be if dataset A has a subject with mus musculus as species and links to C57BL, but dataset B has a subject with rattus norvegicus and also links to C57BL...

> And we have enough graph validation points that we anyway have to deal with at one point. So from my side, feel free to reintroduce "species"

Maybe this is an alternative that solves the validation issue and the more direct linkage of strain and species:

  • Single specimens would have 1 - 1, either stating only a species (e.g. homo sapiens) or a strain that has the species (e.g. C57BL with species: mus musculus).
  • Specimen sets would have 1 - N, where a mix can be used (e.g. a comparative dataset that has both human and mouse data; the subject group would then have homo sapiens (species) + C57BL/mus musculus (strain)).

lzehl commented 2 years ago

and how would you call the corresponding property?

lzehl commented 2 years ago

I would just have it twice then, on specimen(set) and on strain. It makes the respective queries easier (query for strains only by species, query for subjects by species). And the plan is that we introduce a graph validation.

UlrikeS91 commented 2 years ago

> and how would you call the corresponding property?

Found this online: "[...] strain is a sub-type of a genetic variant of biological species."

So, sticking to "species" as a property name could work

lzehl commented 2 years ago

@olinux and @apdavison what do you think about @UlrikeS91's suggestion?

apdavison commented 2 years ago

I like it. The disadvantage is that it makes it a bit more complex to retrieve the species from a specimen, since you have to handle both possibilities (specimen->species and specimen->strain->species). The advantage is that it avoids the redundancy (and source of possible conflicts) of having the species in two places in the graph.

On balance I think the advantages outweigh the disadvantages.
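To illustrate the extra handling on the consumer side, here is a minimal Python sketch of resolving species when the specimen's "species" property can hold either a species or a strain. The `Species` and `Strain` classes and the `species_names` helper are hypothetical names for this example, not part of openMINDS.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Species:
    name: str


@dataclass(frozen=True)
class Strain:
    name: str
    species: Species  # assumes the strain carries its species


def species_names(entries: List[Union[Species, Strain]]) -> Set[str]:
    """Collect species names from a specimen's mixed species/strain entries."""
    names = set()
    for entry in entries:
        if isinstance(entry, Strain):
            names.add(entry.species.name)  # specimen -> strain -> species
        else:
            names.add(entry.name)          # specimen -> species
    return names


# A subject group mixing a plain species and a strain, as in the comparative example.
human = Species("Homo sapiens")
c57bl = Strain("C57BL", Species("Mus musculus"))
print(sorted(species_names([human, c57bl])))
```

The cost is the one branch inside the loop; in exchange, the species of a strain is stored in only one place.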

UlrikeS91 commented 2 years ago

> I like it. The disadvantage is that it makes it a bit more complex to retrieve the species from a specimen, since you have to handle both possibilities (specimen->species and specimen->strain->species). The advantage is that it avoids the redundancy (and source of possible conflicts) of having the species in two places in the graph.
>
> On balance I think the advantages outweigh the disadvantages.

Thank you, @apdavison.

@olinux, do you have any arguments that would tip the scale to the other side?

olinux commented 2 years ago

It sounds to me as if it would be similar to other concepts already existing in the graph where multiple paths have to be aggregated to arrive at the full truth. I guess from a model perspective it makes sense and I do agree with @apdavison that the advantage of not introducing/materializing potentially conflicting redundancies is preferable. We can think about the resolution to be a "consumer" issue.

Not specific to this topic but a general comment: We should try to find a way to formalize these multi-paths inside the model in a machine readable way soon since their number increases and it would be important for consumers to understand what paths need to be taken into account.
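As one possible shape for such a formalization (purely a sketch, nothing that exists in the model today): each multi-path could be declared as a list of (property, expected type) steps, and a generic walker could aggregate the results. The `SPECIES_PATHS` declaration, the `type` key on instances, and both helper functions are assumptions made up for this illustration.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Path = List[Tuple[str, str]]

# Hypothetical declaration: every path that leads from a specimen to a species.
SPECIES_PATHS: List[Path] = [
    [("species", "Species")],                         # specimen -> species
    [("species", "Strain"), ("species", "Species")],  # specimen -> strain -> species
]


def follow(node: Node, path: Path) -> List[Node]:
    """Return all nodes reached from `node` along `path`, skipping type mismatches."""
    current = [node]
    for prop, expected_type in path:
        next_nodes = []
        for item in current:
            value = item.get(prop, [])
            values = value if isinstance(value, list) else [value]
            next_nodes.extend(v for v in values if v.get("type") == expected_type)
        current = next_nodes
    return current


def collect(node: Node, paths: List[Path]) -> List[Node]:
    """Aggregate the results of all declared paths without duplicates."""
    found: List[Node] = []
    for path in paths:
        for hit in follow(node, path):
            if hit not in found:
                found.append(hit)
    return found


subject_group = {
    "type": "SubjectGroup",
    "species": [
        {"type": "Species", "name": "Homo sapiens"},
        {"type": "Strain", "name": "C57BL",
         "species": {"type": "Species", "name": "Mus musculus"}},
    ],
}
print([hit["name"] for hit in collect(subject_group, SPECIES_PATHS)])
```

With a declaration like this, consumers would not need to hard-code which paths exist; they would only need the walker and the published path list.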

UlrikeS91 commented 2 years ago

Thanks @olinux for the feedback. It seems like everybody is leaning towards adjusting "species" for specimen(sets) to allow both core/strain and/or controlledTerm/species. So, I will update the schemas accordingly.

I also started testing the schema on some use cases and ran into some problems.

The "geneticStrainType" can be used to indicate whether a strain is e.g. a "wildtype", "knock-out", "knock-in", "transgenic", "floxed", etc.
Especially for terms like "floxed" this will not be enough. Example: A strain could have been modified so that a specific gene portion can be conditionally knocked out. The geneticStrainType could then be "floxed". When it is knocked out, both the geno- and phenotype of the animal change, but when it's not knocked out, the phenotype of the animal may be close to the wildtype phenotype of the background strain. Especially in such cases, these animals are often used as both the control group (not knocked out) and the test group (with the gene portion knocked out). This is, in fact, more important than knowing that this is a "conditional ready"-strain.

Since the schema is collecting things like stock numbers, it would seem wrong to just adjust for this by changing the "geneticStrainType" to include this information. Then two strain instances could look like this:

| name | geneticStrainType (or different name then) | stockNumber |
|---|---|---|
| conditionalStrainName | conditionally knocked-out | 000123 |
| conditionalStrainName | conditional ready or control | 000123 |

So, they could (and should) have the same name and stockNumber, but because one of them has been conditionally knocked out, they need to be captured separately. In principle, it wouldn't be wrong because following the conditional KO, they are indeed genetically different (and fit within the general definition of what a strain is, as far as I can see). But since the conditionally knocked-out animal cannot be bought but needs to be produced by the researcher in the lab, it doesn't seem right to solve it like this. We could remove the stockNumber, but since several experts in the field think that this is THE most reliable way of capturing strains, it wouldn't be the smart choice.

An alternative could be to add a property to specimen(set) that captures this additional procedural step, called e.g. "geneticState". Then we could have two subjects linking the same strain, but the geneticState tells whether the KO was performed or not.

| subject name | strain | geneticState |
|---|---|---|
| sub01_conditionalControl | conditionalStrainName | control |
| sub02_conditionalKO | conditionalStrainName | KO |

(not exactly following subject schema here...)
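Expressed as a small Python sketch, the "geneticState" alternative could look like this. The classes and the `genetic_state` field are hypothetical and, like the table, do not follow the actual subject schema; the point is only that both subjects link the very same strain instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strain:
    name: str
    genetic_strain_type: str
    stock_number: str  # simplified to a plain string for this sketch


@dataclass(frozen=True)
class Subject:
    name: str
    strain: Strain
    genetic_state: str  # hypothetical property, e.g. "control" or "KO"


# One strain instance, purchasable under a single stock number.
conditional = Strain("conditionalStrainName", "floxed", "000123")

# Two subjects share the strain and differ only in the procedural step.
sub01 = Subject("sub01_conditionalControl", conditional, "control")
sub02 = Subject("sub02_conditionalKO", conditional, "KO")

print(sub01.strain is sub02.strain, sub01.genetic_state, sub02.genetic_state)
```

This keeps name and stockNumber in one place on the strain, while the information about whether the KO was performed lives with the subject.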

We could also go in the direction of what we have done with specimen + specimenState but reverse the connection. The strain schema collects the attributes that will remain the same at all times, and an additional schema "strainState" would give the temporarily true information, but the strainState links to strain so that the strain can be shared by several states. And the states would be the ones linked on the specimen(set). This seems a bit overkill, though...

Any suggestions, ideas, comments: @lzehl, @Majpuc, @apdavison?

lzehl commented 2 years ago

To summarize our offline discussion @UlrikeS91 :

We hope you like this solution @Majpuc & @apdavison ?

tgbugs commented 2 years ago

@lydiang any thoughts on this relative to the NWB genotype schema?

lzehl commented 2 years ago

@tgbugs thanks for bringing this up. It would be great if we could directly map to NWB. @lydiang I'm looking forward to your comments :) (the strain schema can be found here: https://github.com/HumanBrainProject/openMINDS_core/blob/v4/schemas/research/strain.schema.tpl.json)

lzehl commented 2 years ago

I'll close this issue because, for now, we went with the schema https://github.com/HumanBrainProject/openMINDS_core/blob/v4/schemas/research/strain.schema.tpl.json.

@lydiang and @tgbugs feel free to raise a new issue for discussion