legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Request for clarification #1

Closed adf-ncgr closed 2 years ago

adf-ncgr commented 3 years ago

from https://legumeinfo.org/data/about_the_data_store/templates/template__README.collection_name.yml

genotype:
  - Williams82
# array of Genotype names for this data, if applicable; if a bi-parental map, use Strain1 x Strain2 as a single "genotype" entry

Do we expect (require?) that the value given here will be identical to that in the folder/file naming? Or would we expect (require?) it to match the "strain.name" as shown below? (Note that the example above doesn't quite match either one.)

##### 
strain.identifier:  Wm82
strain.accession:   PI 518671
strain.name:    Williams 82
...

one reason I ask is that some new genomes were posted by Wei. One of these is "kind of Williams82", as indicated by the collection/file naming Wm82_IGA1008, but the genotype value in the README (https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Wm82_IGA1008.gnm1.5CQQ/README.Wm82_IGA1008.gnm1.5CQQ.yml) is just Wm82, and there is no entry yet in https://v1.legumefederation.org/data/v2/Glycine/max/about_this_collection/strains_Glycine_max.yml, probably because Wei doesn't know about this additional requirement.

speaking on behalf of the curator's union, I'd like to suggest that we minimize the redundancy if possible or at least clarify how the various redundant bits ought to relate to one another (ideally in such a way that it could be validated)

sammyjava commented 3 years ago

As far as the mines go, it's just free text loaded into a very simple Population object (which is just a holder for the given string). The genotype values are not parsed into strains (which differs from the past). "Kinda sorta Williams82" would be perfectly fine. Here's what's loaded for cowpea genetics:

Population.Identifier
--
189-entry diversity panel
24-125-B-1
524-B
B301
Bambey21
CB27
CB27 x 24-125-B-1
CB27 x IT82E-18
CB27 x IT97K-556-6
CB27 × IT82E-18/Big Buff
CB3
CB46
CB46 x IT93K-503-1
Danila
G32
GreenpackDG
INIA-41
IT00K-1263
IT82E-18
IT84S-2049
IT84S-2049 x UCR779
IT84S-2246
IT84S-2246-4 x TVu-14676
IT89KD-288
IT93K-503-1
IT93K-503-1 x UCR779
IT97K-499-35
IT97K-556-6
IT98K-1111-1
IT99K-573-1-1 x TVNu-1158
IronClay
MAGIC population of 305 F8 RILs
MAGIC-2017
Melakh
Mouride
Moussa
Sanzi
Sanzi x Vita 7
Suvita2
TVu-14676
TVu-15426
TVu-7778
TVu-9522
UC-Riverside diversity minicore
UCR707
UCR779
Vita7
Yacine
ZJ282
ZJ60
ZN016
ZN016 x Zhijiang282

adf-ncgr commented 3 years ago

OK, thanks. I think my request for clarification is mostly intended for @cann0010 but it's good to see how you are handling the situation as it stands. Creating "populations" for accessions ("Sanzi") primarily (I guess) because we are overloading "genotype" in the READMEs seems a bit weird, but I guess I can see why you would do it.

Part of the reason I started this in the first place was because a collaborator pointed out that https://v1.legumefederation.org/data/v2/Glycine/max/genomes/Huaxia3_IGA1007.gnm1.RGGN/README.Huaxia3_IGA1007.gnm1.RGGN.yml has: genotype: Hefeng 25 which is clearly just a mistake (and I hereby let @weihuang12 know), but got me thinking about validation and what @weihuang12 should actually correct it with!

sammyjava commented 3 years ago

Populations can be many things other than single accessions. For example "a whole boatload of plants we picked off this hill" or "192 RILs". So it's a very generic identifier, which allows us to actually load genetic experiments that have all sorts of populations and names for them.

sammyjava commented 3 years ago

That being said, of course, we also have Sanzi loaded as a Strain since it comes in with a genome. The occurrence of "Sanzi" in a genetic genotype is not tied to the strain "Sanzi". This is fine with me since it's actually a pretty special case when you look at the variety of genotyping populations. I think I've fixed all the READMEs for cowpea genetics, but I haven't done the final check of a built mine, since I don't have a built mine quite yet.

adf-ncgr commented 3 years ago

yes, my point was only that we typically don't think of a single accession as a population, though there is a sense in which it is one. It sounds like you are saying that "Sanzi" the "genetic genotype" is coming in from files under genetics and "Sanzi" the "strain" is coming in from genomes/annotations. It seems logical to want to connect them in some way, but if we don't want to extend the model to bridge the worlds I guess that's fine.

sammyjava commented 3 years ago

So sometimes I list a whole bunch of lines in the genotypes array. So "Sanzi" comes down amongst many. GeneticMap.populations is a collection. And yes, we're not going to try to find little spots of data that happen to match up. I'm having a hard enough time getting stuff to merge as it is.

Whether we do the big list o' genotypes in genetic experiments is another question; maybe we shouldn't do that. Here's where Sanzi comes in: it's one of the 37 accessions that were used to build the Cowpea iSelect Consortium Array, which seemed important enough that I listed them:

identifier: iSelect-consensus-2016.gen.Muñoz-Amatriaín_Mirebrahim_2017
synopsis: "Illumina Cowpea iSelect Consortium Array, built from 37 cowpea accessions"
genotype:
 - 24-125-B-1
 - 524-B
 - B301
 - Bambey21
 - CB27
 - CB3
 - CB46
 - Danila
 - G32
 - GreenpackDG
 - INIA-41
 - IronClay
 - IT00K-1263
 - IT82E-18
 - IT84S-2049
 - IT84S-2246
 - IT89KD-288
 - IT93K-503-1
 - IT97K-499-35
 - IT97K-556-6
 - IT98K-1111-1
 - Melakh
 - Mouride
 - Moussa
 - Sanzi
 - Suvita2
 - TVu-14676
 - TVu-15426
 - TVu-7778
 - TVu-9522
 - UCR707
 - UCR779
 - Vita7
 - Yacine
 - ZJ282
 - ZJ60
 - ZN016

sammyjava commented 3 years ago

If you don't like that, I'm happy to change it to "37 cowpea accessions".

adf-ncgr commented 3 years ago

OK, thanks that clarifies things a bit (though I think my initial RFC was really about genotype in genome/annotation files, so this seems like another topic if you are handling genetics READMEs differently in your loading). Don't change anything just yet, this seems like a joint "curation+mine-management" decision for someday soon.

sammyjava commented 3 years ago

FYI READMEs are handled with exactly the same parser across the board. I really really do like to keep things simple, you know.

StevenCannon-USDA commented 3 years ago

just getting to this now .... To the original question, I think I would handle that case as follows, under the assumption that genotype == strain.identifier, which seems to be how we've handled strains to this point: https://legumeinfo.org/data/about_the_data_store/templates/template__strains_Genus_species.yml

README.Wm82_IGA1008.gnm1.5CQQ.yml:
genotype:
  - Wm82_IGA1008

about_this_collection/strains_Glycine_max.yml: 
##### 
strain.identifier:  Wm82_IGA1008
strain.accession:   IGA1008
strain.name:    Williams 82

But I don't know how strains are handled in the mines. For the cowpea "37 accessions" case, is it important to get these represented in the strains file?

sammyjava commented 3 years ago

There is no longer any connection between "genotype" in the README and Strain.identifier. Strain identifiers come from the yuck prefix. Genotype is a very variable thing, and most strains used in genetic experiments have no other data associated with them.

As for the "about this collection" identifier, it MUST match the chunk used in the yuck prefix. Those are used to merge them. And those are pretty much carved in stone at this point since they're in filenames all over the place.
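The merge rule above lends itself to a simple curator-side check. The following is only a minimal sketch, not the mine loader's actual code. It assumes that the strain chunk is the first dot-separated field of a collection name such as `Wm82_IGA1008.gnm1.5CQQ`, and that the strains file lists one `strain.identifier:` line per record, as in the excerpts quoted earlier in this thread. All function names and the `SAMPLE_STRAINS` text are hypothetical.

```python
import re
from typing import Optional, Set

# Hypothetical excerpt in the style of strains_Glycine_max.yml (taken from this thread).
SAMPLE_STRAINS = """\
#####
strain.identifier:  Wm82
strain.accession:   PI 518671
strain.name:    Williams 82
"""

def strain_from_collection(collection_name: str) -> str:
    """Return the strain chunk, ASSUMED to be the first dot-separated field."""
    return collection_name.split(".", 1)[0]

def strain_identifiers(strains_text: str) -> Set[str]:
    """Collect every strain.identifier value found in the strains file text."""
    pattern = re.compile(r"^strain\.identifier:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)
    return set(pattern.findall(strains_text))

def missing_strain(collection_name: str, strains_text: str) -> Optional[str]:
    """Return the strain chunk if it has no matching strain.identifier, else None."""
    strain = strain_from_collection(collection_name)
    return None if strain in strain_identifiers(strains_text) else strain

# The collection from this thread has no matching entry in the sample text.
print(missing_strain("Wm82_IGA1008.gnm1.5CQQ", SAMPLE_STRAINS))  # prints Wm82_IGA1008
```

A check like this only reports the mismatch; whether to add a strains entry or rename the collection remains a curation decision.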

Please keep in mind that ANYTHING we do in the datastore has to be supported by merging rules in the mine loaders. Everything is connected, and the datastore must be viewed as a highly relational database. You do not have the freedom to change something in one place without seeing how it affects the loading of data from the entire datastore.

With that in mind, I am no longer entertaining major changes to datastore specs. I've got months of coding behind the current setup and there will have to be really, really strong reasons to change things. In other words, I have ruthlessly ascended to Datastore Czar -- because I want to get done with the new mine builds, get folks moving to imjs, and move onto something else that isn't so boring.

adf-ncgr commented 3 years ago

this isn't a request for change to the spec, it's a request for clarification on how it is to be interpreted so that curators know how to do the right thing.

adf-ncgr commented 3 years ago

it sounds like from what you've said earlier, you have chosen to interpret it as:

- README genotype -> population.identifier (regardless of whether it is a genome/annotation or something in genetics)
- yuck prefix component after gensp -> strain.identifier (dictates merge of descriptive info to the genomes pertaining to it but has nothing to do with "genetics" entities)

is that an accurate depiction of your current code logic?

sammyjava commented 3 years ago

Yes. In practice, now, things loaded as Strain have an associated genome assembly. But I load all the Strain entries in the about_this_collection file, whether or not they wind up having assemblies and annotations.

And I do think now that I should stick to crosses or some other identifier "37 Cowpea iSelect lines" to make this clearer. So that's a change, not a clarification.
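The interpretation confirmed here can be illustrated with a small sketch. It is not the loader's real data model. The `Population` and `Strain` classes, both helper functions, and the example prefix `vigun.Sanzi.gnm1` are hypothetical, and de-duplicating repeated genotype entries is an assumption. The point it shows is that a genotype string becomes a free-text population, while a strain comes only from the file-name prefix, and the two are never linked even when the text is identical.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class Population:
    identifier: str  # free text taken from a README genotype entry

@dataclass(frozen=True)
class Strain:
    identifier: str  # chunk after gensp in the file-name prefix

def populations_from_genotypes(genotypes: Iterable[str]) -> List[Population]:
    """Wrap each genotype string in a Population, keeping first-seen order.

    Dropping repeated entries is an assumption made for this sketch.
    """
    seen = set()
    result = []
    for name in genotypes:
        if name not in seen:
            seen.add(name)
            result.append(Population(name))
    return result

def strain_from_prefix(prefix: str) -> Strain:
    """Take the second dot-separated field of an ASSUMED 'gensp.strain...' prefix."""
    return Strain(prefix.split(".")[1])

pops = populations_from_genotypes(["Sanzi", "CB27 x IT82E-18", "Sanzi"])
strain = strain_from_prefix("vigun.Sanzi.gnm1")  # hypothetical prefix

# Same text, different entities: nothing ties the population to the strain.
print(pops[0].identifier == strain.identifier)  # prints True
print(pops[0] == strain)                        # prints False
```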

adf-ncgr commented 3 years ago

a change in the spec or in the way you have curated? If the former, the Czar will have your head! (just like poor old Stepan Razin...)

sammyjava commented 3 years ago

The README template just says to use "Strain1 x Strain2" in the case of biparental crosses, so it could probably use some clarification. But also curation, of course, which I'm going to do. We do have experiments with multiple biparental crosses, so the genotype array is important, but we don't need to list all 37 lines used to create the iSelect chip.

sammyjava commented 2 years ago

I think this is closeable. The bottom line is that the strain name in gensp.strain should be consistent, e.g. "Wm82" or "Tifrunner". The genotypes in the genotype list for mappings, etc. do not have to match those, and are much greater in number than the strains for which we have genomes/annotations. So "Williams 82" would be OK (including the space, which we don't use in gensp.strain). There may be some cleanup work, but that usually presents itself in the mines and can be handled as standalone issues.