legumeinfo / glycinemine

An InterMine for Glycine species
GNU Lesser General Public License v3.0

Display human friendly GWAS dataset names #27

Closed adf-ncgr closed 4 years ago

adf-ncgr commented 4 years ago

From @cann0010 and @maxglycine: the GWAS dataset title (KGK20170808.1) should preferably be a human-readable alias, e.g. "Zhang et al. 2016a".

sammyjava commented 4 years ago

@cann0010 Decide exactly what you want for this alias, and I'll add it to the GWAS file format as a new parameter. It's difficult and probably unreliable to form "Zhang et al. 2016" from the related Publication, so I'll make an explicit new "alias" parameter for display purposes.

StevenCannon-USDA commented 4 years ago

Enlisting Rex's help on this, as this alias is for consistency with SoyBase. (Not yet a member of this project; have reinvited, and will email separately.)

sammyjava commented 4 years ago

Well, let's not use SoyBase as the guiding light on this spec, since we'll have GWAS from sundry other legumes as well. Let's use something that's fundamentally good. We've got the SoyBase ID (KGK20170808.1) for linking over to SoyBase, which I'll implement; the mine presentation should be designed for any and all species.

sammyjava commented 4 years ago

That being said, I think Author, et al Year is probably the only thing we can come up with that is common to all datasets, whatever their particular unique identifier happens to be.

StevenCannon-USDA commented 4 years ago

If it's used for display rather than link-outs (i.e. we have some flexibility), then my preference would be to follow the pattern that we've used at LegumeInfo and PeanutBase for citations: first two authors and year, with the year always having a trailing letter, e.g. [Smith, Brown et al., 2015a] or [Checa and Blair, 2008a] or [Anderson, 2019a]. The reason for the trailing letter is that Smith and Brown may be on a roll and generate two first-author publications in 2015. See other examples in the left column here: https://peanutbase.org/search/qtl or https://legumeinfo.org/search/qtl. This form has the advantage of being what we're collecting in the collection template.

sammyjava commented 4 years ago

Well that means the datasets are interconnected: "b" means another exists with "a". We don't want to interconnect the datasets. They should stand alone. You should be able to just add one without having to look at others. So ix-nay on the year-end letters, the alias should be a standalone thing that is created without dependence on the number of GWAS papers Smith and Brown published.

sammyjava commented 4 years ago

Note that this is an ALIAS -- the actual unique identifier is something else. So there's no problem if Smith and Brown published five GWAS papers in 2015. Each one will say "Smith, Brown et al. 2015" as a descriptive element, but the GWAS datasets must have a unique identifier which could simply be a KEY4 from the LIS key repo.

StevenCannon-USDA commented 4 years ago

ix-nay the etter-lay - ok. Pretty rare edge case, I guess.

sammyjava commented 4 years ago

Erp, can't be the KEY4, that's in the directory name that holds all the GWAS datasets. Could be anything, though.

sammyjava commented 4 years ago

For reference, this is the metadata so far in the GWAS files:

TaxonID 3847
Name    LBC20180625.3
PlatformName    SoySNP50K
PlatformDetails SoySNP50K iSelect Bead Chip
DOI 10.1186/s12864-016-2487-7

this one would add

Alias   Chang, Brown et al. 2016

I think I would change "Name" to "Identifier" (more consistent with other mine classes) and "Alias" to "Name" (also consistent).

Alias can be anything of course, if you want "Yo momma's favorite GWAS", so be it.
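For illustration only, here is a minimal Python sketch of reading that metadata and applying the proposed renames ("Name" to "Identifier", "Alias" to "Name"). It assumes each metadata line is a key followed by whitespace and then the value; the function names are hypothetical and not part of any existing loader.

```python
# Hypothetical sketch: read key/value metadata lines and apply the proposed renames.
# Assumption: each line is "<key><whitespace><value>"; the real file format may differ.

def parse_gwas_metadata(lines):
    """Return a dict of metadata, splitting each line at the first run of whitespace."""
    metadata = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and lines without a value
        key, value = parts
        metadata[key] = value
    return metadata


def apply_proposed_renames(metadata):
    """Rename "Name" to "Identifier" and "Alias" to "Name" without clobbering either."""
    renamed = {k: v for k, v in metadata.items() if k not in ("Name", "Alias")}
    if "Name" in metadata:
        renamed["Identifier"] = metadata["Name"]
    if "Alias" in metadata:
        renamed["Name"] = metadata["Alias"]
    return renamed


header = [
    "TaxonID\t3847",
    "Name\tLBC20180625.3",
    "PlatformName\tSoySNP50K",
    "PlatformDetails\tSoySNP50K iSelect Bead Chip",
    "DOI\t10.1186/s12864-016-2487-7",
    "Alias\tChang, Brown et al. 2016",
]
meta = parse_gwas_metadata(header)
renamed = apply_proposed_renames(meta)
```

Building a new dict rather than renaming in place avoids the old "Name" value being overwritten by the alias before it has been copied to "Identifier".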

adf-ncgr commented 4 years ago

Friends, I'm a little unclear on whether we are talking about:

a) something that will be done programmatically by a loader based on some other piece of info already in place, or
b) something that a curator will do, hopefully adhering to some guidelines.

In the latter case, I think we're left with pretty much "anything goes" (like Sam's example above).

sammyjava commented 4 years ago

It's a piece of metadata I add to the GWAS files. So anything goes, but I'd like Steven to tell me what that "anything" is. That's all.

sammyjava commented 4 years ago

And I think we can just go with "First, Second et al. Year" and be done with it. I had just brought this up since Steven said "e.g." and I wanted an actual definition.
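As a rough sketch of what "First, Second et al. Year" could mean in practice, the Python below builds the display alias from a list of author surnames and a year. The one-author and two-author forms are assumptions borrowed from the earlier examples in this thread (minus the trailing letter and the comma before the year); `format_alias` is a hypothetical name.

```python
# Hypothetical sketch of the "First, Second et al. Year" display alias.
# Assumptions: surnames are supplied in publication order; one- and two-author
# papers follow the "Anderson" and "Checa and Blair" patterns quoted earlier.

def format_alias(surnames, year):
    """Build a human-readable, not necessarily unique, alias for a dataset."""
    if not surnames:
        raise ValueError("at least one author surname is required")
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = f"{surnames[0]} and {surnames[1]}"
    else:
        authors = f"{surnames[0]}, {surnames[1]} et al."
    return f"{authors} {year}"


alias_many = format_alias(["Chang", "Brown", "Smith"], 2016)
alias_two = format_alias(["Checa", "Blair"], 2008)
alias_one = format_alias(["Anderson"], 2019)
```

Because the alias is only a descriptive element, two datasets may legitimately end up with the same string; uniqueness is left to the separate identifier.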

adf-ncgr commented 4 years ago

apparently, you're the curator so "anything (you say) goes"

sammyjava commented 4 years ago

I think I'll name them after English Football League teams.

StevenCannon-USDA commented 4 years ago

If the Alias/Name is just for display, then potential collisions aren't a stopper, and we can go with the two-authors-and-year protocol (can flesh it out in excruciating detail, but the idea is simple).

For the Identifier, that does need to be unique. For linking to existing SoyBase records, it should be whatever SoyBase has used (I don't know what that naming algorithm is, but it looks like type.date.number: LBC20180523.4). For existing records in LegumeInfo and PeanutBase, Chado isn't much help: it just assigns an auto increment ID.

So, for a human-assigned Identifier, I would assign a license plate ... which requires a registry of some sort. I guess I would recommend the Data Store registry - since we aren't going to run out of 26^4 strings. This would just require noting which keys are being used.

adf-ncgr commented 4 years ago

@cann0010 I may be confused but it sounds like you are going against the comment @sammyjava made earlier

Erp, can't be the KEY4, that's in the directory name that holds all the GWAS datasets. Could be anything, though.

and suggesting we use "license plates" (aka KEY4) not only for the gwas folder itself:

Glycine_max/mixed.gwas1.1W14

but also for the files within it:

glyma.mixed.gwas1.1W14.KGK20170707-1.gwas.tsv
glyma.mixed.gwas1.1W14.KGK20170711-1.gwas.tsv
glyma.mixed.gwas1.1W14.KGK20170714-1.gwas.tsv
...

so that in addition to the "traditional" use of the folder-level KEY4 (here: 1W14) we would also have KEY4s (from the same registry) used as identifiers in place of what is shown above as "KGK20170707-1"?? (at least in cases where we are not inheriting someone else's human-assigned identifiers). If that's correct, let's plan to discuss on Wednesday so everyone's on the same page. If that's not correct, I guess please clarify now or else on Wednesday.

At least we are all clear now that they MAY be labelled as "Manchester, United 2010b" :)

StevenCannon-USDA commented 4 years ago

Not going against the comment that "can't be the KEY4, that's in the directory name that holds all the GWAS datasets," but suggesting an additional use of the KEY4. That is: if we are looking for unique identifiers, the conventional options are a numeric key or a random string. I don't have a strong preference; we just need a way to ensure uniqueness.

adf-ncgr commented 4 years ago

OK, I guess I wasn't confused (though perhaps I stated my understanding of what you were suggesting badly). Personally, it seems confusing to me to use the same registry to supply random strings used for slightly different purposes, but as long as curators can understand and follow the rules I guess I don't see the proposal leading to any actual problem. That's a big "as long as," though.

sammyjava commented 4 years ago

Nahhhh: we should create a separate registry for unique identifiers of datasets, not reuse the KEY4 registry for directories. It's a different thing, and mixing them would cause massive confusion and long LIS meeting discussions. It's easy, I'm happy to do it: I'll implement it with a length other than four characters (say three or five) to make clear it's a different type of identifier. There are plenty in stock.

The unique SoyBase IDs are already in use, nothing changes for SoyBase GWAS. So all I do is add the agreed-upon user-friendly name/alias from the publication. When more GWAS comes down from other places we'll make use of the dataset identifier registry.

And @adf-ncgr Man U is in the EPL, not the EFL. My team, Sheffield Wednesday is in the EFL. Surprisingly, our Steel City rivals, Sheffield United, are in the EPL this year and doing quite well, the wankers.

I'm closing this issue since I can now implement the solution. Go Owls!!!!

adf-ncgr commented 4 years ago

I think you make a great hooligan. For my further edification, is the scope of uniqueness of the new identifiers of your proposed solution just the specific GWAS folder, or will it be unique across GWAS folders, or some larger scope (ie any dataset that won't get its own folder but needs an identifier will drink from this well)?

sammyjava commented 4 years ago

It'll be unique in the entire LIS universe. Like the KEY4 directory identifiers. Because why not. Yes, I was thinking of a general dataset unique identifier repository, not just for GWAS.
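A minimal sketch of such a dataset identifier registry, in Python, might look like the following. Everything here is an assumption for illustration: the class name, the five-character length, and the alphabet (uppercase letters plus digits) are not taken from any existing LIS tool, and a real registry would persist the issued keys somewhere shared.

```python
# Hypothetical sketch of a registry that issues unique dataset identifiers.
# Assumptions: identifiers are random strings of uppercase letters and digits,
# with a length other than four so they cannot be mistaken for a KEY4.
import random
import string

ALPHABET = string.ascii_uppercase + string.digits


class DatasetIdRegistry:
    def __init__(self, length=5, used=None, rng=None):
        if length == 4:
            raise ValueError("length 4 is reserved for KEY4 directory keys")
        self.length = length
        self.used = set(used or [])          # identifiers already handed out
        self.rng = rng or random.Random()

    def issue(self):
        """Return a new identifier that has not been issued before."""
        if len(self.used) >= len(ALPHABET) ** self.length:
            raise RuntimeError("registry exhausted")
        while True:
            candidate = "".join(self.rng.choice(ALPHABET) for _ in range(self.length))
            if candidate not in self.used:
                self.used.add(candidate)
                return candidate


registry = DatasetIdRegistry(rng=random.Random(0))
issued = [registry.issue() for _ in range(100)]
```

A curator adding a dataset only needs to ask the registry for the next free key, so datasets stay independent of one another.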

sammyjava commented 4 years ago

BUT NOT INCLUDED IN THE FILE NAMES, JUST THE METADATA!!! 'cause I'd have to shoot myself if we got into the file naming thing with this.

adf-ncgr commented 4 years ago

I was about to +1 that, but on second thought it seems inconsistent with how the SoyBase files are being handled (i.e. they have their dataset identifiers in the filenames, so why wouldn't the non-SoyBase files?). @sammyjava if you're thinking about shooting yourself now, please call me - you're not alone ;)

sammyjava commented 4 years ago

Ahh good point, that's the only thing that gives them different names. So in general, if a directory holds a bunch of similar files, like GWAS, they have to have different filenames, so it may as well be the unique identifier. @cann0010 you may not be aware of this, but this is how the GWAS files are stored, using the identifier in their names to make them distinct files. For GWAS from other sources I'd put the "dataset registry identifier" in there between KEY4 and gwas.tsv if there was no identifier provided by the source (as there is by SoyBase).

glyma.mixed.gwas1.1W14.KGK20170707-1.gwas.tsv
glyma.mixed.gwas1.1W14.KGK20170711-1.gwas.tsv
...
glyma.mixed.gwas1.1W14.SST20180209-1.gwas.tsv
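The naming pattern in the listing above could be sketched in Python as below. This is only an illustration: the function names are hypothetical, and the dot-to-hyphen swap is an assumption based on the listing showing identifiers like KGK20170707-1 where the metadata shows the dotted form (e.g. LBC20180625.3).

```python
# Hypothetical sketch of the GWAS file naming pattern:
#   <collection prefix ending in KEY4>.<dataset identifier>.gwas.tsv
# Assumption: dots in an identifier are swapped for hyphens so they do not
# clash with the dot used as the field separator.

SUFFIX = "gwas.tsv"


def gwas_filename(collection_prefix, dataset_id):
    """Build a file name such as glyma.mixed.gwas1.1W14.KGK20170707-1.gwas.tsv."""
    safe_id = dataset_id.replace(".", "-")
    return f"{collection_prefix}.{safe_id}.{SUFFIX}"


def split_gwas_filename(filename):
    """Split a GWAS file name into collection prefix, KEY4 and dataset identifier."""
    end = "." + SUFFIX
    if not filename.endswith(end):
        raise ValueError(f"not a GWAS file name: {filename}")
    parts = filename[: -len(end)].split(".")
    if len(parts) < 3:
        raise ValueError(f"too few fields in: {filename}")
    return {
        "collection_prefix": ".".join(parts[:-1]),
        "key4": parts[-2],
        "dataset_id": parts[-1],
    }


built = gwas_filename("glyma.mixed.gwas1.1W14", "KGK20170707.1")
fields = split_gwas_filename("glyma.mixed.gwas1.1W14.SST20180209-1.gwas.tsv")
```

The same functions would work for a registry-issued identifier from a non-SoyBase source, since the identifier simply fills the slot between the KEY4 and gwas.tsv.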

Now I think we're done!!! Go Owls!

StevenCannon-USDA commented 4 years ago

Accepted. Or "Go Bears" (to quote from Fargo)