RFO: Directory and file name structure for gene family collections

StevenCannon-USDA commented 7 months ago

While preparing to calculate a new gene family set, I notice that the 2018 gene family collection uses a naming pattern that is (I think) inconsistent with the general pattern throughout the Data Store. Undoubtedly my fault - though the naming scheme may not have been fully gelled as of early 2018.

The pattern that I think we should be following, for consistency, is: /strain.type.KEY/gensp.strain.type.KEY.filetype.suf. Instead, we had /legume.genefam.fam1.M65K/; genefam is extraneous, and mixed is the term we've been using when the strain is ... mixed. The label legume would be appropriate to use in the gensp position in the filenames.

I have provisionally renamed the collection as follows:

Directory:
      s/legume.genefam.fam1.M65K/mixed.fam1.M65K/
Files:
      s/legume.genefam.fam1.M65K./legume.mixed.fam1.M65K./

(By "provisionally" I really mean: I've made this change; if there are strong objections, then I'll revert.)
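For readers who want to map old names to new ones, here is a minimal Python sketch of the two substitutions above. It only illustrates the mapping and is not the command that was actually run; the function names are hypothetical, and the dots are treated as literal characters rather than regex wildcards.

```python
OLD_NAME = "legume.genefam.fam1.M65K"


def rename_directory(name: str) -> str:
    """Mirror s/legume.genefam.fam1.M65K/mixed.fam1.M65K/ (first match only)."""
    return name.replace(OLD_NAME, "mixed.fam1.M65K", 1)


def rename_file(name: str) -> str:
    """Mirror s/legume.genefam.fam1.M65K./legume.mixed.fam1.M65K./ (first match only)."""
    return name.replace(OLD_NAME + ".", "legume.mixed.fam1.M65K.", 1)
```

Names that do not contain the old prefix pass through unchanged, so the functions are safe to apply to a whole directory listing.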

I have not fixed the yuck prefixes within the files. In fact, I hope we can treat these as "legacy" and just replace them with new families, which would follow the naming (and prefixing) pattern above. The new families should be ready within about a week (early February). @adf-ncgr @sammyjava

sammyjava commented 7 months ago

Fine with me! I load 'em whatever they're called! Just be sure to rename the phylotree nodes when you rename the families.

adf-ncgr commented 7 months ago

I don't have any objections, although I have some ambivalence about including a "strain" slot in the id since presumably gene families will always be mixed strain and I think we decided we wouldn't include that in pangene set identifiers for the same reason? I think we argued that there was no compelling reason that identifier schemes for different data types needed to be similar; but I could be misremembering. In any case, I agree we should grandfather the old identifiers in the files- it would be a massive PITA to change them in all of the GFA files and the various places those have been consumed!

StevenCannon-USDA commented 7 months ago

@adf-ncgr - Good point about "mixed" in the collection name. Would be nice for it to be something more useful. I'll hope for inspiration before we need to make the new family collection. (Bonus if you happen to be the source of the inspiration.)

adf-ncgr commented 7 months ago

Well, I don't know how inspiring this is, but all I really meant to suggest is that we remove that "field" from the ids (not replace it with something besides "mixed"). So, just as we have Arachis.pan2 we'd have legume.fam2 and leave it at that. Maybe we could use an extra bit of yuck for some other purpose here, but it doesn't seem necessary to have it be any fixed length to be considered "yucky enough".

StevenCannon-USDA commented 7 months ago

> just as we have Arachis.pan2 we'd have legume.fam2 and leave it at that.

That's inspiration enough for me. I'll go with that.

StevenCannon-USDA commented 7 months ago

Adding @svengato here, since this change impacts the Funnotate/phylogram & Lorax.

The "problem" I am trying to address with this renaming is that legume.genefam.fam1.M65K arguably has a duplicative "field", in genefam. The collection name should probably have three parts, like a genome collection, rather than four parts like annotations. I don't particularly care when it comes to the current gene families, but for the new set, I'd like to aim for a consistent, logical naming pattern.

My initial proposal (which I partly "implemented" with a renaming) was e.g., mixed.fam1.M65K/legume.mixed.fam1.M65K.family_fasta.tar.gz -- but I take @adf-ncgr's point that "presumably gene families will always be mixed strain and I think we decided we wouldn't include that in pangene set identifiers for the same reason." Specifically, this form ... genefamilies/legume.fam1.M65K/legume.fam1.M65K.family_fasta.tar.gz ... would be analogous to what we have in the pangene sets: pangenes/Phaseolus.pan2.G5HV/Phaseolus.pan2.G5HV.inclusive_cds.fna.gz
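A small Python sketch of the three-part form discussed above, assuming the collection name is simply prefix.typeversion.KEY4 and that each file repeats the collection name before its file-type suffix. The function and parameter names are hypothetical and are not part of any Data Store tooling.

```python
def collection_name(prefix: str, type_version: str, key4: str) -> str:
    """Three-part collection name, e.g. legume.fam1.M65K."""
    return f"{prefix}.{type_version}.{key4}"


def collection_file_path(datatype_dir: str, prefix: str, type_version: str,
                         key4: str, filetype_suffix: str) -> str:
    """Path of one file inside a collection, with the collection name repeated in the file name."""
    name = collection_name(prefix, type_version, key4)
    return f"{datatype_dir}/{name}/{name}.{filetype_suffix}"
```

With these assumptions the same function reproduces both the proposed gene family path and the existing pangene path quoted above, which is the analogy being drawn.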

I also take @adf-ncgr's point: "it would be a massive PITA to change them in all of the GFA files and the various places those have been consumed!"

So, I think there are two questions here: (1) Is the proposed naming scheme genefamilies/legume.fam1.M65K/legume.fam1.M65K.....gz acceptable? (2) Should we apply this to the current mixed.fam1.M65K, OR revert to legume.genefam.fam1.M65K ... and consider the reverted collection "grandfathered"?

Whatever we decide on point 2, I am willing to do the renaming (either forward to legume.fam1.M65K, or backward to legume.genefam.fam1.M65K).

[edit - since I can't do things right the first time] [Edited again, to remove "mixed" from the filename in genefamilies/legume.fam1.M65K/legume.fam1.M65K.....gz]

sammyjava commented 7 months ago

WARNING: expanding to DS in general

I think the opposite: genome collections are incompletely identified. The collection Hwangkeum.gnm1.4S83 says nowhere that it is, in fact, a genome. Yes, it has an assembly version, gnm1, but that could be ScoobyDoo123 if there were some reason to preserve that assembly version from the original source.

Same with annotation collections. Hwangkeum.gnm1.ann1.1G4F is known to US to be an annotation because it has an extra field with an annotation version and no other collection-defining identifier. But that could also be CruellaDeVille7. We only know it's an annotation because it does NOT have something like .mrk. or .gwas. to indicate that it is something else.

So WE know that Wm82.ScoobyDoo123.CruellaDeVille7.ABCD is an annotation collection, but there is no way anyone could discern that if they weren't working for LIS.

So I think we have some inconsistency/incompleteness in our collection naming which renders those collections non-Findable, which is the first F in FAIR.

(Also, as you know, I think KEY4 is spurious and clutters/complicates our naming. So far I have found no actual purpose for it since the stuff preceding the KEY4 is already unique.)

UPDATE: OK, technically they're findable because a URL is a URL. Could be a random alphanumeric string for that matter. But the collection identifiers do not always self-identify what they are, or to which genus and/or species they belong, for that matter. But they include four characters that serve no identifying purpose whatsoever.

StevenCannon-USDA commented 7 months ago

@sammyjava - I think the counterargument regarding incompletely identified Hwangkeum.gnm1.4S83 is that it sits in Glycine/max/genomes/.

You might say "Yes, but how about if someone receives a bare-naked Hwangkeum.gnm1.4S83.tar.gz, so they don't see that it usually lives in genomes?"

I would say: No problem, because the README within the collection, README.Hwangkeum.gnm1.4S83.yml, describes the contents: synopsis: Glycine max genotype Hwangkeum genome assembly v1.0

And my argument again for the utility of 4S83 here: It is a funky string that aids users in Findability and provenance. If someone stumbles across a file glyma.Hwangkeum.gnm1.4S83.genome_main.fna, a search of 4S83 is very likely to help them find the associated metadata, which says what this file is, what its predecessor is, where the predecessor came from, and how we modified it. Testing this: a Google search on Hwangkeum.gnm1.4S83 takes us to https://soybase.org/data/v2/Glycine/max/genomes/

adf-ncgr commented 7 months ago

couple of quick comments (NOT intended to prolong the agony!):

in the proposed question (1), "Is the proposed naming scheme genefamilies/legume.fam1.M65K/legume.mixed.fam1.M65K.....gz acceptable?", I think (hope) the inclusion of "mixed" in the file name was an oversight?
I was under the impression that if someone wanted to version their genome as "ScoobyDoo" we would at a minimum call it "gnmScoobyDoo", ie the gnm and ann bits are actually considered required parts of the naming scheme.
given that genomes/gnm in our paths are arguably redundant, we may want to revisit why we elide the "gensp" from our file naming conventions (for reasons that will be described elsewhere in due time...)

sammyjava commented 7 months ago

But they don't know where to get the file from. And computer programs shouldn't have to parse the internals of READMEs to determine the provenance of identifiers. Etc.

I think you're thinking in terms of human beings looking at files and directories, not automated processes. I write automated processes and I find a lot of difficulty with these issues. Yes, I can drill down to a README because I'm a human being reading it off of a URL in my browser, but that's not what I'm talking about. I'm talking about well-self-identifying identifiers. That's all. From the identifier of a tarball (as you suggest) one should be able to say: "this is a genome assembly for the genus Glycine, species max, accession Williams82, version gnm1" simply from the fields in the identifier.

I don't expect to convince you of any of this. But you know how I feel about it.

P.S. You'll never convince me that KEY4 is really useful and worthy of a field in our identifiers. So I'll just drop that complaint.

sammyjava commented 7 months ago

> I was under the impression that if someone wanted to version their genome as "ScoobyDoo" we would at a minimum call it "gnmScoobyDoo", ie the gnm and ann bits are actually considered required parts of the naming scheme.

Yeah, but that's totally inconsistent with the practice for diversity, expression, gwas, maps, markers, pangenomes, qtls, synteny, etc.

Why is it OK to use inconsistent naming syntax at LIS?

StevenCannon-USDA commented 7 months ago

@adf-ncgr - yep, my error. I intended genefamilies/legume.fam1.M65K/legume.fam1.M65K.....gz
"the gnm and ann bits are actually considered required parts of the naming scheme" -- I consider them so.
"we may want to revisit why we elide the "gensp" from our file naming conventions" -- you mean: why the gensp is not included in the metadata files, e.g. README.CDCFrontier.gnm3.QT0P.yml? - I now agree that this is an inconsistency. Whether the change is worth the effort ... I don't know.
@sammyjava - because I am terrified of hobgoblins :-). "A foolish consistency is the hobgoblin of little minds." Like garlic for vampires, a little inconsistency helps keep them away. (Too much inconsistency drives the good hobgoblins away, so we try to strike a balance.)

sammyjava commented 7 months ago

> @sammyjava - because I am terrified of hobgoblins :-). "A foolish consistency is the hobgoblin of little minds." Like garlic for vampires, a little inconsistency helps keep them away. (Too much inconsistency drives the good hobgoblins away, so we try to strike a balance.)

Fine. I'm out.

StevenCannon-USDA commented 7 months ago

> Fine. I'm out.

In jest I hope. Seriously, I'd like to try for more consistency -- and in fact, that's the intent of this RFO; but the real balance to be struck is with tradeoffs between semantically opaque UUIDs and semantically meaningful ones - at the cost of the potential for length and messiness of the meaningful identifiers. Things like human names and quirky publication choices give us things like BenningPI595645_x_DanbaekkongPI619083.qtl.Warrington_Abdel-Haleem_2015 and LD00-2817P_x_LDX01-1-65.qtl.Valdés-López_Thibivilliers_2011. The scheme works, but some of these names make SM5H appealing.

I am also open to large, fundamental changes -- but recognizing that larger changes have larger implementation costs.

StevenCannon-USDA commented 7 months ago

Returning to the spirit of the RFO, I propose to complete the gene family renaming, including all files within tarballs, to: genefamilies/legume.fam1.M65K/legume.fam1.M65K.....gz ... and name the next collection as genefamilies/legume.fam2.KEY4/legume.fam1.KEY4.....gz

I am open to objections, including reverting to the previous pattern (effectively, adding back the "extra" field); but absent objections, I'll go ahead with the renaming - probably at the end of the week.

I am also open to counter RFOs (or proposals/discussions) to deal with the inconsistencies that you've raised, datastore-wide, @sammyjava - but I think those would be better discussed in a separate issue.

adf-ncgr commented 7 months ago

> Why is it OK to use inconsistent naming syntax at LIS?

Hmm, I was just trying to note that someone could know gnm1 is a genome if they had its files independent of the context of datastore structure because "gnm" is supposed to be the part that provides the info (the fact that it is followed by version info in the case of genomes and not in other cases like "mrk" is just because at one point we all agreed that different mrk sets aren't versions of one another). I think all of the datatypes are consistent in that the datatype indicated by the containing folder is given some sort of representation in the filenames (e.g. "genomes" -> gnm+version, "markers" -> mrk, "pangenomes" -> pan+version...)
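To make that mapping concrete, here is a rough Python sketch that infers a datatype from the type token in an identifier. It assumes tokens are lowercase letters optionally followed by a version number; the gnm, mrk and pan entries come from the examples above, while the "ann" -> "annotations" entry and the function name are assumptions added for illustration.

```python
import re
from typing import Optional

# Type token -> datatype folder. "ann" -> "annotations" is an assumed folder name.
TOKEN_TO_DATATYPE = {
    "gnm": "genomes",
    "ann": "annotations",
    "mrk": "markers",
    "pan": "pangenomes",
}


def datatype_from_identifier(identifier: str) -> Optional[str]:
    """Return the datatype implied by the last recognized type token, or None."""
    # Scan right to left so that gnm1.ann1 is read as an annotation, not a genome.
    for field in reversed(identifier.split(".")):
        match = re.fullmatch(r"([a-z]+)(\d*)", field)
        if match and match.group(1) in TOKEN_TO_DATATYPE:
            return TOKEN_TO_DATATYPE[match.group(1)]
    return None
```

Under these assumptions Hwangkeum.gnm1.4S83 resolves to a genome and Hwangkeum.gnm1.ann1.1G4F to an annotation, while Wm82.ScoobyDoo123.CruellaDeVille7.ABCD resolves to nothing, which is the case the required gnm and ann tokens are meant to prevent.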

@StevenCannon-USDA regarding the "gensp", the initial driver for considering this change was actually https://github.com/legumeinfo/microservices/issues/616 after I realized that there's no way for us to know that LD00-2817P_x_LDX01-1-65.qtl.Valdés-López_Thibivilliers_2011 should link to glycinemine (as lovely as that identifier is in all other respects!). It's probably not an insurmountable difficulty, but it would be nice to be able to have the linkout specification be something like "any request for a QTL that is prefixed by glyma should go to https://mines.legumeinfo.org/glycinemine/qtlstudy:${STUDY_ID}" rather than having to do it per qtlstudy.
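A minimal sketch of what such a prefix-based rule could look like, in Python. This is not the linkout service's actual specification format: the table, the function name and the single glyma entry are hypothetical, and the URL template is the "something like" example from the paragraph above.

```python
from typing import Optional

# Hypothetical routing table: gensp prefix -> URL template for QTL studies.
QTL_LINKOUT_TEMPLATES = {
    "glyma": "https://mines.legumeinfo.org/glycinemine/qtlstudy:{study_id}",
}


def qtl_linkout(study_id: str) -> Optional[str]:
    """Pick a linkout URL from the identifier's first dot-separated field."""
    prefix = study_id.split(".", 1)[0]
    template = QTL_LINKOUT_TEMPLATES.get(prefix)
    if template is None:
        return None  # no gensp prefix, so there is nothing to route on
    return template.format(study_id=study_id)
```

An identifier without a gensp prefix returns None here, which is the situation described above: without the prefix, each study would need its own rule.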

But @sammyjava raised the additional point that it's hard to recognize Hwangkeum.gnm1.4S83 as a soybean genome without having recourse to some additional level of indirection, which is arguably unFAIR at some level. I do think it's recognizable from the id as a genome ("gnm") just not a glyma gnm. The linkout service hasn't had a problem with gnm/ann being this way because it operates off the identifiers of the contents of the files, not the files themselves, and we do inject the gensp there.

The linkouts issue could potentially be solved other ways, but it seemed worth revisiting the naming- after all, as far as revisiting naming, you started it! ;)

StevenCannon-USDA commented 7 months ago

@adf-ncgr - so is the proposal: glyma.LD00-2817P_x_LDX01-1-65.qtl.Valdés-López_Thibivilliers_2011/ and glyma.Hwangkeum.gnm1.4S83/ ... and the primary rationale to help guide the linkout service to GlycineMine in this case?

adf-ncgr commented 7 months ago

I would have to defer to @sammyjava as to the exact requirement, since it's not 100% clear to me where he gets the identifier for the QTL/GWAS studies from (ie the thing that the trait search tool gets from the mine and will pass along to the linkout service). I think it is actually the identifier attribute given in the README, but for validation's sake that value is also required to match the folder name. And once you change the folder name, adding it to the file names of README (and similar) files seems to follow logically (and makes the naming within folders more consistent). The initial rationale was to help the linkout specification, and is also my primary rationale. But I have some ambivalence about sweeping changes as well and am totally open to further discussion about the cost/benefit of other approaches.
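The matching requirement can be sketched in Python as below. This assumes the README's identifier value has already been read into a string and that README files follow the README.<collection>.yml pattern seen earlier in the thread; the function names are hypothetical and this is not the actual validator.

```python
from pathlib import PurePosixPath
from typing import List


def expected_readme_name(collection_dir: str) -> str:
    """README file name implied by the collection folder name."""
    return f"README.{PurePosixPath(collection_dir).name}.yml"


def check_identifier(collection_dir: str, readme_identifier: str) -> List[str]:
    """Return a list of problems; an empty list means the identifier matches the folder."""
    folder_name = PurePosixPath(collection_dir).name
    problems = []
    if readme_identifier != folder_name:
        problems.append(
            f"README identifier {readme_identifier!r} does not match folder {folder_name!r}"
        )
    return problems
```

Because the check is a plain string comparison, renaming a folder (for example, to add a gensp prefix) would also require updating the identifier in its README, and by the same logic the README file name.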

adf-ncgr commented 7 months ago

no objection from me to the proposed naming change for gene families (just as long as you don't really put "legume.fam1.KEY4.....gz" under "legume.fam2.KEY4" :) )

legumeinfo / datastore-issues

RFO: Directory and file name structure for gene family collections #192