RFO: pangene specification and collections

StevenCannon-USDA commented 1 year ago

Request For Objections for a new specification for pan-gene collections.

Here is the draft spec, at datastore-specifications under Genus/GENUS/pangenes/.

The spec describes a set of initial pan-gene collections: Cicer, Glycine, Medicago, Phaseolus, and Vigna.

See the collections at the respective locations in the Data Store.

The pan-gene sets were generated using pandagma.

adf-ncgr commented 1 year ago

couple of minor points:

1. if we want to coin gensp-like prefixes for genera, we should add them to the analogous location as we have done for species, i.e. an abbrev attribute in the description_Genus.yml; I'm personally not in favor of using 5 letter here, since it seems to muddy the waters a bit when it comes to species like "Arachis chiquitana" or "Vigna nakashimae" (and what do we do with "Lens"?)
2. as far as the identifiers in the hsh/clst files, it looks like they are protein/transcript ids, not gene ids; do we want to leave it as an exercise for the user to get gene ids? (not always as simple as trimming a suffix) I'd vote for doing something along the lines of the gfa files where both are provided explicitly

not sure these are Objections as much as Quibbles, but respectfully submitted for consideration before the bill becomes a law.

StevenCannon-USDA commented 1 year ago

"I'm personally not in favor of using 5 letter here" -- I'm actually with you on this. My preference would be to use the full genus name. The procrustean bed gives an uncomfortable "glyci" and "arach", for example. And, yes, "lensx"?? Possibly a minor thing, but this would require changing the requirement in validate.sh .

sammyjava commented 1 year ago

I say just leave the capitalized Genus as the prefix. Unless we're set on no caps, then lower-case it. But there's no reason to stick with some fixed number of characters. Phaseolus.pan1.pan00001 or phaseolus.pan1.pan00001.

adf-ncgr commented 1 year ago

I like using first letter capitalized here; if nothing else, it would obviate the eventual collision of "vigna" as both a genus and a gensp (in the case of Vigna nakashimae). Also it makes them seem important. We could even use the voice of ultimate authority on the internet: capslock (VIGNA!) ;)
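
For illustration, a minimal sketch of building and parsing identifiers of the form `Phaseolus.pan1.pan00001` with a capitalized genus prefix. The helper names and the five-digit zero padding (inferred from the `pan00001` example above) are assumptions, not something the spec states.

```python
import re

# Pattern assumed from the example "Phaseolus.pan1.pan00001":
# capitalized genus, collection version, zero-padded pangene number.
PANGENE_ID = re.compile(r"^([A-Z][a-z]+)\.pan(\d+)\.pan(\d+)$")


def make_pangene_id(genus, version, number):
    """Build an ID such as 'Phaseolus.pan1.pan00001' (hypothetical helper)."""
    return f"{genus.capitalize()}.pan{version}.pan{number:05d}"


def parse_pangene_id(pangene_id):
    """Return (genus, version, number), or raise ValueError if malformed."""
    match = PANGENE_ID.match(pangene_id)
    if match is None:
        raise ValueError(f"not a pangene ID: {pangene_id!r}")
    genus, version, number = match.groups()
    return genus, int(version), int(number)
```

Because the pattern requires a leading capital, a lower-case `vigna.pan1.pan00001` is rejected, which keeps genus prefixes visibly distinct from five-letter gensp codes.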

adf-ncgr commented 1 year ago

starting to take a gander at mixed.pan3.YWTW/README.mixed.pan3.YWTW.yml with a view to getting together a GCV instance using these pangene assignments, I noticed that the list of genotypes provided doesn't include any indication of their species (or version). Seems like it would be convenient for READERs of the metadata for these mixed sets to change it to something more explicit of the full yuck to be found in the data files. Simplest could be moving from:

genotype:
  - Amsoy
  - DongNongNo_50
  - Wm82
  - G1134
  - G1267

to:

annotation:
  - glyma.Amsoy.gnm1.ann1
  - glyma.DongNongNo_50.gnm1.ann1
  - glyma.Wm82.gnm2.ann1
  - glyma.Wm82.gnm4.ann1
  - glydo.G1134.gnm1.ann1
  - glycy.G1267.gnm1.ann1

etc.

StevenCannon-USDA commented 1 year ago

Including the full prefix in the genotype array would best represent what's actually in the pan-gene set. Otherwise, multiple annotations collapse into e.g. Wm82, as your example shows, @adf-ncgr. Also, the bare genotype doesn't convey the species origin.

We would need to special-case this data type though, since "glyma.Wm82.gnm2.ann1" isn't properly a simple "genotype." Is that do-able, @sammyjava?
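
The collapse described above can be shown with a small sketch. It splits each full prefix into its parts and groups by bare genotype. The function names are hypothetical, and it assumes the first dot-separated field is the gensp code and the last two are the genome and annotation versions.

```python
from collections import defaultdict


def split_annotation_prefix(prefix):
    """Split e.g. 'glyma.Wm82.gnm2.ann1' into its four components.

    Assumes the first field is the gensp code and the last two are the
    genome and annotation versions; whatever lies between is the genotype.
    """
    gensp, rest = prefix.split(".", 1)
    genotype, genome, annotation = rest.rsplit(".", 2)
    return {"gensp": gensp, "genotype": genotype,
            "genome": genome, "annotation": annotation}


def group_by_genotype(prefixes):
    """Map each bare genotype to the full prefixes that share it."""
    groups = defaultdict(list)
    for prefix in prefixes:
        groups[split_annotation_prefix(prefix)["genotype"]].append(prefix)
    return dict(groups)


annotations = [
    "glyma.Amsoy.gnm1.ann1",
    "glyma.DongNongNo_50.gnm1.ann1",
    "glyma.Wm82.gnm2.ann1",
    "glyma.Wm82.gnm4.ann1",
    "glydo.G1134.gnm1.ann1",
    "glycy.G1267.gnm1.ann1",
]
groups = group_by_genotype(annotations)
print(groups["Wm82"])  # ['glyma.Wm82.gnm2.ann1', 'glyma.Wm82.gnm4.ann1']
```

Six annotations reduce to five genotype names, and the species code is lost entirely once only the genotype is kept.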

adf-ncgr commented 1 year ago

that's why I changed it from list "genotype" to list "annotation" - presumably this new datatype is going to be subject to new validation requirements anyway...

StevenCannon-USDA commented 1 year ago

Yeah, I like "annotation." Actually, the following variant might be even better, since it captures a key characteristic of the pan-gene set. The "main" annotations are the ones used to establish the clusters. The "extra" annotations are then added to the clusters by homology. This allows inclusion of funkier or more-distant annotations without distorting the underlying set structure.

annotations_main:
  - vigun.CB5-2.gnm1.ann1
  - vigun.IT97K-499-35.gnm1.ann2
  - vigun.Sanzi.gnm1.ann1
  - vigun.Suvita2.gnm1.ann1
  - vigun.TZ30.gnm1.ann2
  - vigun.UCR779.gnm1.ann1
  - vigun.ZN016.gnm1.ann2

annotations_extra:
  - vigun.IT97K-499-35.gnm1.ann1
  - vigan.Gyeongwon.gnm3.ann1
  - vigan.Shumari.gnm1.ann1
  - vigra.VC1973A.gnm6.ann1
  - vigra.VC1973A.gnm7.ann1

adf-ncgr commented 1 year ago

sounds good to me. Just FYI, I have started a preview GCV based on the glyci.mixed.pan3 set and wondered if it might make sense to make the hsh.tsv file more parallel in structure to the gfa file, in the sense that it would have gene first, then pangene (analogous to family), and then the protein/cds identifier (perhaps other columns of metadata could be added to describe goodness of fit later, similar to the gene family assignment scores). This would be a way of addressing my Quibble 2 which probably had its subconscious origin in my wanting to make a GCV.

FWIW (and dangerously close to hijacking the issue) the preview GCV is here: http://dev.lis.ncgr.org:50015/glycine_pan/gcv. This currently contains all of the Liu et al and perennial glycines, maybe a couple of other soybeans (I'll be adding more soon). I haven't dug in too much, although at a high level it seems to be making sense. The perennial glycines show a few quirks but that's probably to be expected. It would be nice to have a way of integrating gene family assignments along with the pangene assignments, in order to be able to see when tandem genes treated as different from the pan-perspective are similar enough to be considered family members (also to see when genes treated as orphans from the pangenes perspective have been assigned to some family).

sammyjava commented 1 year ago

There should be an icon, like a masked guy with a gun, for when a thread gets hijacked. Maybe a beret and AK-47. Something like that.

StevenCannon-USDA commented 1 year ago

OK - I have made the changes corresponding with what I think is the current consensus. So far, I have done this only for Cicer (Cicer/GENUS/Cicer.pan1.SV8C/). A few potentially disruptive things:

- None of the filenames have a genotype. This means they have a three-part yuck (Cicer.pan1.KEY4), rather than four or five-part elsewhere (cicar.ICC4958.gnm2.bg5m, cicar.ICC4958.gnm2.ann1.LCVX)
- There is no "species" field in the README
- Rather than genotype in the README, we have e.g.:

annotations_main:
  - cicar.CDCFrontier.gnm3.ann1
  - cicar.ICC4958.gnm2.ann1
  - cicec.S2Drd065.gnm1.ann1
  - cicre.Besev079.gnm1.ann1

annotations_extra:
  - cicar.CDCFrontier.gnm1.ann1
  - cicar.CDCFrontier.gnm2.ann1

Let me know if this passes muster. If so, I'll process the other pangene sets similarly. (These are all done with ds_souschef.pl + config).

sammyjava commented 1 year ago

The spec says "list of genes" in the clust.tsv file, but it's actually a list of proteins. Which will it be? (I prefer proteins since it's more reliable to strip the protein suffix than to presume that the protein suffix is .1).

Cicer.pan1.pan00001     cicar.CDCFrontier.gnm3.ann1.Ca2g036100.1        cicar.CDCFrontier.gnm3.ann1.Ca2g036500.1        cicar.CDCFrontier.gnm3.ann1.Ca2g037100.1        cicec.S2Drd065.gnm1.ann1.Ce2g038300.1   cicre.Besev079.gnm1.ann1.Cr2g038300.1      cicre.Besev079.gnm1.ann1.Cr2g037900.1   cicar.ICC4958.gnm2.ann1.Ca_04511.1      cicec.S2Drd065.gnm1.ann1.Ce0g197600.1   cicar.CDCFrontier.gnm1.ann1.Ca_20032.1  cicar.CDCFrontier.gnm1.ann1.Ca_22388.1     cicar.CDCFrontier.gnm1.ann1.Ca_27918.1  cicar.CDCFrontier.gnm2.ann1.Ca_03257.1  cicar.CDCFrontier.gnm2.ann1.Ca_24765.1
Cicer.pan1.pan00002     cicar.CDCFrontier.gnm3.ann1.Ca3g160700.1        cicar.ICC4958.gnm2.ann1.Ca_06696.1      cicar.CDCFrontier.gnm3.ann1.Ca3g160600.1        cicec.S2Drd065.gnm1.ann1.Ce3g172100.1   cicec.S2Drd065.gnm1.ann1.Ce3g001500.1      cicre.Besev079.gnm1.ann1.Cr3g175500.1   cicar.CDCFrontier.gnm1.ann1.Ca_06009.1  cicar.CDCFrontier.gnm1.ann1.Ca_06010.1  cicar.CDCFrontier.gnm2.ann1.Ca_06937.1  cicar.CDCFrontier.gnm2.ann1.Ca_06938.1
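
A minimal sketch of reading rows like the two above and flattening them into one (protein, pangene) pair per member. It assumes whitespace-separated fields with the pangene ID first; the function names and the column order of the flattened rows are illustrative assumptions, not the documented hsh layout.

```python
def read_clusters(lines):
    """Parse cluster rows: pangene ID first, then its member protein IDs."""
    clusters = {}
    for line in lines:
        fields = line.split()  # tolerant of tabs or runs of spaces
        if not fields:
            continue  # skip blank lines
        clusters[fields[0]] = fields[1:]
    return clusters


def to_hash_rows(clusters):
    """Flatten clusters into (protein_id, pangene_id) pairs, one per member."""
    return [(protein, pangene)
            for pangene, members in clusters.items()
            for protein in members]


# First two members of the Cicer.pan1.pan00002 row, truncated for brevity.
example = [
    "Cicer.pan1.pan00002\t"
    "cicar.CDCFrontier.gnm3.ann1.Ca3g160700.1\t"
    "cicar.ICC4958.gnm2.ann1.Ca_06696.1",
]
rows = to_hash_rows(read_clusters(example))
for protein, pangene in rows:
    print(protein, pangene, sep="\t")
```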

StevenCannon-USDA commented 1 year ago

It should be proteins (and CDS) rather than genes. I'll make that correction in the MANIFEST.descriptions and spec.

sammyjava commented 1 year ago

> OK - I have made the changes corresponding with what I think is the current consensus. So far, I have done this only for Cicer (Cicer/GENUS/Cicer.pan1.SV8C/). A few potentially disruptive things...

Shouldn't it be Cicer/GENUS/pangenes/Cicer.pan1.SV8C ?

StevenCannon-USDA commented 1 year ago

@sammyjava -- oops - yes! Will correct that shortly ...

sammyjava commented 1 year ago

It looks like you've added two lists to the pangenes README:

annotations_main:
  - cicar.CDCFrontier.gnm3.ann1
  - cicar.ICC4958.gnm2.ann1
  - cicec.S2Drd065.gnm1.ann1
  - cicre.Besev079.gnm1.ann1

annotations_extra:
  - cicar.CDCFrontier.gnm1.ann1
  - cicar.CDCFrontier.gnm2.ann1

Those of course make the README fail validation until I add them. But also I should break out the individual files into separate files like GENUS/pangenes/clust.md since the README is meant to describe the README. :) Happy to do so, just checking that these two new YAML lists are truly legit before I get into it.

adf-ncgr commented 1 year ago

I thought this is what we had agreed to earlier in the thread. I am fine with the clst files having only the protein/CDS ids, but maintain my earlier request to have gene ids added as an extra column to the hsh files since trimming off suffixes to get gene ids is not reliable and this seems completely analogous to what we do for gfa files.
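
A sketch of the requested three-column layout (gene, pangene, protein), taking gene IDs from an explicit lookup instead of trimming suffixes. The lookup table here is hypothetical; in practice it would have to come from the annotation itself. Proteins without a known gene are reported and left out of the rows.

```python
def add_gene_column(hash_rows, protein_to_gene):
    """Turn (protein, pangene) pairs into (gene, pangene, protein) rows.

    Gene IDs come from an explicit lookup rather than from trimming the
    protein suffix; proteins missing from the lookup are reported, not guessed.
    """
    rows, missing = [], []
    for protein, pangene in hash_rows:
        gene = protein_to_gene.get(protein)
        if gene is None:
            missing.append(protein)
        else:
            rows.append((gene, pangene, protein))
    return rows, missing


# Hypothetical lookup with a single entry, for illustration only.
protein_to_gene = {
    "cicar.CDCFrontier.gnm3.ann1.Ca2g036100.1":
        "cicar.CDCFrontier.gnm3.ann1.Ca2g036100",
}
hash_rows = [
    ("cicar.CDCFrontier.gnm3.ann1.Ca2g036100.1", "Cicer.pan1.pan00001"),
    ("cicar.ICC4958.gnm2.ann1.Ca_04511.1", "Cicer.pan1.pan00001"),
]
rows, missing = add_gene_column(hash_rows, protein_to_gene)
```

Returning the unmatched proteins separately makes gaps in the lookup visible rather than silently filling them with a trimmed ID.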

StevenCannon-USDA commented 1 year ago

"these two new YAML lists are truly legit" -- I like them. The main potential objection I can imagine is that they are particular to this pipeline ... but then all of these results ARE particular to this pipeline. I guess if we describe the results of another pan-gene tool, we will likely use just "annotations_main."

"have gene ids added as an extra column to the hsh files" -- OK, but I probably won't get to this for a week or so, since it will require changes to pandagma and then to ds_souschef and the configs ... and I'll be traveling this weekend. But in principle, I am on board.

adf-ncgr commented 1 year ago

no worries on delay, I'm just glad our principles are on the same good ship (legumepop).

sammyjava commented 1 year ago

> "these two new YAML lists are truly legit" -- I like them. The main potential objection I can imagine is that they are particular to this pipeline ... but then all of these results ARE particular to this pipeline. I guess if we describe the results of another pan-gene tool, we will likely use just "annotations_main."

Yeah, it's common to have README attributes specific to a type of collection. I just need to add them to the Java code, flagging them as not required for non-pangenes READMEs. (And presumably required for pangenes READMEs.)

I'll also reorganize the spec a bit so we have explicit file specs in the directory tree.
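
The actual validator is Java and is not shown in this thread. As a language-neutral illustration only, here is a Python sketch of the rule described: the lists are checked for pangenes READMEs and ignored otherwise. The function name, the prefix pattern, treating `annotations_extra` as optional, and the no-overlap check are all assumptions.

```python
import re

# Assumed shape of an annotation prefix, e.g. "cicar.ICC4958.gnm2.ann1".
ANNOTATION_PREFIX = re.compile(r"^[a-z]+\.\S+\.gnm\d+\.ann\d+$")


def check_annotation_lists(readme, is_pangene_collection):
    """Return a list of problems found in the two annotation lists."""
    problems = []
    if not is_pangene_collection:
        return problems  # the lists are not required for other collections
    main = readme.get("annotations_main") or []
    extra = readme.get("annotations_extra") or []  # assumed optional
    if not main:
        problems.append("annotations_main is required and must not be empty")
    for key, values in (("annotations_main", main),
                        ("annotations_extra", extra)):
        for value in values:
            if not ANNOTATION_PREFIX.match(str(value)):
                problems.append(f"{key}: malformed entry {value!r}")
    # Assumption: an annotation is either main or extra, never both.
    for value in sorted(set(main) & set(extra)):
        problems.append(f"{value} appears in both lists")
    return problems


readme = {
    "annotations_main": [
        "cicar.CDCFrontier.gnm3.ann1",
        "cicar.ICC4958.gnm2.ann1",
        "cicec.S2Drd065.gnm1.ann1",
        "cicre.Besev079.gnm1.ann1",
    ],
    "annotations_extra": [
        "cicar.CDCFrontier.gnm1.ann1",
        "cicar.CDCFrontier.gnm2.ann1",
    ],
}
problems = check_annotation_lists(readme, is_pangene_collection=True)
```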

StevenCannon-USDA commented 1 year ago

Closing this, since we seem to have come to consensus, and pan-gene collections are in place.

sammyjava commented 1 year ago

Just to be sure, please confirm that https://github.com/legumeinfo/datastore-specifications/tree/main/Genus/GENUS/pangenes is correct. Since I'll use that for validation.

StevenCannon-USDA commented 1 year ago

@sammyjava - Checked and confirmed. I made minor tweaks to the readme and example files, but nothing substantive. The README describes what is in the collections.

legumeinfo / datastore-specifications

RFO: pangene specification and collections #39
