Draft spec for pangene search

StevenCannon-USDA commented 1 year ago

Please see draft spec for pangenes search query - to find ~paralogous/allelic genes (corresponding by homology and synteny): https://github.com/legumeinfo/website-ui-specs/tree/main/pangenes-search

... and provide feedback. Please respond via this issue. @sammyjava @That-Thing @maxglycine @jd-campbell @alancleary @adf-ncgr @sdash-github

The pangene sets we have in the Data Store currently are for: Arachis, Cicer, Glycine, Medicago, Phaseolus, Vigna. I've tried to make the spec suitable for use at LegumeInfo, SoyBase, and PeanutBase.

This spec may again come before the mine backend is ready ... but it sounds like it is on the way.

sammyjava commented 1 year ago

Yeah, the mine 5.1.0.3 graphql-server is ready, and we can test against the dev MiniMine, which is on 5.1.0.3. So nothing holding us back pangene set-wise. The dev MiniMine is at https://mines.dev.lis.ncgr.org/minimine/begin.do

sammyjava commented 1 year ago

FYI, here's what PanGeneSet looks like in the graphql-server branch, just a bucket o' genes and proteins.

<class name="PanGeneSet" extends="Annotatable" is-interface="true">
        <collection name="dataSets" referenced-type="DataSet"/>
        <collection name="genes" referenced-type="Gene" reverse-reference="panGeneSets"/>
        <collection name="proteins" referenced-type="Protein" reverse-reference="panGeneSets"/>
</class>

type PanGeneSet implements Annotatable {
  ## Annotatable
  id: ID!
  identifier: ID!
  ontologyAnnotations: [OntologyAnnotation!]!
  publications: [Publication!]!
  ## PanGeneSet
  dataSets: [DataSet]
  genes: [Gene]
  proteins: [Protein]
}
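
For anyone wanting to poke at this, here is a minimal sketch of how a client might build a GraphQL request for the members of one pangene set. Only the PanGeneSet field names (`identifier`, `genes`, `proteins`) come from the type above. The root field `panGeneSet`, its `identifier` argument, the `identifier` field on Gene and Protein, and the sample ID are all assumptions for illustration, not the server's confirmed API.

```python
import json

# Assumption: a root field named `panGeneSet` that takes an `identifier`
# argument. Gene/Protein are also assumed to expose `identifier`.
PANGENE_SET_QUERY = """
query PanGeneSetMembers($identifier: ID!) {
  panGeneSet(identifier: $identifier) {
    identifier
    genes { identifier }
    proteins { identifier }
  }
}
"""


def build_request_body(pangene_set_id: str) -> str:
    """Return the JSON body for a GraphQL POST request (no network call)."""
    return json.dumps({
        "query": PANGENE_SET_QUERY,
        "variables": {"identifier": pangene_set_id},
    })


# "example.pan00001" is a made-up placeholder ID.
print(build_request_body("example.pan00001"))
```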

adf-ncgr commented 1 year ago

Thanks, @StevenCannon-USDA. I have a couple of minor (maybe) comments/questions on the initial spec:

- The results you show seem to display transcript/protein isoform IDs. Is this intended, or should we focus on just the gene IDs in what we present (seems cleaner to me)?
- Might we want to provide additional details about the member genes, such as their locations or sizes (e.g. to give at least a crude sense of variability)?
- Is there any implied sorting in how the pangene members are listed?
- Should the accession dropdown support multi-selection (e.g. suppose I want to get allelic comparisons between two favorite lines)? Also note that your first example seems to imply it shouldn't be a dropdown, but a text box matched as "contains".
- Might we want to make explicit when a given accession is absent from a pangene set? E.g. suppose I wanted to know about genes that are missing from my favorite soybean line: would I want to get empty pangene representations for those pangene sets in which a selected accession does not occur, or simply not get them in the returned results?
- Would we want a linkout for the set of genes belonging to a pangene (e.g. pushing them to the GCV multi-alignment view or to an InterMine list)?
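
To make the multi-selection and "absent accession" questions concrete, here is a minimal sketch of one possible behavior: group a pangene set's genes by accession and return an empty list for any selected accession that has no member. The input shape (a dict of gene ID to accession name) and all names are hypothetical, and this is only one of the two options raised above (the other being to drop such sets from the results).

```python
from collections import defaultdict


def members_by_accession(gene_accessions, selected):
    """Group the gene IDs of one pangene set by selected accession.

    gene_accessions: dict of gene ID -> accession name (hypothetical shape).
    selected: list of accessions the user picked (multi-selection).
    A selected accession with no member gene maps to an empty list, so its
    absence is explicit rather than silently dropped.
    """
    grouped = defaultdict(list)
    for gene_id, accession in gene_accessions.items():
        if accession in selected:
            grouped[accession].append(gene_id)
    # Sort gene IDs so the listing order is deterministic.
    return {acc: sorted(grouped[acc]) for acc in selected}


# Made-up gene IDs and accession names, for illustration only.
genes = {"geneA1": "lineA", "geneA2": "lineA", "geneB1": "lineB"}
print(members_by_accession(genes, ["lineA", "lineC"]))
```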

Some of these are probably just things to think about for future iterations.

maxglycine commented 1 year ago

We may want to add an output option to download query results to the user's computer. A query could return a large number of identifiers, and the user may want to save them. Otherwise, the user would have to copy HTML text and paste it somewhere.
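
As a rough illustration of the download idea, here is a minimal sketch that serializes search results to tab-separated text, one gene per row. The result shape (pangene set ID mapped to a list of gene IDs), the column names, and the choice of TSV are assumptions, not part of the spec.

```python
import csv
import io


def results_to_tsv(results):
    """Serialize {pangene_set_id: [gene_id, ...]} to TSV text, one gene per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["pangene_set", "gene"])  # hypothetical column names
    for set_id, gene_ids in results.items():
        for gene_id in gene_ids:
            writer.writerow([set_id, gene_id])
    return buffer.getvalue()


# Made-up identifiers, for illustration only.
print(results_to_tsv({"pan1": ["g1", "g2"]}))
```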

sammyjava commented 1 year ago

The "Genes in this pangene set" count would be best implemented by adding a "size" attribute to the PanGeneSet object in the mines and populating it in a post-processor, as we do with GeneFamily. That attribute is not currently present in PanGeneSet in 5.1.0.3, nor are any of the other aggregate quantities that we have in GeneFamily 5.1.0.3:

<class name="GeneFamily" extends="Annotatable" is-interface="true" term="">
        <attribute name="description" type="java.lang.String"/>
        <attribute name="version" type="java.lang.String"/>
        <attribute name="size" type="java.lang.Integer"/>
        <reference name="phylotree" referenced-type="Phylotree" reverse-reference="geneFamily"/>
        <collection name="genes" referenced-type="Gene"/>
        <collection name="proteins" referenced-type="Protein"/>
        <collection name="proteinDomains" referenced-type="ProteinDomain" reverse-reference="geneFamilies"/>
        <collection name="dataSets" referenced-type="DataSet"/>
        <collection name="tallies" referenced-type="GeneFamilyTally" reverse-reference="geneFamily"/>
</class>

If this is a Big Deal, stop me from building 5.1.0.3 mines. GlycineMine 5.1.0.3 is almost built, took two weeks.
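
For reference, the "size" arithmetic itself is simple. This is a minimal sketch of the idea only, not mine post-processor code: it assumes size means the number of distinct genes in a set, and the input shape is hypothetical.

```python
def with_size(pangene_sets):
    """Return {set_id: {"genes": [...], "size": n}}.

    Assumption: size is the number of distinct gene IDs in the set.
    """
    return {
        set_id: {"genes": list(genes), "size": len(set(genes))}
        for set_id, genes in pangene_sets.items()
    }


# Made-up identifiers; the duplicate "g2" is counted once.
print(with_size({"pan1": ["g1", "g2", "g2"]}))
```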

sammyjava commented 1 year ago

> We may want to add an output option to download query results to the user's computer. A query could return a large number of identifiers, and the user may want to save them. Otherwise, the user would have to copy HTML text and paste it somewhere.

This sounds like an across-the-board option that, like pagination, would be implemented for all results output. Thoughts, @alancleary? After all, we all remember that "Every page should have a download button!" :)

StevenCannon-USDA commented 1 year ago

@sammyjava - "Genes in this pangene set" - I would say "not a big deal" (not a high priority in the first implementation).

sammyjava commented 1 year ago

@StevenCannon-USDA I'm a bit confused about the scope of this search. Are you saying that we'll have a list of pangene sets, each with its corresponding genes listed below it? For example, what happens if the only search element is "Glycine", all else left blank? A gigantic list of all Glycine pangene-sets with their genes? (Which is fine, if that's what you want.)

sammyjava commented 1 year ago

And, if so, are you specifying that pagination be on a pangene-set-to-pangene-set basis? Each page displays a single pangene set? (That's just setting the page size to 1, which is easy. The list of genes within a pangene set would be part of that pangene set record's display.) Just want some detail on pagination expectations when we've got results which are a list of lists.
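
To pin down what "pagination on a pangene-set basis" could mean, here is a minimal sketch where the page size counts pangene sets, and each record carries its full gene list. The record shape and function name are hypothetical; the default page size of 1 mirrors the single-set-per-page case described above.

```python
def paginate(pangene_sets, page, page_size=1):
    """Return the records on a 1-based page.

    pangene_sets: list of (set_id, [gene_id, ...]) records (hypothetical shape).
    page_size counts pangene sets, not genes; gene lists are never split.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return pangene_sets[start:start + page_size]


# Made-up identifiers, for illustration only.
records = [("pan1", ["g1", "g2"]), ("pan2", ["g3"]), ("pan3", [])]
print(paginate(records, 2))
```

A page past the end simply comes back empty, which leaves it to the UI to decide how to show "no more results".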

Draft spec for pangene search #9