Draft spec for pangeneset-based gene list translation UI

adf-ncgr commented 4 months ago

Initial thoughts are here: https://github.com/legumeinfo/website-ui-specs/tree/main/pangeneset-based-gene-id-translation. Feel free to propose changes in this issue or signify your consent (silence by the end of the week will imply consent).

Minor note: I tried using @sdash-github's plantuml for the UI mockup (as described here: https://plantuml.com/salt); not sure if we'll decide to adopt this for more complex cases, but it seemed worth a try. If nothing else, it reminded me what a PITA specifying UI layout through nested tables can be (one of the reasons I never became a web developer, I think, though I know that's no longer the way...)

alancleary commented 4 months ago

How important is it that the UI match what you've mocked up? It'll be easier (and stylistically consistent with our existing components) if the entire form, including the gene list, is above the results table.

Also, note that the GraphQL endpoint that was previously prototyped for this only returns genes, with no notion of which output genes correspond to which input genes. This can be remedied either by introducing new types into the GraphQL API or by requesting the output genes' pan gene sets and their genes and then doing some post-processing. Neither is ideal. An alternative is to revise the GraphQL endpoint such that it gets pan genes for a single input gene. The UI element would then send multiple requests to this endpoint when the form is submitted, one for each gene in the input list.

Perhaps this belongs under "future enhancements"; do you want the table to be sortable by one or more columns?

adf-ncgr commented 4 months ago

It's not important to match the UI layout, that just seemed compact to me.

Having the actual correspondences between input genes and results is critical. The intermine queries I had specified did this; I'll have to double-check whether they can be adapted to work with the "ONE OF" constraint instead of "IN LIST".

Not sure what your third paragraph refers to; seems like your thought got truncated there?

Sortable tables are always appreciated, but I think it's OK to leave that for a future enhancement.

StevenCannon-USDA commented 4 months ago

I second all parts of the response by @adf-ncgr. Regarding sorting: I'd say it's not even an enhancement in this case. My vote is to plan for "never sort."

The main requirements are:

- Accept an input list of gene IDs and a target annotation accession+version
- Return a table* that indicates correspondences between the input IDs, a pangene ID, and the target accession.
- The table* needs to have a syntax that allows for one-to-null and one-to-many relationships.
- Downloadable result (this might be OK to handle as a future enhancement, but it will be immediately desired)

Caveat regarding table* for the output: other structures might be acceptable, but a table is what will be most familiar to users. The additional wrinkle is representing one-to-many relationships, e.g.

Q1 pan9 A
Q2 pan8 B
Q3 pan1 none
Q4 pan2 C, D, E

Also, we'll need to be able to report an error message for query genes that are not found.

alancleary commented 4 months ago

> Having the actual correspondences between input genes and results is critical. The intermine queries I had specified did this; I'll have to double-check whether they can be adapted to work with the "ONE OF" constraint instead of "IN LIST".

Right. My point is that just because we can craft an intermine query that will give us all the results in one request doesn't mean it translates well into GraphQL. The most canonical way to handle this in GraphQL is via filtering. This would make our API more expressive, rather than more esoteric.

> The additional wrinkle is representing one-to-many relationships, e.g.
> Q1 pan9 A
> Q2 pan8 B
> Q3 pan1 none
> Q4 pan2 C, D, E

@adf-ncgr's example in the UI spec handles the one-to-many relationship by adding multiple rows for an input gene if it has multiple output genes, one row for each output gene. Your example here adds a single row and puts all of the output genes into the third column. Which way should this be handled? I lean towards what's currently in the spec as it prevents horizontal overflow when a particular input gene has many result genes.
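
To make the comparison concrete, here is a rough Python sketch of the two layouts. The input shape (a list of `(query, pangene, targets)` tuples) and both function names are assumptions made for illustration only; nothing here comes from the spec or the GraphQL API.

```python
from typing import List, Optional, Tuple

# Hypothetical input shape: (query gene, pangene set ID, target genes).
Correspondence = Tuple[str, Optional[str], List[str]]


def rows_per_pair(data: List[Correspondence]) -> List[Tuple[str, str, str]]:
    """One row per (query, target) pair, as in the current spec."""
    rows = []
    for query, pangene, targets in data:
        # One-to-null: keep the query visible with a "none" placeholder.
        for target in targets or ["none"]:
            rows.append((query, pangene or "none", target))
    return rows


def rows_collapsed(data: List[Correspondence]) -> List[Tuple[str, str, str]]:
    """One row per query, with all targets joined in the third column."""
    return [
        (query, pangene or "none", ", ".join(targets) or "none")
        for query, pangene, targets in data
    ]


# The example table from the comment above.
example = [
    ("Q1", "pan9", ["A"]),
    ("Q2", "pan8", ["B"]),
    ("Q3", "pan1", []),
    ("Q4", "pan2", ["C", "D", "E"]),
]
```

With the example table, the row-per-pair layout produces six rows and the collapsed layout produces four. The collapsed third column is the one that grows horizontally as the number of targets increases.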

> Also, we'll need to be able to report an error message for query genes not found

The generic search component that this web component will be built on already supports this, although this particular error should be mentioned in the spec as it's specific to this component.

> Not sure what your third paragraph refers to; seems like your thought got truncated there?

Sorry about that; remnants of a discarded thought. I edited it out.

adf-ncgr commented 4 months ago

> Right. My point is that just because we can craft an intermine query that will give us all the results in one request doesn't mean it translates well into GraphQL. The most canonical way to handle this in GraphQL is via filtering. This would make our API more expressive, rather than more esoteric.

I'm confused. Didn't we implement the ONE OF constraint specifically for this purpose? The GraphQL page you linked seems fairly similar if you're talking about this bit:

query {
  queryPost(filter: {
    id: ["0x1", "0x2", "0x3", "0x4"],
  }) {
    id
    title
    text
    datePublished
  }
}

Regarding how to handle 1-to-many, I don't feel super-strongly about it, but my argument for doing it in the 1 row = 1 gene pair style is that if we were to augment the rows with additional info about the corresponding pair (e.g. "allelic" info like gene length) this would seem more natural.

Regarding query genes not found, I'm not completely sure how to deal with it in this context. I think @StevenCannon-USDA is referring to input tokens that do not match genes in the database, regardless of whether they belong to pangenes or correspond to genes in the target annotation via these pangenes. It almost seems like this would require the list to be validated separately, similar to how intermine list builder currently works. Could this be relegated to future enhancement land?

StevenCannon-USDA commented 4 months ago

> I lean towards what's currently in the spec as it prevents horizontal overflow when a particular input gene has many result genes.

Yes, that's fine by me.

> Regarding query genes not found, ... I think @StevenCannon-USDA is referring to input tokens that do not match genes in the database

Right. It will not be uncommon for someone to enter funky stuff -- say, with splice variants, or coming from two different annotations, or wrongly prefixed. It would be nice to at least report which items aren't in the indicated query set. But it could be "relegated to future enhancement land" if implementation is a can'o'worms.

alancleary commented 4 months ago

> I'm confused. Didn't we implement the ONE OF constraint specifically for this purpose?

Sam implemented the ONE OF constraint as a prototype. It never got merged (or scrutinized) because we didn't make it this far in the discussion.

> The GraphQL page you linked seems fairly similar if you're talking about this bit: <code>

That's not the bit I was referring to but it could be used to handle multiple input genes in a single request. What I'm really interested in is "While fetching nested linked objects, you can also apply a filter on them."

query {
  getGene(id: "...") {
    name
    panGeneSets {
      identifier,
      genes(filter: {
        genus: "...",
        species: "...",
        strain: "...",
        assembly: "...",
        annotation: "..."
      }) {
        identifier
      }
    }
  }
}

The implementation details of this would be to add assembly and annotation parameters to our existing gene search endpoint and to update the genes field of the PanGeneSet type to expose the necessary constraints.

> Regarding query genes not found... It almost seems like this would require the list to be validated separately, similar to how intermine list builder currently works. Could this be relegated to future enhancement land?

If we take the "one-request-per-input-gene" approach I'm advocating here then reporting which genes weren't found becomes trivial since the GraphQL server throws an error when a gene can't be found for the given identifier.
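
As a rough Python sketch of that idea (not the actual component code): one call per input gene, with failures collected into a "not found" list. `GeneNotFound`, `translate_genes`, and `fake_fetch` are hypothetical names, and the fetch function is a stand-in for the real GraphQL request. How the server's error actually surfaces to the client is an assumption here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class GeneNotFound(Exception):
    """Hypothetical error raised when the server reports an unknown gene."""


def translate_genes(
    identifiers: Iterable[str],
    fetch_pangenes: Callable[[str], List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Send one request per input gene; return (results, not_found)."""
    results: Dict[str, List[str]] = {}
    not_found: List[str] = []
    for identifier in identifiers:
        try:
            results[identifier] = fetch_pangenes(identifier)
        except GeneNotFound:
            # The failed request itself tells us which input was bad.
            not_found.append(identifier)
    return results, not_found


# Stand-in for the real GraphQL request, used only to exercise the loop.
def fake_fetch(identifier: str) -> List[str]:
    known = {"Q1": ["A"], "Q3": [], "Q4": ["C", "D", "E"]}
    if identifier not in known:
        raise GeneNotFound(identifier)
    return known[identifier]


results, not_found = translate_genes(["Q1", "bogus", "Q3", "Q4"], fake_fetch)
```

Note that a gene that exists but has no counterpart in the target annotation (`Q3` here) comes back with an empty list, which keeps it distinct from an input token that matches nothing in the database.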

maxglycine commented 4 months ago

Ok, I think I understand the commentary so far, and I have a few observations:

- For SoyBase, I think we want to allow use of the "secondary id", i.e., glyma.01g00100. In the specification, you have to enter the query genus, species, assembly, and strain, so why do you need to use the full yuck as a query when it can be constructed on the fly?
- For the output, I think the full yuck is grudgingly acceptable.
- The pangene_set id needs to be a link to the pangene report page.
- Depending on the level of difficulty, i.e., if it is relatively easy, the download capability should be engineered in the 1.0 version. Granted, people are not going to be using this tool to convert the names of hundreds of genes, but the need to copy-n-paste out of HTML tables should be avoided.

@StevenCannon-USDA @adf-ncgr @alancleary @jd-campbell

alancleary commented 4 months ago

> For the output, I think the full yuck is grudgingly acceptable. The pangene_set id needs to be a link to the pangene report page.

As with the other web components, linking is a post-graphql, pre-web component step, i.e. the links are site specific and inserted after the data is fetched, right before it's displayed. You can link things wherever you want!

> Depending on the level of difficulty, i.e., if it is relatively easy, the download capability should be engineered in the 1.0 version.

This is separate functionality that should be encapsulated in its own component or utility script. We can pursue it in tandem with this UI element but it needs its own spec.

> Granted, people are not going to be using this tool to convert the names of hundreds of genes...

I bet you $1 they do.

adf-ncgr commented 4 months ago

> If we take the "one-request-per-input-gene" approach I'm advocating here then reporting which genes weren't found becomes trivial since the GraphQL server throws an error when a gene can't be found for the given identifier.

I'm only concerned that one request per input gene won't be very performant, but could be convinced otherwise if you can prototype it (because you know you'll win that $1 bet with @maxglycine when I come up to the plate)

adf-ncgr commented 4 months ago

> For SoyBase, I think we want to allow use of the "secondary id", i.e., glyma.01g00100. In the specification, you have to enter the query genus, species, assembly, and strain, so why do you need to use the full yuck as a query when it can be constructed on the fly?

What you are specifying in the proposed UI is the target genus, species, assembly, and strain, not the query. It's possible that we should do something similar for the query, but the current spec would allow mixed-source inputs (which might not be all that useful, admittedly).

alancleary commented 4 months ago

> I'm only concerned that one request per input gene won't be very performant, but could be convinced otherwise if you can prototype it (because you know you'll win that $1 bet with @maxglycine when I come up to the plate)

OK. I'll see what I can come up with!

maxglycine commented 4 months ago

If we are concerned that users will try to "convert" names en masse from one assembly to another, I am OK with telling them there is a limit of, say, 50 per request, to make it too painful to execute. @alancleary @adf-ncgr @StevenCannon-USDA @jd-campbell

adf-ncgr commented 4 months ago

@maxglycine I think we ought to be able to handle 100s if not 1000s of genes whatever approach we decide to adopt here since this will (I hope) lay the foundation for other list-oriented services where imposing draconian limits might defeat the purpose (e.g. gene set enrichment analysis). Just my 2c though.
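
For what it's worth, a one-request-per-gene design does not have to mean strictly sequential requests. Here is a minimal Python sketch, assuming the per-gene lookup is an independent, thread-safe call; `fetch_all`, `fake_fetch`, and the `max_workers` default are hypothetical and not taken from any existing component.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def fetch_all(
    identifiers: Sequence[str],
    fetch_one: Callable[[str], List[str]],
    max_workers: int = 8,
) -> List[Tuple[str, List[str]]]:
    """Run one request per gene, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so rows stay aligned
        # with the list the user pasted in.
        return list(zip(identifiers, pool.map(fetch_one, identifiers)))


# Stand-in for a real request; it only fabricates a target name.
def fake_fetch(identifier: str) -> List[str]:
    return [identifier + ".target"]


pairs = fetch_all(["g%d" % i for i in range(1000)], fake_fetch)
```

Capping the number of in-flight requests bounds the load on the server without capping the length of the list a user can submit. Whether that is fast enough for 1000s of genes would still need to be measured against the real server.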

maxglycine commented 4 months ago

I agree with @adf-ncgr in that we need to preserve the linkage between the gene model name query and all of its related "pan" genes, whether that is multiple rows for each query gene model name, i.e.:

glyma.Wm82.gnm4.ann1.glyma.01g00100 PanGene1
glyma.Wm82.gnm4.ann1.glyma.01g00100 PanGene2
... ...

or

glyma.Wm82.gnm4.ann1.glyma.01g00100 PanGene1,PanGene2,...

maxglycine commented 4 months ago

I agree with @StevenCannon-USDA: if they want to sort, they can download the list and do it in Excel :0

sdash-github commented 4 months ago

Good Example from MaizeGDB

Ethy talked about it and demoed her creation for MGDB yesterday at the AgBioData meeting. It is at:
Pangene work at MGDB with an example

This can be a good model for us to keep in mind as we make progress and for future development at LIS+.
It is backed by real community feedback and use.
I liked these [query page](https://maizegdb.org/pan_gene_center/pan_gene) (advanced search) features:
- Simple search allows you to find pan-genes by locus symbol, gene model ID, transcript ID, or protein ID.
- Many contains/excludes/associated with gene models, loci, %annotations
- Text boxes (pre-filled) for annotations to include/exclude
- Links to external tools
- Brief description of this pangene analysis in maize
- Definitions (a great help for students, for removing ambiguity in usage, and for non-maize visitors)
I liked these result-page features (in the collapsible headings), which are very informative for putting the result in context:
- Pan-gene alignment
- The phylogenetic tree (view of the result pangene sets)
- 3rd party comparative viewers (with succinct, specific info about how to use them)
(Other features like download, function, proteins, SNPs, etc.: we are aware of them or are going to include them.)
My intention here is mainly for future development and not for the first prototype we will work on. I think @StevenCannon-USDA and @maxglycine are aware of this MGDB work but @adf-ncgr may not yet be.

I'm impressed because it looks like such a complete tool!

alancleary commented 4 months ago

Hi everyone,

I prototyped a way of doing pangene list queries per gene by adding arguments to the genes field of the PanGeneSet type in the GraphQL server. You can now get a list of pangenes for a specific gene by querying for that gene and filtering the genes of its pangene sets:

# query
query PangeneListQuery($identifier: ID!, $genus: String, $species: String, $strain: String, $assembly: String, $annotation: String) {
  gene(identifier: $identifier) {
    results {
      panGeneSets {
        genes(genus: $genus, species: $species, strain: $strain, assembly: $assembly, annotation: $annotation) {
          identifier
        }
      }
    }
  }
}

Here are the variables I used to verify the functionality. I'm sure someone here can come up with a more interesting pangene set to test this on:

{
  "identifier": "phavu.G19833.gnm1.ann1.Phvul.001G000200",
  "genus": "Phaseolus",
  "species": "vulgaris",
  "strain": "G19833",
  "assembly": "gnm1",
  "annotation": "ann1"
}

These changes have been pushed to the pangene-list branch in the graphql-server repo. Queries can be run by starting the server (npm start) and then going to http://localhost:4000/ in your web browser to use the Apollo Sandbox.
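
For anyone who would rather script this than use the Sandbox, here is a minimal Python sketch that builds the POST body for the query above and flattens a response. The helper names are hypothetical, and the response shape (`results` as a single gene object) is an assumption that should be checked against the server. The sketch does not send anything over the network.

```python
import json

# Query text as posted in the prototype comment above.
QUERY = """
query PangeneListQuery($identifier: ID!, $genus: String, $species: String, $strain: String, $assembly: String, $annotation: String) {
  gene(identifier: $identifier) {
    results {
      panGeneSets {
        genes(genus: $genus, species: $species, strain: $strain, assembly: $assembly, annotation: $annotation) {
          identifier
        }
      }
    }
  }
}
"""


def build_request_body(identifier: str, **target: str) -> str:
    """Serialize the query and its variables as a JSON POST body."""
    variables = {"identifier": identifier, **target}
    return json.dumps({"query": QUERY, "variables": variables})


def target_identifiers(response: dict) -> list:
    """Flatten a response into a list of target gene identifiers.

    ASSUMPTION: `results` is a single gene object holding a list of
    pangene sets; adjust if the server returns a different shape.
    """
    gene = response["data"]["gene"]["results"]
    return [g["identifier"] for s in gene["panGeneSets"] for g in s["genes"]]


body = build_request_body(
    "phavu.G19833.gnm1.ann1.Phvul.001G000200",
    genus="Phaseolus", species="vulgaris", strain="G19833",
    assembly="gnm1", annotation="ann1",
)
```

The resulting `body` string could be POSTed with a `Content-Type: application/json` header to the local server mentioned above.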

StevenCannon-USDA commented 3 months ago

Very nice initial implementation, @alancleary! (June 18). My feeling is that this would be quite acceptable as a first-pass solution, though putting a limiter on the number of queries would be important if performance is poor beyond some number.

Another feature that I think will be desired (albeit at the cost of some additional UI complexity and more development time) is to handle inputs that lack the full yuck, i.e. Glyma.01G002100 as an alternative to glyma.Wm82.gnm4.ann1.Glyma.01G002100. (Rex's earlier comment regarding "secondary id" is for this feature)

One way this could be accomplished is to provide two forms of this page -- one, taking unprefixed IDs, would have 10 specification fields -- five for the query and five for the target.

Alternatively, provide an optional sixth input field on this page, in which the user can provide a prefix string to be added to the query elements, e.g. glyma.Wm82.gnm4.ann1.

Of those two options, the second one looks better to me at the moment.

(I'll add: the reason that Glyma.01G002100 can't be used as-is without further specification is that it is found in three different annotations (glyma.Wm82.gnm2.ann1, glyma.Wm82.gnm4.ann1, glyma.Wm82.gnm5.ann1). In that case, the genes happen to correspond (they all belong to Glycine.pan5.pan27993); but there are cases where the unprefixed genes from different annotations are not the "same gene.")
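
A minimal Python sketch of the second option (the optional prefix field), assuming the prefix is simply prepended to any input that does not already start with it. `apply_prefix` is a hypothetical helper, not part of any existing component.

```python
from typing import Iterable, List


def apply_prefix(identifiers: Iterable[str], prefix: str) -> List[str]:
    """Prepend the user-supplied prefix to IDs that do not already carry it."""
    if not prefix:
        return list(identifiers)
    # Accept the prefix with or without its trailing dot.
    prefix = prefix if prefix.endswith(".") else prefix + "."
    return [
        gene_id if gene_id.startswith(prefix) else prefix + gene_id
        for gene_id in identifiers
    ]


# One unprefixed ID and one that already carries the full yuck.
full_ids = apply_prefix(
    ["Glyma.01G002100", "glyma.Wm82.gnm4.ann1.Glyma.01G002100"],
    "glyma.Wm82.gnm4.ann1",
)
```

Both inputs end up as `glyma.Wm82.gnm4.ann1.Glyma.01G002100`. An input that carries a different prefix (say, from `glyma.Wm82.gnm2.ann1`) would be double-prefixed by this sketch, so a real version would need to detect and report such cases.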

legumeinfo / website-ui-specs

Draft spec for pangeneset-based gene list translation UI #19