Open adf-ncgr opened 4 months ago
How important is it that the UI match what you've mocked up? It'll be easier (and stylistically consistent with our existing components) if the entire form, including the gene list, is above the results table.
Also, note that the GraphQL endpoint that was previously prototyped for this only returns genes, with no notion of which output genes correspond to which input genes. This can be remedied either by introducing new types into the GraphQL API or by requesting the output genes' pan gene sets and their genes and then doing some post-processing. Neither is ideal. An alternative is to revise the GraphQL endpoint so that it gets pan genes for a single input gene. The UI element would then send multiple requests to this endpoint when the form is submitted, one for each gene in the input list.
Perhaps this belongs under "future enhancements"; do you want the table to be sortable by one or more columns?
It's not important to match the UI layout, that just seemed compact to me.
Having the actual correspondences between input genes and results is critical. The intermine queries I had specified did this, will have to double-check if they can be adapted to work with the "ONE OF" constraint instead of "IN LIST".
Not sure what your third paragraph refers to- seems like your thought got truncated there?
Sortable tables are always appreciated, but I think it's OK to leave that for a future enhancement.
I second all parts of the response by @adf-ncgr. Regarding sorting: I'd say it's not even an enhancement in this case. My vote is to plan for "never sort."
The main requirements are:
Caveat regarding table* for the output: other structures might be acceptable, but a table is what will be most familiar to users. The additional wrinkle is representing one-to-many relationships, e.g.
Q1 pan9 A
Q2 pan8 B
Q3 pan1 none
Q4 pan2 C, D, E
Also, we'll need to be able to report an error message for query genes not found
Having the actual correspondences between input genes and results is critical. The intermine queries I had specified did this, will have to double-check if they can be adapted to work with the "ONE OF" constraint instead of "IN LIST".
Right. My point is that just because we can craft an intermine query that will give us all the results in one request doesn't mean it translates well into GraphQL. The most canonical way to handle this in GraphQL is via filtering. This would make our API more expressive, rather than more esoteric.
The additional wrinkle is representing one-to-many relationships, e.g.
Q1 pan9 A
Q2 pan8 B
Q3 pan1 none
Q4 pan2 C, D, E
@adf-ncgr's example in the UI spec handles the one-to-many relationship by adding multiple rows for an input gene if it has multiple output genes, one row for each output gene. Your example here adds a single row and puts all of the output genes into the third column. Which way should this be handled? I lean towards what's currently in the spec as it prevents horizontal overflow when a particular input gene has many result genes.
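To make the two layouts concrete, here is a minimal sketch of both. The `Translation` and `TableRow` shapes and the function names are hypothetical and only illustrate the trade-off; they are not taken from the spec or the existing components.

```typescript
// Hypothetical shapes; the real component's types may differ.
interface Translation {
  input: string;      // gene identifier the user entered
  panGeneSet: string; // pan gene set linking the input to the outputs
  outputs: string[];  // matching genes in the target annotation
}

interface TableRow {
  input: string;
  panGeneSet: string;
  output: string; // "none" when nothing matched
}

// One row per (input gene, output gene) pair, as in the current spec.
function toPairRows(translations: Translation[]): TableRow[] {
  const rows: TableRow[] = [];
  for (const t of translations) {
    if (t.outputs.length === 0) {
      rows.push({ input: t.input, panGeneSet: t.panGeneSet, output: "none" });
      continue;
    }
    for (const output of t.outputs) {
      rows.push({ input: t.input, panGeneSet: t.panGeneSet, output });
    }
  }
  return rows;
}

// One row per input gene, with all outputs joined into the third column.
function toGroupedRows(translations: Translation[]): TableRow[] {
  return translations.map((t) => ({
    input: t.input,
    panGeneSet: t.panGeneSet,
    output: t.outputs.length > 0 ? t.outputs.join(", ") : "none",
  }));
}

// Sample data loosely based on the Q1..Q4 example in this thread.
const exampleTranslations: Translation[] = [
  { input: "Q1", panGeneSet: "pan9", outputs: ["A"] },
  { input: "Q3", panGeneSet: "pan1", outputs: [] },
  { input: "Q4", panGeneSet: "pan2", outputs: ["C", "D", "E"] },
];
```

With the pair layout, Q4 yields three rows; with the grouped layout it yields one row whose third column grows with the number of matches.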
Also, we'll need to be able to report an error message for query genes not found
The generic search component that this web component will be built on already supports this, although this particular error should be mentioned in the spec as it's specific to this component.
Not sure what your third paragraph refers to- seems like your thought got truncated there?
Sorry about that; remnants of a discarded thought. I edited it out.
Right. My point is that just because we can craft an intermine query that will give us all the results in one request doesn't mean it translates well into GraphQL. The most canonical way to handle this in GraphQL is via filtering. This would make our API more expressive, rather than more esoteric.
I'm confused. Didn't we implement the ONE OF constraint specifically for this purpose? The GraphQL page you linked seems fairly similar if you're talking about this bit:
query {
  queryPost(filter: {
    id: ["0x1", "0x2", "0x3", "0x4"],
  }) {
    id
    title
    text
    datePublished
  }
}
Regarding how to handle 1-to-many, I don't feel super-strongly about it, but my argument for doing it in the 1 row = 1 gene pair style is that if we were to augment the rows with additional info about the corresponding pair (e.g. "allelic" info like gene length) this would seem more natural.
Regarding query genes not found, I'm not completely sure how to deal with it in this context. I think @StevenCannon-USDA is referring to input tokens that do not match genes in the database, regardless of whether they belong to pangenes or correspond to genes in the target annotation via these pangenes. It almost seems like this would require the list to be validated separately, similar to how intermine list builder currently works. Could this be relegated to future enhancement land?
I lean towards what's currently in the spec as it prevents horizontal overflow when a particular input gene has many result genes.
Yes, that's fine by me.
Regarding query genes not found, ... I think @StevenCannon-USDA is referring to input tokens that do not match genes in the database
Right. It will not be uncommon for someone to enter funky stuff -- say, with splice variants, or coming from two different annotations, or wrongly prefixed. It would be nice to at least report which items aren't in the indicated query set. But it could be "relegated to future enhancement land" if implementation is a can'o'worms.
I'm confused. Didn't we implement the ONE OF constraint specifically for this purpose?
Sam implemented the ONE OF constraint as a prototype. It never got merged (or scrutinized) because we didn't make it this far in the discussion.
The GraphQL page you linked seems fairly similar if you're talking about this bit: <code>
That's not the bit I was referring to but it could be used to handle multiple input genes in a single request. What I'm really interested in is "While fetching nested linked objects, you can also apply a filter on them."
query {
  getGene(id: "...") {
    name
    panGeneSets {
      identifier,
      genes(filter: {
        genus: "...",
        species: "...",
        strain: "...",
        assembly: "...",
        annotation: "..."
      }) {
        identifier
      }
    }
  }
}
The implementation would add `assembly` and `annotation` parameters to our existing gene search endpoint and update the `genes` field of the `PanGeneSet` type to expose the necessary constraints.
Regarding query genes not found... It almost seems like this would require the list to be validated separately, similar to how intermine list builder currently works. Could this be relegated to future enhancement land?
If we take the "one-request-per-input-gene" approach I'm advocating here then reporting which genes weren't found becomes trivial since the GraphQL server throws an error when a gene can't be found for the given identifier.
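As a rough sketch of that idea, the snippet below sends one lookup per input gene and turns each rejected lookup into a "not found" entry. The `fetchOne` function is injected and hypothetical; it is assumed to reject when the server reports an error for an identifier. The `Outcome` type and function names are also illustrative only.

```typescript
// Result of looking up one input gene.
type Outcome =
  | { input: string; found: true; genes: string[] }
  | { input: string; found: false; reason: string };

// Hypothetical fetcher: resolves to the matching gene identifiers, or
// rejects when the server returns an error for the given identifier.
type FetchPanGenes = (identifier: string) => Promise<string[]>;

// Send one request per input gene. A rejected request becomes a
// "not found" outcome instead of failing the whole list.
async function translateEach(
  inputs: string[],
  fetchOne: FetchPanGenes,
): Promise<Outcome[]> {
  return Promise.all(
    inputs.map((input) =>
      fetchOne(input).then(
        (genes): Outcome => ({ input, found: true, genes }),
        (err): Outcome => ({ input, found: false, reason: String(err) }),
      ),
    ),
  );
}

// Split outcomes into translated genes and identifiers to report to the user.
function summarize(outcomes: Outcome[]): {
  translated: Map<string, string[]>;
  notFound: string[];
} {
  const translated = new Map<string, string[]>();
  const notFound: string[] = [];
  for (const o of outcomes) {
    if (o.found) {
      translated.set(o.input, o.genes);
    } else {
      notFound.push(o.input);
    }
  }
  return { translated, notFound };
}
```

Note that this sketch treats every rejection as "not found", so a network failure would be reported the same way as an unknown identifier; a real implementation would probably want to distinguish the two.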
OK, I think I understand the commentary so far, and I have a few observations. For SoyBase, I think we want to allow use of the "secondary id", i.e. glyma.01g00100. In the specification, you have to enter the query genus, species, assembly, and strain, so why do you need to use the full yuck as a query when it can be constructed on the fly? For the output, I think the full yuck is grudgingly acceptable. The pangene_set id needs to be a link to the pangene report page. Depending on the level of difficulty, i.e. if it is relatively easy, the download capability should be engineered into the 1.0 version. Granted, people are not going to be using this tool to convert the names of hundreds of genes, but the need to copy-n-paste out of HTML tables should be avoided.
@StevenCannon-USDA @adf-ncgr @alancleary @jd-campbell
For the output, I think the full yuck is grudgingly acceptable. The pangene_set id needs to be a link to the pangene report page.
As with the other web components, linking is a post-graphql, pre-web component step, i.e. the links are site specific and inserted after the data is fetched right before it's displayed. You link things wherever you want!
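A minimal sketch of such a linking step, assuming rows have already been fetched: the site supplies a function that maps a pan gene set identifier to a URL, and the rows are decorated before being handed to the component. All type names, the function name, and the example URL pattern are made up for illustration and do not reflect the actual components.

```typescript
// A table cell after the linking step: display text plus an optional URL.
interface Cell {
  text: string;
  href?: string;
}

// Site-specific function mapping a pan gene set identifier to a URL, or
// undefined when the site has no page for it.
type Linker = (panGeneSet: string) => string | undefined;

interface FetchedRow {
  input: string;
  panGeneSet: string;
  output: string;
}

interface LinkedRow {
  input: string;
  panGeneSet: Cell;
  output: string;
}

// Runs after the data is fetched and before it is displayed.
function addPanGeneSetLinks(rows: FetchedRow[], linkFor: Linker): LinkedRow[] {
  return rows.map((row) => {
    const href = linkFor(row.panGeneSet);
    const cell: Cell =
      href === undefined
        ? { text: row.panGeneSet }
        : { text: row.panGeneSet, href };
    return { input: row.input, panGeneSet: cell, output: row.output };
  });
}

// Example linker with a made-up URL pattern; each site would supply its own.
const exampleLinker: Linker = (id) =>
  `https://example.org/pangene/${encodeURIComponent(id)}`;
```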
Depending on the level of difficulty, ie If it is relatively easy, the download capability should be engineered in the 1.0 version.
This is separate functionality that should be encapsulated in its own component or utility script. We can pursue it in tandem with this UI element but it needs its own spec.
Granted people are not going to be using this tool to convert the names of hundreds of genes...
I bet you $1 they do.
If we take the "one-request-per-input-gene" approach I'm advocating here then reporting which genes weren't found becomes trivial since the GraphQL server throws an error when a gene can't be found for the given identifier.
I'm only concerned that one request per input gene won't be very performant, but could be convinced otherwise if you can prototype it (because you know you'll win that $1 bet with @maxglycine when I come up to the plate)
For SoyBase, I think we want to allow use of the "secondary id" ie, glyma.01g00100. In the specification, you have to enter the query genus, species, assembly, and strain, so why do you need to use the full yuck as a query as it can be constructed on the fly?
What you are specifying in the proposed UI is the target genus, species, assembly, and strain, not the query. It's possible that we should do similar for the query, but the current spec would allow mixed source inputs (which might not be all that useful, admittedly).
I'm only concerned that one request per input gene won't be very performant, but could be convinced otherwise if you can prototype it (because you know you'll win that $1 bet with @maxglycine when I come up to the plate)
OK. I'll see what I can come up with!
If we are concerned that users will try to "convert" names en masse from one assembly to another, I am OK with telling them there is a limit of say 50 per request to make it too painful to execute. @alancleary @adf-ncgr @StevenCannon-USDA @jd-campbell
@maxglycine I think we ought to be able to handle 100s if not 1000s of genes whatever approach we decide to adopt here since this will (I hope) lay the foundation for other list-oriented services where imposing draconian limits might defeat the purpose (e.g. gene set enrichment analysis). Just my 2c though.
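One way to reconcile a per-gene request approach with lists of hundreds or thousands of genes is to cap how many requests are in flight at once rather than capping the list size. The sketch below is only an illustration of that idea; the function names and the batch-at-a-time strategy are assumptions, not something agreed on in this thread.

```typescript
// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Run `task` for every item with at most `batchSize` requests in flight.
// Each batch finishes before the next one starts; results keep input order.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, batchSize)) {
    results.push(...(await Promise.all(batch.map((item) => task(item)))));
  }
  return results;
}
```

For example, `runInBatches(geneIds, 50, lookUpGene)` (with a hypothetical `lookUpGene` function) would process a 1,000-gene list as twenty sequential batches of 50 concurrent requests.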
I agree with @adf-ncgr in that we need to preserve the linkage between the gene model name query and all of its related "pan" genes. Whether that is multiple rows for each query gene model name ie:
glyma.Wm82.gnm4.ann1.glyma.01g00100
or
glyma.Wm82.gnm4.ann1.glyma.01g00100
I agree with @StevenCannon-USDA: if they want to sort, they can download the list and do it in Excel :0
Ethy talked about it and demoed her creation for MGDB yesterday at the AgBioData meeting. It is at:
Panagene work at MGDB with an example
I'm impressed because it looks like such a complete tool!
Hi everyone,
I prototyped a way of doing pangene list queries per gene by adding arguments to the `genes` field of the `PanGeneSet` type in the GraphQL server. You can now get a list of pangenes for a specific gene by querying for that gene and filtering the genes of its pangene sets:
# query
query PangeneListQuery($identifier: ID!, $genus: String, $species: String, $strain: String, $assembly: String, $annotation: String) {
  gene(identifier: $identifier) {
    results {
      panGeneSets {
        genes(genus: $genus, species: $species, strain: $strain, assembly: $assembly, annotation: $annotation) {
          identifier
        }
      }
    }
  }
}
Here are the variables I used to verify the functionality. I'm sure someone here can come up with a more interesting pangene set to test this on:
{
  "identifier": "phavu.G19833.gnm1.ann1.Phvul.001G000200",
  "genus": "Phaseolus",
  "species": "vulgaris",
  "strain": "G19833",
  "assembly": "gnm1",
  "annotation": "ann1"
}
These changes have been pushed to the `pangene-list` branch in the graphql-server repo. Queries can be run by starting the server (`npm start`) and then going to http://localhost:4000/ in your web browser to use the Apollo Sandbox.
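For anyone who would rather call the prototype from code than from the Sandbox, here is a hedged sketch of how a client might build the request and read the response. It assumes the server accepts a standard GraphQL-over-HTTP POST with a JSON body containing `query` and `variables`, and that the response mirrors the query's selection set; neither was verified against the `pangene-list` branch. The interface and function names are hypothetical.

```typescript
// Variables accepted by the PangeneListQuery shown in this comment.
interface PangeneListVariables {
  identifier: string;
  genus?: string;
  species?: string;
  strain?: string;
  assembly?: string;
  annotation?: string;
}

const PANGENE_LIST_QUERY = `
  query PangeneListQuery($identifier: ID!, $genus: String, $species: String, $strain: String, $assembly: String, $annotation: String) {
    gene(identifier: $identifier) {
      results {
        panGeneSets {
          genes(genus: $genus, species: $species, strain: $strain, assembly: $assembly, annotation: $annotation) {
            identifier
          }
        }
      }
    }
  }
`;

// Build the JSON body of a GraphQL-over-HTTP POST request.
function buildRequestBody(variables: PangeneListVariables): string {
  return JSON.stringify({ query: PANGENE_LIST_QUERY, variables });
}

// Response shape implied by the selection set (assumed, not verified).
interface PangeneListResponse {
  data?: {
    gene?: {
      results?: {
        panGeneSets: { genes: { identifier: string }[] }[];
      } | null;
    } | null;
  };
}

// Collect the identifiers of all matching genes across the pan gene sets.
function extractIdentifiers(response: PangeneListResponse): string[] {
  const sets = response.data?.gene?.results?.panGeneSets ?? [];
  const identifiers: string[] = [];
  for (const set of sets) {
    for (const gene of set.genes) {
      identifiers.push(gene.identifier);
    }
  }
  return identifiers;
}
```

The body from `buildRequestBody` would be POSTed to the server's endpoint (http://localhost:4000/ when running locally) with a `Content-Type: application/json` header, and the parsed JSON reply passed to `extractIdentifiers`.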
Very nice initial implementation, @alancleary! (June 18). My feeling is that this would be quite acceptable as a first-pass solution, though putting a limiter on the number of queries would be important if performance is poor beyond some number.
Another feature that I think will be desired (albeit at the cost of some additional UI complexity and more development time) is to handle inputs that lack the full yuck, e.g. `Glyma.01G002100` as an alternative to `glyma.Wm82.gnm4.ann1.Glyma.01G002100`. (Rex's earlier comment regarding "secondary id" is for this feature.)
One way this could be accomplished is to provide two forms of this page -- one, taking unprefixed IDs, would have 10 specification fields -- five for the query and five for the target.
Alternatively, provide an optional sixth input field on this page, in which the user can provide a prefix string to be added to the query elements, e.g. glyma.Wm82.gnm4.ann1.
Of those two options, the second one looks better to me at the moment.
(I'll add: the reason that `Glyma.01G002100` can't be used as-is without further specification is that it is found in three different annotations (glyma.Wm82.gnm2.ann1, glyma.Wm82.gnm4.ann1, glyma.Wm82.gnm5.ann1). In that case, the genes happen to correspond (they all belong to Glycine.pan5.pan27993); but there are cases where the unprefixed genes from different annotations are not the "same gene.")
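The optional prefix field could be applied to the input list along these lines. This is a sketch only: the pattern used to recognize an already-prefixed identifier is a heuristic inferred from the examples in this thread (five lowercase letters, a strain, `gnm<N>`, `ann<N>`), not an official grammar, and the function names are hypothetical.

```typescript
// Heuristic for a fully qualified identifier such as
// glyma.Wm82.gnm4.ann1.Glyma.01G002100. Assumed from examples in this
// thread; not an official grammar.
const FULL_ID_PATTERN = /^[a-z]{5}\.[^.]+\.gnm\d+\.ann\d+\../;

// Prepend the user-supplied prefix unless the token already looks fully
// qualified or no prefix was given.
function applyPrefix(token: string, prefix: string): string {
  const id = token.trim();
  const cleanPrefix = prefix.trim();
  if (id === "" || cleanPrefix === "" || FULL_ID_PATTERN.test(id)) {
    return id;
  }
  // Accept the prefix with or without its trailing dot.
  const normalized = cleanPrefix.endsWith(".") ? cleanPrefix : cleanPrefix + ".";
  return normalized + id;
}

// Apply the prefix to every token in the input list, dropping blank entries.
function applyPrefixToList(tokens: string[], prefix: string): string[] {
  return tokens.map((t) => applyPrefix(t, prefix)).filter((id) => id !== "");
}
```

Leaving already-prefixed tokens untouched keeps mixed-source input lists possible, at the cost of relying on the pattern heuristic above.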
Initial thoughts here: https://github.com/legumeinfo/website-ui-specs/tree/main/pangeneset-based-gene-id-translation. Feel free to propose changes in this issue or signify your consent (silence by the end of the week will imply consent).
Minor note, I tried using @sdash-github's plantuml for UI mockup (as described here: https://plantuml.com/salt); not sure if we'll decide to adopt this for more complex cases, but it seemed worth a try. If nothing else, it reminded me what a PITA specifying UI layout through use of nested tables can be (one of the reasons I never became a web developer, I think, though I know that's no longer the way...)