Implement summaryPath request in search-genes to support pagination metadata

sammyjava commented 1 year ago

I gave this the good ol' college try, but I just can't wrap my brain around the various ins and outs of our Typescript design. I know what to do from an HTTP point of view, but I just can't get my brain to implement it in the existing framework. So here's what I think needs to be done, and I'll assign it to you and you can bump it to someone else that has these skillz if there is someone.

Right now, in search-genes.ts, we're returning response2genes(response). That is an array of results in a JSON of the form:

{
  "data": {
    "genes": [
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00619", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00663", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00954", ...
      }
    ]
  }
}

I think what we need to do is to run a summaryPath request on that path query first, which is the same request except the option summaryPath=Gene.id is added, i.e. use const options = { "summaryPath": "Gene.id" } rather than const options = {start, size} in something similar to pathQuery. Call it summaryPathQuery, say. That request returns a result like this:

{
  "uniqueValues":91,
  "modelName":"genomic",
  "columnHeaders":["item","count"],
  "rootClass":"Gene",
  "start":0,
  "views":["Gene.id","Gene.primaryIdentifier","Gene.description","Gene.symbol","Gene.name","Gene.assemblyVersion","Gene.annotationVersion","Gene.length","Gene.briefDescription","Gene.organism.taxonId","Gene.strain.identifier"],
  "results":[
    [3482044,9820071,6139382.087912087912,3072762.2477,20,1,53],
    [3482044,9820071,6139382.087912087912,3072762.2477,20,20,38]
  ],
  "executionTime":"2023.06.06 09:12::21",
  "wasSuccessful":true,
  "error":null,
  "statusCode":200
}

Once we have that back, we run the normal pathQuery request and add the above result.uniqueValues to the "data" of response2genes, thereby returning a JSON like this:

{
  "data": {
    "uniqueValues": 91,
    "genes": [
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00619", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00663", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00954", ...
      }
    ]
  }
}

From then on the web component deals with it (and this doesn't break the current version).

It'd be nice to make this a general thing on all search queries, but the request has Gene.id in it so it's specific to a gene search. But it's a minor add-on once you set it up, presumably in gene.ts and search-genes.ts plus the general summaryPathQuery.

It seems like a pretty well defined task. For someone that understands this framework more than I do. :)

alancleary commented 1 year ago

Thanks for providing the implementation details @sammyjava!

Another thing we need to work out is how to get the pagination info to the user. It looks like there's a couple ways people do this: 1) they have top-level queries return a wrapper type that contains fields for pagination info and a results field that contains the query results or 2) they modify the response object itself to contain an additional field with pagination info, kind of like how Apollo server will add an errors field to the response if there ~when~ were any errors when resolving the query.

Apparently 1 is the "best practice" and would be easier to implement, and it accommodates additional data that we may want to send to the user in the future, but for consistency we would want to update all of the top-level queries to use this wrapper type, which seems a bit invasive. Alternatively, 2 is nice because it separates metadata about the results from the result data itself. However, it also separates these metadata from the GraphQL workflow which could make it a bit contrived to implement and work with.

So I guess I would opt for option 1, even though it adds a layer of ~misdirection~ indirection to every query. Thoughts?

adf-ncgr commented 1 year ago

I think I more or less understand the alternatives you describe; I think I also incline to option 1 but don't have strong feelings.

adf-ncgr commented 1 year ago

PS. when I was a kid we called it "indirection"; sounds like younger generations have become more mistrustful of it...

alancleary commented 1 year ago

PS. when I was a kid we called it "indirection"; sounds like younger generations have become more mistrustful of it...

"Indirection" is the word I meant to use. I guess that's what I get for not proof reading my response. That and a "when" instead of a "were" :shrug:

sammyjava commented 1 year ago

I thought we'd just do it all with a little math in the web component. We know the start value, from that we know the current page from start % size + 1, and we know how many pages total: roundUp(uniqueValues % size). If it's better to do that on the graphql side (it probably is), I'd just add that stuff to the data JSON along with uniqueValues, say currentPage, totalPages. But I'm a simple man.

{
  "data": {
    "uniqueValues": 91,
    "start": 20,
    "size": 20,
    "currentPage": 2
    "totalPages": 5,
    "genes": [
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00619", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00663", ...
      },
      {
        "identifier": "cicar.ICC4958.gnm2.ann1.Ca_00954", ...
      },
      ...
    ]
  }
}

alancleary commented 1 year ago

@sammyjava You've basically described option 2 but pushed the pagination info down into the data object of the response. The data object contains the specific data the users asked for in their query so this would require adding these fields to the API, which I guess is OK because it gives the user control over whether or not they receive these metadata.

However, this only works if your query contains 1 top-level field. If the query has multiple top-level fields then it becomes ambiguous what fields these metadata are for. I guess we could get around this by adding a pagination field to data that itself has a field for each paginated type in the query, but since the user has control over what fields their asking for, they could request metadata for fields not in the query. It would also be awkward mapping between the fields in the pagination object and the data itself if there's multiple top-level fields and metadata types at play.

I guess I've convinced myself that option 1 is the way to go. It seems like adding a little indirection to all of our top-level queries is a small price to pay compared to the alternative.

Regarding what pagination info to actually send, I think including just the number of results should be sufficient since we aren't paginating by default, i.e. if the results are actually paginated then the user should know what page and size they requested.

sammyjava commented 1 year ago

Sure, whatever works. Note, though, that InterMine pathQuery ONLY has a single top-level object, the rootClass. But that's InterMine, not what graphql-server does with it. But since we're querying the mine for uniqueValues, there WILL always be a single rootClass id sent in the summaryPathQuery. But I agree that what the component asks for may have several different uniqueValues in some as-yet-to-be-imagined scenario.

alancleary commented 1 year ago

Thanks for pointing that out. I don't think it should be an issue since the Apollo server makes a different PathQuery for each top-level field.

As for multi-value scenarios, I'm sure one will rear its head eventually, especially if we don't anticipate it in the code :laughing:

sammyjava commented 1 year ago

I would be tempted to legislate that we only build components that have a single root class query. I'm not sure the multi-root-class scenario is worth supporting. It will certainly make the code more complicated. After all, all the templates on the mines are single root-class and they're considered to be useful! I feel like this stuff has gotten rather complex for undergrads cough to work with already. I know I have a lot of trouble with it. But it's your call for sure, this is just my 2 cents. Right now I feel like @alancleary and @ctcncgr are the only people at LIS with the chops to work with this code base, with the exception of me being able to do cut-and-replace level coding. I hit a wall trying to do anything that's actually new.

alancleary commented 1 year ago

I think we have a misunderstanding. I agree that we shouldn't pursue the multi-root-class scenario. The multi-value scenario I said we'll probably encounter is different; it's the client requesting multiple top-level fields in a single GraphQL query. Currently the Apollo server will create a separate PathQuery for each field (and thus will need at most one root class), and I think we should leave it this way. The thing to anticipate is implementing how we include metadata in the response so that it will work with multiple top-level fields, i.e. include the same metadata for multiple top-level fields in a single query.

Regarding the complexity of the implementation, I've pretty much just followed the Apollo server documentation. The one exception being how the InterMine data source uses factories and mixins. There's 2 motivations for that design: 1) so the data source implementation isn't a single class that's hundreds, if not thousands, of lines of code and 2) so we can instantiate the same class for multiple mines, which we've encountered in issue #18.

sammyjava commented 1 year ago

Fair enough, the main problem is that I've never been comfortable with client-side coding and definitely haven't bothered to learn Typescript. I expect the youngsters on board this summer will do fine.

But I still don't quite follow what you're saying. Every search resolver has a root class for the search query. All the dependent objects get resolved (or don't) but that doesn't change the number of results. There is no case in graphql-server where we search on two root objects and then merge them together in some way.

The uniqueValues query would only implemented in search-foo.ts. So I'm just not grokking the multi-value scenario. But my brain is oriented toward how InterMine searches for things with path query. So I'll just watch from the stands as you code this up.

sammyjava commented 1 year ago

I'd like to move forward on this, or have it moved forward. Will you have time in the near-ish future to do this, @alancleary or should I give it one more college try? I'd really like to get a production gene search up, just to show that we can actually get things done if for no other reason.

alancleary commented 1 year ago

I'm already working on it.

alancleary commented 1 year ago

@sammyjava @adf-ncgr @ctcncgr. I've implemented option 1 in the Query schema of the pagination-metadata branch. Apparently GraphQL doesn't actually support generic types so all the types use the same basic structure but it's hand-written... The PageInfo type was just an example from here. We can update it to only contain numResults, as we've discussed, but I wanted your feedback on the updated schema structure before proceeding.

Note that I used Results (plural) because there's already types in the schema that have the Result suffix. Also, I accidentally added a pageInfo field to the locationLinkouts results type because it has a start parameter like all of the paginated fields. It might make sense to change the paginated parameters to not overlap with this. This could be an opportunity to make the parameter names more verbose, like page and pageSize, instead of start and size.

sammyjava commented 1 year ago

Note that I used Results (plural) because there's already types in the schema that have the Result suffix. Also, I accidentally added a pageInfo field to the locationLinkouts results type because it has a start parameter like all of the paginated fields. It might make sense to change the paginated parameters to not overlap with this. This could be an opportunity to make the parameter names more verbose, like page and pageSize, instead of start and size.

I'd be for that, and can do it if you like. The PageInfo type looks fine to me, gets the job done.

alancleary commented 1 year ago

I'd be for that, and can do it if you like. The PageInfo type looks fine to me, gets the job done.

That's good because I've already been pushing forward with it! Don't worry about converting the parameters. I can take care of it since I'll be touching that code anyway.

adf-ncgr commented 1 year ago

all seems fine to me- we are going to remove the accidentally added PageInfo from locationLinkoutsResults though, right?

alancleary commented 1 year ago

all seems fine to me- we are going to remove the accidentally added PageInfo from locationLinkoutsResults though, right?

Yeah. That commit doesn't include an implementation to handle pageInfo fields; I'll remove the erroneous field when I add the implementation.

alancleary commented 1 year ago

The pagination-metadata branch now supports gene search requests that include pagination information. Have a look at the code and/or try it out using the Apollo Explorer. If all seems well, I can proceed with implementing the functionality on all of the InterMine API functions (not all API functions support pagination but I'm using an extensible metadata type so they'll all be endowed with the potential the return arbitrary metadata from the InterMine data source).

alancleary commented 1 year ago

Since I didn't get any feedback on this I've been moving ahead with implementing the same metadata functionality on all the InterMine data source API functions. Hopefully I'll have it wrapped up sometime tomorrow.

sammyjava commented 1 year ago

Yeah, sorry, I saw it just as I was leaving NCGR yesterday. (I'm 50%, usually mornings-ish.) Carry on!

sammyjava commented 1 year ago

Looks good to me. However, I see that "start" now means starting page, not starting record as it does in InterMine, so DEFINITELY rename that field to avoid confusion, as you suggested earlier. (Let me know if you'd like me to do that.)

alancleary commented 1 year ago

@sammyjava While going through and verifying that all subfield resolvers have been updated to use the Intermine data source's new metadata API syntax, I found that there's inconsistencies in what's considered annotatable versus what's implemented. In the Query.graphql file, there's an "Annotatables" block with several types listed. However, not all of these types are include in the GraphQLAnnotatable union type used by the Intermine data source. Furthermore, only a subset of the GraphQL types in the union and the aforementioned block actually implement the Annotatable interface in their GraphQL type, and only a subset of those have implemented subfield resolvers for the subfields defined in the interface.

In short, I'm wondering what should actually be considered annotatable and have implementations of resolvers for the annotatable subfields?

sammyjava commented 1 year ago

Let me know if you want me to enforce consistency. Here is a list of classes that extend Annotatable (in mine version 5.1.0.3):

<class name="ExpressionSource" extends="Annotatable" is-interface="true" term="">
<class name="ProteinDomain" extends="Annotatable" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000417">
<class name="ExpressionSample" extends="Annotatable" is-interface="true" term="">
<class name="BioEntity" extends="Annotatable" is-interface="true" term="">
<class name="GWAS" extends="Annotatable" is-interface="true" term="">
<class name="PanGeneSet" extends="Annotatable" is-interface="true">
<class name="GeneFamily" extends="Annotatable" is-interface="true" term="">
<class name="QTLStudy" extends="Annotatable" is-interface="true" term="">
<class name="GenotypingPlatform" extends="Annotatable" is-interface="true" term="">
<class name="GeneticMap" extends="Annotatable" is-interface="true" term="http://purl.bioontology.org/ontology/EDAM?conceptid=http%3A%2F%2Fedamontology.org%2Fdata_1278">
<class name="Pathway" extends="Annotatable" is-interface="true" term="">
<class name="LinkageGroup" extends="Annotatable" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000018">
<class name="QTL" extends="Annotatable" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0001645">
<class name="Phylotree" extends="Annotatable" is-interface="true" term="">
<class name="Trait" extends="Annotatable" is-interface="true" term="https://browser.planteome.org/amigo/term/TO:0000387">
<class name="GWASResult" extends="Annotatable" is-interface="true" term="">

Here are those that extend BioEntity, which extends Annotatable:

<class name="Protein" extends="BioEntity" is-interface="true" term="http://semanticscience.org/resource/SIO_010043,http://purl.uniprot.org/core/Protein">
<class name="SequenceFeature" extends="BioEntity" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000110,http://purl.obolibrary.org/obo/SO:0000110">
<class name="ProteinMatch" extends="BioEntity" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000349">
<class name="SequenceCollection" extends="BioEntity" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0001260,http://purl.obolibrary.org/obo/SO:0001260">

and here are those that extend SequenceFeature which extends BioEntity which extends Annotatable:

<class name="Intron" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000188,http://purl.obolibrary.org/obo/SO_0000188">
<class name="SyntenicRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0005858">
<class name="TransposableElementInsertionSite" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000368"></class>
<class name="GoldenPathFragment" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000468"></class>
<class name="RepeatRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000657"></class>
<class name="IntergenicRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000605,http://purl.obolibrary.org/obo/SO:0000605">
<class name="Pseudogene" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000336"></class>
<class name="MobileGeneticElement" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0001037"></class>
<class name="UTR" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000203,http://purl.obolibrary.org/obo/SO_0000203">
<class name="Gene" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000704,http://purl.obolibrary.org/obo/SO:0000704">
<class name="GeneFlankingRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000239">
<class name="Oligo" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000696"></class>
<class name="ChromosomeBand" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000341"></class>
<class name="CDS" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000316,http://purl.obolibrary.org/obo/SO:0000316">
<class name="Region" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:SO:0000001">
<class name="Supercontig" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000148"></class>
<class name="Exon" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000147,http://purl.obolibrary.org/obo/SO_0000147">
<class name="OverlappingESTSet" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0001262">
<class name="PCRProduct" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000006"></class>
<class name="CDSRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000851"></class>
<class name="PointMutation" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:1000008"></class>
<class name="Transcript" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0000673,http://purl.obolibrary.org/obo/SO:0000673">
<class name="Chromosome" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000340,http://purl.obolibrary.org/obo/SO_0000340"></class>
<class name="RegulatoryRegion" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0005836">
<class name="CDNAClone" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000317"></class>
<class name="BindingSite" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO:0000409"></class>
<class name="GeneticMarker" extends="SequenceFeature" is-interface="true" term="http://purl.obolibrary.org/obo/SO_0001645">

Nothing else extends Annotatable.

alancleary commented 1 year ago

Perfect. Thank you.

sammyjava commented 1 year ago

And I'd say anything that extends Annotatable in the mine should implement Annotatable in GraphQL, meaning it resolves

primaryIdentifier
ontologyAnnotations
publications

Anything that extends BioEntity should implement the above resolvers PLUS the following subset of BioEntity fields

description
symbol
name
assemblyVersion
annotationVersion
secondaryIdentifier
organism
strain
locations
dataSets
locatedFeatures

Finally, anything that extends SequenceFeature should implement the above resolvers PLU the following subset of SequenceFeature fields:

length
sequenceOntologyTerm
supercontigLocation
chromosomeLocation
supercontig
chromosome

There are others (e.g. overlappingFeatures) we could decide to implement, but I don't think we need to do that at this point, I'd wait until there's an actual need.

alancleary commented 1 year ago

I agree that waiting is a good idea. In fact, I'm only going to make sure that the current implementation is consistent with itself in this PR. I'll open a separate issue to implement the other types you described.

sammyjava commented 1 year ago

Finally, note that PanGeneSet was added in 5.1.0.3 and hasn't yet been implemented in graphql-server. The other 5.1.0.3 change is in GeneFamilyTally, which is not Annotatable and I'll update those in the 5.1.0.3 branch after merging in the latest greatest.

sammyjava commented 1 year ago

I agree that waiting is a good idea. In fact, I'm only going to make sure that the current implementation is consistent with itself in this PR. I'll open a separate issue to implement the other types you described.

I don't think we need to implement every InterMine class as a type in GraphQL. I've generally gone with things I think might be needed by components. Same with fields, stuff like overlappingFeatures or childFeatures is getting down into the weeds and need an actual use case to warrant implementation, IMO.

alancleary commented 1 year ago

I agree that waiting is a good idea. In fact, I'm only going to make sure that the current implementation is consistent with itself in this PR. I'll open a separate issue to implement the other types you described.

I don't think we need to implement every InterMine class as a type in GraphQL. I've generally gone with things I think might be needed by components. Same with fields, stuff like overlappingFeatures or childFeatures is getting down into the weeds and need an actual use case to warrant implementation, IMO.

For sure. I meant more like any type that's in the GraphQL API that is annotatable in InterMine should also be annotatable in GraphQL.

alancleary commented 1 year ago

Done in PR #51. Closing.

legumeinfo / graphql-server

Implement summaryPath request in search-genes to support pagination metadata #43