Retrieve "recommended" dataset when no ID given in filter string

welchr commented 3 years ago

Synopsis

The gene, recombination rate, and GWAS catalog endpoints can now be queried without supplying an id (or source, in the case of genes). LD support will have to come in a future update to LDServer.

Instead, you can supply a build query parameter, and the API server will select the recommended dataset for that given build.

Currently this is done with a DB view that tracks the recommended id per build and “dataset”:

locuszoom=# select * from rest.recommended;
 id | genome_build |    db_table
----+--------------+----------------
  2 | GRCh37       | gene_master
  1 | GRCh38       | gwascat_master
  1 | GRCh38       | gene_master
  2 | GRCh37       | gwascat_master
 15 | GRCh37       | recomb

The view is automatically updated as new datasets are loaded.
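
As a rough illustration of the lookup this view enables, here is a minimal Python sketch. It is not the API server's code: the rows are copied from the psql output above, and `RECOMMENDED_ROWS` and `recommended_id` are hypothetical names used only for this example.

```python
from typing import Optional

# Rows copied from the rest.recommended output above: (id, genome_build, db_table).
RECOMMENDED_ROWS = [
    (2, "GRCh37", "gene_master"),
    (1, "GRCh38", "gwascat_master"),
    (1, "GRCh38", "gene_master"),
    (2, "GRCh37", "gwascat_master"),
    (15, "GRCh37", "recomb"),
]


def recommended_id(build: str, db_table: str) -> Optional[int]:
    """Return the recommended dataset id for a build and table, or None if there is none."""
    for row_id, row_build, row_table in RECOMMENDED_ROWS:
        if row_build == build and row_table == db_table:
            return row_id
    return None
```

With the rows shown, a GRCh38 recombination query would find no recommended dataset, so a caller needs to handle the `None` case.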

The only notable difference in response for each query is a meta section:

   "meta": {
       "datasets": [
           {
               "version": "e100_r2020-07-14",
               "date_inserted": "2020-08-05T17:59:24.215339+00:00",
               "genome_build": "GRCh37",
               "id": 6,
               "name": "EBI GWAS Catalog"
           }
       ]
   }

This way, the client will know which dataset was selected by the API server and have information about it readily available, without having to execute a separate metadata query.

The datasets field is a list because the meta section is now always returned, not just in the “use recommended ID” case. In the original use cases, the user could supply a filter like ‘id in 23,24,25’, so a single query could return results from multiple datasets.
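
A client could read the selected datasets out of the meta section roughly as follows. This is a sketch only: the JSON is the example fragment above wrapped in braces so that it parses, and `selected_datasets` is a hypothetical helper, not part of any LocusZoom client.

```python
import json

# The meta example above, wrapped in an outer object so it is valid JSON.
RESPONSE_TEXT = """
{
    "meta": {
        "datasets": [
            {
                "version": "e100_r2020-07-14",
                "date_inserted": "2020-08-05T17:59:24.215339+00:00",
                "genome_build": "GRCh37",
                "id": 6,
                "name": "EBI GWAS Catalog"
            }
        ]
    }
}
"""


def selected_datasets(response: dict) -> list:
    """Return the dataset descriptions from a response, or an empty list if meta is missing."""
    return response.get("meta", {}).get("datasets", [])


# Iterate rather than index, since a filter over several ids can yield several datasets.
for dataset in selected_datasets(json.loads(RESPONSE_TEXT)):
    print(dataset["id"], dataset["genome_build"], dataset["name"], dataset["version"])
```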

Example queries

Only deployed on dev/staging currently.

Recombination

https://portaldev.sph.umich.edu/api_internal_dev/v1/annotation/recomb/results/?filter=chromosome eq '10' and position le 115067678 and position ge 114550452&build=GRCh37

Genes

https://portaldev.sph.umich.edu/api_internal_dev/v1/annotation/genes/?filter=chrom eq '10' and start le 115067678 and end ge 114550452&build=GRCh37

GWAS Catalog

https://portaldev.sph.umich.edu/api_internal_dev/v1/annotation/gwascatalog/results/?format=objects&sort=pos&filter=chrom eq '9' and pos ge 21751670 and pos le 22351670&build=GRCh37
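
The example URLs above contain spaces and single quotes, which most HTTP clients need percent-encoded. A minimal sketch using only the Python standard library follows; `build_query_url` is a hypothetical helper name, and the base URL is the dev recombination endpoint from the example above.

```python
from urllib.parse import quote, urlencode

# Dev endpoint taken from the recombination example above.
BASE = "https://portaldev.sph.umich.edu/api_internal_dev/v1/annotation/recomb/results/"


def build_query_url(base: str, filter_string: str, build: str) -> str:
    """Build a query URL that passes a filter without an id, plus a genome build."""
    params = {"filter": filter_string, "build": build}
    # quote_via=quote encodes spaces as %20 instead of the default '+'.
    return base + "?" + urlencode(params, quote_via=quote)


url = build_query_url(
    BASE,
    "chromosome eq '10' and position le 115067678 and position ge 114550452",
    "GRCh37",
)
print(url)
```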

Docs

Updated here: https://github.com/statgen/locuszoom-api/blob/feature/recommended-ids/docs/api.md

Deployment notes

Per Ryan, "DB changes are tracked in another repository": https://github.com/statgen/locuszoom-db/tree/feature/recommended-ids

welchr commented 3 years ago

@abought Recent commits also have:

If source/dataset ID and build are both provided, the server checks that the ID is valid for the given build.
catalog_version has been renamed to version for consistency.
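
The build check in the first item might look something like the sketch below. This is an assumption about the shape of the check, not the server's actual implementation: `DATASET_BUILDS` is a hypothetical in-memory mapping (populated here from the gene_master rows of the view shown earlier), whereas the real server would consult its dataset metadata in the database.

```python
# Hypothetical id-to-build mapping, using the gene_master rows from rest.recommended.
DATASET_BUILDS = {1: "GRCh38", 2: "GRCh37"}


def id_valid_for_build(dataset_id: int, build: str) -> bool:
    """Return True only when the dataset id is known and belongs to the given build."""
    return DATASET_BUILDS.get(dataset_id) == build


print(id_valid_for_build(2, "GRCh37"))
print(id_valid_for_build(2, "GRCh38"))
```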

Missing:

date_inserted would be difficult to include now, since it isn't known for all data sources; it only started being recorded with the GWAS catalog. It could possibly be added to everything in the future if it were useful.

This is deployed on the dev server as well.

abought commented 3 years ago

date_inserted would be difficult to include now, since it isn't known for all data sources; it only started being recorded with the GWAS catalog. It could possibly be added to everything in the future if it were useful.

nods Backfilling would certainly be painful (though it could probably be done by hand from a list of dataset release timestamps... if you ever decide that I've sassed you one too many times and need to be assigned this task to teach me a lesson)

Other options would include making the field nullable, or a "start the better process going forward" plan: on some services we've gotten by with synthetic data, like using the current timestamp as the default value during DB migration. (Newer datasets would receive newer timestamps in the future.) Insofar as date inserted != date released anyway, one more fudge factor isn't the worst thing ever. 😛

By no means is this field mandatory; certainly for LocusZoom purposes, no human will ever see this field directly! It's a little bit of polish that wouldn't affect a release in my eyes, at all. If we do plan to add the field in the future, my only target would be to use consistent meta nomenclature across all endpoints. (It just makes the API nicer to use.)

welchr commented 3 years ago

@abought How does deploying to production tomorrow (2/24) @ 8 PM EST sound? Friday evening or Sunday afternoon would also work well for me.

abought commented 3 years ago

@abought How does deploying to production tomorrow (2/24) @ 8 PM EST sound? Friday evening or Sunday afternoon would also work well for me.

Tell me more about how days of the week are different. It sounds like a fascinating concept.

Any of those deploy times sound good! AFAIK, the migration should be fairly small and quick. Let me know if there are prep or standby activities that I can do to help it go smoothly.

welchr commented 3 years ago

Let's go with 2/24 @ 8 PM then. Should be pretty quick... 🤞
