statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.

Versioning and docs for API #111

Open welchr opened 6 years ago

welchr commented 6 years ago

Current API endpoint:

http://pheweb.sph.umich.edu:5000/api/variant/10:114758349-C-T

If the T2D portal is going to hit this for their UKBB PheWAS queries, we would need to provide a versioned URL in case of backwards-incompatible changes. We would also need some documentation, or possibly a metadata endpoint, describing the current dataset (e.g. which imputation panel was used, which genome build, etc.).
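To make the versioning request concrete, here is a minimal client-side sketch. It assumes the version would be carried as a path prefix (`/api/v1/...`); that layout, and the helper names `parse_variant_id` and `versioned_variant_url`, are illustrative assumptions and not something PheWeb currently provides.

```python
import re
from typing import NamedTuple

# Host taken from the current endpoint quoted above.
BASE_URL = "http://pheweb.sph.umich.edu:5000"
# Assumption: the API version is a path prefix such as /api/v1/.
API_VERSION = "v1"


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Matches IDs shaped like "10:114758349-C-T" (chrom:pos-ref-alt).
_VARIANT_RE = re.compile(r"^(\w+):(\d+)-([ACGT]+)-([ACGT]+)$")


def parse_variant_id(variant_id: str) -> Variant:
    """Split a chrom:pos-ref-alt string into its four parts."""
    match = _VARIANT_RE.match(variant_id)
    if match is None:
        raise ValueError(f"unrecognized variant id: {variant_id!r}")
    chrom, pos, ref, alt = match.groups()
    return Variant(chrom, int(pos), ref, alt)


def versioned_variant_url(variant_id: str, version: str = API_VERSION) -> str:
    """Build a hypothetical versioned URL for the variant endpoint."""
    v = parse_variant_id(variant_id)
    return f"{BASE_URL}/api/{version}/variant/{v.chrom}:{v.pos}-{v.ref}-{v.alt}"


if __name__ == "__main__":
    print(versioned_variant_url("10:114758349-C-T"))
```

With a prefix like this, a backwards-incompatible change could be released under a new version while existing clients keep calling the old path.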

@pjvandehaar Would you be able to look at this sometime soon?

welchr commented 6 years ago

Also, does this API endpoint include the phenotype groupings? Is that what is prefixed onto the phenostrings?

pjvandehaar commented 6 years ago

In a title like 20002_1220: Non-cancer illness code, self-reported: diabetes, 20002_1220 is the UKB code for the phenotype, and Non-cancer... is the UKB name for the phenotype. In the JSON they're called phenocode and phenostring. We don't have categories loaded for this dataset, but I think I can add them quickly if you would use them.
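For anyone consuming the titles directly, the split described above can be sketched as follows. The function name `split_title` is hypothetical, and the sketch assumes the phenocode never contains ": ", so only the first separator is used.

```python
def split_title(title: str) -> dict:
    """Split 'phenocode: phenostring' on the first ': ' only.

    The phenostring may itself contain ': ', so later separators are kept.
    """
    phenocode, sep, phenostring = title.partition(": ")
    if not sep:
        raise ValueError(f"no ': ' separator in title: {title!r}")
    return {"phenocode": phenocode, "phenostring": phenostring}


if __name__ == "__main__":
    print(split_title("20002_1220: Non-cancer illness code, self-reported: diabetes"))
```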

pjvandehaar commented 6 years ago

Unfortunately, the categories are not very balanced, so they will be hard to render on a PheWAS plot. There are large categories and dozens of categories with <10 phenotypes. To display them, I can manually merge some similar categories.

pjvandehaar commented 6 years ago

I believe that this API endpoint is still missing some data. Some phenotypes are missing sections of the genome, because we ran out of disk space on the loading machine, and fixing it was never a high priority. We've discussed this before, but I still want to be sure you're okay with that.

pjvandehaar commented 6 years ago

For the metadata endpoint, what are you looking for? How's /api/v1/metadata.json:

{
  "build": "GRCh37",
  "description": "Analysis of UKB data by Ben Neale's lab, round 1, imputed using UK10K and HRC.",
  "link": "http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank"
}

Should imputation get its own field? How are you hoping to use it? Perhaps maximum sample size as well?

welchr commented 6 years ago

Apologies, Peter - I think we can actually put this on hold for now (at least from my end.)

So far the LZ API seems to be able to serve the UKBB SAIGE HRC analysis, and my guess is we will probably go with that barring any major issues appearing. It makes it easier on the Broad since they're already set up to handle our PheWAS requests and responses.

In the long run, though, I don't think Postgres will be able to handle much more than this. So we may need to revert to a PheWeb-like API backed by your storage solution. It's still worth considering the points above about versioning, docs, metadata, etc., I think.

Regarding metadata - an imputation field would be good, to let people know which panel was used when imputing the genotypes used in the analysis. Something as simple as "HRC" or "1000G Phase 3" or "TOPMED" is at least helpful and better than nothing. Including it in the description is another option, but without making it a required field, it can be forgotten.
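One way to keep the field from being forgotten is a check at load time. The sketch below is only an illustration: the set of required fields (including `imputation`) and the function name `missing_metadata_fields` are assumptions, not an agreed schema.

```python
# Assumption: these four fields are required; "imputation" is the proposed addition.
REQUIRED_FIELDS = ("build", "description", "link", "imputation")


def missing_metadata_fields(metadata: dict) -> list:
    """Return the required fields that are absent, non-string, or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, str) or not value.strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    draft = {
        "build": "GRCh37",
        "description": "Analysis of UKB data by Ben Neale's lab, round 1, imputed using UK10K and HRC.",
        "link": "http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank",
    }
    # The draft has no imputation field, so the check reports it.
    print(missing_metadata_fields(draft))
```

A dataset loader could refuse to publish the metadata endpoint until this list is empty.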