opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Variant Page widgets FE development #3318

Open buniello opened 6 months ago

buniello commented 6 months ago

As part of the Variant Page effort, we have agreed to start developing the first two widgets (sample data has been shared on Slack):

{
  "alleleOrigins": [
    "germline",
    "maternal",
    "paternal"
  ],
  "allelicRequirements": [
    "Autosomal recessive inheritance"
  ],
  "approvedSymbol": "POLG",
  "clinicalSignificances": [
    "pathogenic"
  ],
  "cohortPhenotypes": [
    "Alpers Syndrome",
    "Alpers diffuse degeneration of cerebral gray matter with hepatic cirrhosis",
    "Alpers disease",
    "Alpers progressive infantile poliodystrophy",
    "Alpers-Huttenlocher Syndrome",
    "Diffuse cerebral degeneration in infancy",
    "Infantile poliodystrophy",
    "Mitochondrial DNA Depletion Syndrome 4A",
    "Mitochondrial DNA depletion syndrome 4A (Alpers type)",
    "Neuronal degeneration of childhood with liver disease, progressive",
    "Poliodystrophia cerebri progressiva",
    "Progressive cerebral poliodystrophy",
    "Progressive sclerosing poliodystrophy"
  ],
  "confidence": "criteria provided, multiple submitters, no conflicts",
  "directionOnTrait": "risk",
  "disease": {
    "id": "MONDO_0008758",
    "name": "mitochondrial DNA depletion syndrome 4a"
  },
  "diseaseFromSource": "Progressive sclerosing poliodystrophy",
  "diseaseId": "EFO_0000508",
  "diseaseName": "genetic disorder",
  "literature": [
    "11431686",
    "11571332",
    "12565911",
    "14694057",
    "15122711",
    "15477547",
    "15824347",
    "16130100",
    "16177225",
    "17426723",
    "19251978",
    "21276947",
    "26942291",
    "26942292",
    "632821"
  ],
  "studyId": "RCV000014443",
  "targetId": "ENSG00000140521",
  "variantId": "15_89327201_C_T"
}
{
  "variantId": "17_7674230_C_T",
  "confidence": "high",
  "diseaseFromSource": "Li-Fraumeni syndrome",
  "literature": [
    "1978757",
    "1394225"
  ],
  "targetFromSourceId": "P04637",
  "target": {
    "id": "ENSG00000141510",
    "approvedSymbol": "TP53"
  },
  "disease": {
    "id": "MONDO_0018875",
    "name": "Li-Fraumeni syndrome"
  }
}
buniello commented 6 months ago

The widgets above have undergone a series of changes during the implementation process, including removal of the gene and DoE on trait columns and revision of sub-header text.

UniProt Variant widget new task:

buniello commented 6 months ago

Next in line for implementation are:

buniello commented 6 months ago

In silico predictors widget — First draft based on this schema shared in channel:


├───inSilicoPredictors: array 
      │   ├───element: struct 
      │   │   ├───method : string
      │   │   ├───assessment : string
      │   │   ├───flag : string
      │   │   ├───score : float

and sample dataset:

"inSilicoPredictors": [
  {
    "method": "alphaMissense",
    "score": 0.077,
    "assessment": "likely_benign"
  },
  {
    "method": "phred scaled CADD",
    "score": 7.293
  },
  {
    "method": "sift max",
    "score": 0.2,
    "assessment": "MODERATE"
  },
  {
    "method": "polyphen max",
    "score": 0.069,
    "assessment": "tolerated"
  },
  {
    "method": "loftee",
    "assessment": "high-confidence LoF variant",
    "flag": "PHYLOCSF_WEAK"
  }
]

Column 1: method e.g. alphaMissense — COLUMN HEADER: Method — Tooltip: method description (tbd)? (sort the Method column alphabetically)

Column 2: assessment e.g. likely_benign — COLUMN HEADER: Prediction — Tooltip: flag e.g. PHYLOCSF_WEAK (most severe?)

Column 3: score e.g. 0.077 — COLUMN HEADER: Score

NOTE for FE: we could use a colour code for the assessments (VarSome uses a traffic-light code). We have already done something similar in the pharmacogenetics widget (Confidence level column), and we could use the same palette.
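A minimal sketch of how the three columns could be derived from the `inSilicoPredictors` array, assuming that missing `score`, `assessment` and `flag` fields should simply render as empty cells. The `PredictorRow` type and `to_rows` helper are hypothetical names used for illustration, not part of the API.

```python
from typing import Optional, TypedDict


class PredictorRow(TypedDict):
    method: str                # "Method" column
    prediction: Optional[str]  # "Prediction" column (from assessment)
    score: Optional[float]     # "Score" column
    tooltip: Optional[str]     # flag, shown as a tooltip on the prediction


def to_rows(predictors: list[dict]) -> list[PredictorRow]:
    """Map inSilicoPredictors entries to table rows, sorted by method."""
    rows = [
        PredictorRow(
            method=p["method"],
            prediction=p.get("assessment"),  # absent for e.g. CADD
            score=p.get("score"),            # absent for e.g. loftee
            tooltip=p.get("flag"),
        )
        for p in predictors
    ]
    # Alphabetical sort on the Method column, ignoring case.
    return sorted(rows, key=lambda row: row["method"].lower())


# Sample dataset from this comment.
sample = [
    {"method": "alphaMissense", "score": 0.077, "assessment": "likely_benign"},
    {"method": "phred scaled CADD", "score": 7.293},
    {"method": "sift max", "score": 0.2, "assessment": "MODERATE"},
    {"method": "polyphen max", "score": 0.069, "assessment": "tolerated"},
    {
        "method": "loftee",
        "assessment": "high-confidence LoF variant",
        "flag": "PHYLOCSF_WEAK",
    },
]

rows = to_rows(sample)
for row in rows:
    print(row)
```

Sorting case-insensitively keeps `alphaMissense` ahead of `loftee` regardless of how method names are capitalised in the data.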

Some points on the in silico predictors widget discussed when @DSuveges was away:

ireneisdoomed commented 5 months ago

Pharmacogenetics widget

I've picked a few variants in genes known to play a role in drug responses. I think the variant page will be an interesting entry point for doctors/researchers that have observed a specific variant in a patient, so potentially they come without prior knowledge. The intention is to test the UX of these well known variants and see that it's not difficult to interact with the data and get insights.

| Variant | Gene | Evidence count |
| --- | --- | --- |
| rs1800460 | TPMT | 4 |
| rs9923231 | VKORC1 | 27 |
| rs67376798 | DPYD | 13 |
| rs3892097 | CYP2D6 | 3 |
| rs4149056 | SLCO1B1 | 159 |
| rs4244285 | CYP2C19 | 30 |

In this first iteration, we want to reproduce the current PGx widget without some of the variant metadata columns. A toy dataset with all evidence (236) for the above variants is available at gs://ot-team/irene/variant_page/pgx_30-05-2024.json

root
 |-- datasourceId: string (nullable = true)
 |-- drugs: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- evidenceLevel: string (nullable = true)
 |-- genotypeAnnotationText: string (nullable = true)
 |-- genotypeId: string (nullable = true)
 |-- isDirectTarget: boolean (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pgxCategory: string (nullable = true)
 |-- phenotypeFromSourceId: string (nullable = true)
 |-- phenotypeText: string (nullable = true)
 |-- studyId: string (nullable = true)

An important consideration with this data is that, unlike other sources, the evidence is not indexed by variantId. Here we have more granularity, with genotype identifiers. So, to choose which PGx evidence to show on the variant page, we will match the chromosome and position of the variant ID against the chromosome and position of the genotype ID. For example, for the variant 16_31096368_C_T, we display all evidence where genotypeId starts with `16_31096368`.
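The matching rule could be sketched as below. This is only an illustration: `locus_key` and `pgx_evidence_for_variant` are hypothetical helper names, and the third record in the example is made up to show a near miss.

```python
def locus_key(identifier: str) -> tuple[str, str]:
    """Return (chromosome, position) from a variant ID or genotype ID.

    Both identifiers are assumed to follow the chr_pos_ref_... layout.
    """
    chromosome, position = identifier.split("_")[:2]
    return chromosome, position


def pgx_evidence_for_variant(variant_id: str, evidence: list[dict]) -> list[dict]:
    """Keep the evidence whose genotypeId shares chromosome and position."""
    key = locus_key(variant_id)
    return [e for e in evidence if locus_key(e["genotypeId"]) == key]


evidence = [
    {"genotypeId": "12_21178615_T_T,T"},
    {"genotypeId": "12_21178615_T_C,T"},
    # Hypothetical record: same text prefix, different position.
    {"genotypeId": "12_211786150_A_A,G"},
]

matches = pgx_evidence_for_variant("12_21178615_T_C", evidence)
print([m["genotypeId"] for m in matches])
```

Comparing the parsed chromosome and position, rather than doing a plain string prefix check, avoids matching a longer position that happens to start with the same digits.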

In terms of sorting, we want to prioritise the most confident evidence (evidenceLevel) and, if it is not too complicated, I'd suggest showing evidence that reports toxicity first (pgxCategory).
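One possible reading of this ordering, as a sketch: sort by evidence level first and use toxicity only as a tie-breaker within a level. The level order below follows the PharmGKB clinical annotation levels (1A, 1B, 2A, 2B, 3, 4). Placing unknown or missing levels last is an assumption, and the records are trimmed for illustration.

```python
LEVEL_ORDER = ["1A", "1B", "2A", "2B", "3", "4"]
LEVEL_RANK = {level: rank for rank, level in enumerate(LEVEL_ORDER)}


def sort_key(evidence: dict) -> tuple[int, bool]:
    # Unknown or missing levels go last (assumption).
    level_rank = LEVEL_RANK.get(evidence.get("evidenceLevel"), len(LEVEL_ORDER))
    # False sorts before True, so toxicity evidence comes first within a level.
    not_toxicity = evidence.get("pgxCategory") != "toxicity"
    return level_rank, not_toxicity


evidence = [
    {"drugFromSource": "fluvastatin", "evidenceLevel": "1A", "pgxCategory": "metabolism/pk"},
    {"drugFromSource": "lopinavir", "evidenceLevel": "3", "pgxCategory": "metabolism/pk"},
    {"drugFromSource": "lovastatin", "evidenceLevel": "1A", "pgxCategory": "toxicity"},
]

sorted_evidence = sorted(evidence, key=sort_key)
print([e["drugFromSource"] for e in sorted_evidence])
```

Because `sorted` is stable, records with the same level and category keep their original relative order.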

ireneisdoomed commented 5 months ago

The approach used to generate the sample data above was not good: I used the files instead of just exporting the response of the API query.

I've generated a very similar dataset extracted from the API: pharmacogenomics_sample.json

gist to reproduce query
```
import json

import requests

url = "https://api.platform.opentargets.org/api/v4/graphql"

query = """
query PharmacogenomicsQuery($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    id
    pharmacogenomics {
      variantRsId
      genotypeId
      isDirectTarget
      drugFromSource
      drugId
      phenotypeFromSourceId
      genotypeAnnotationText
      phenotypeText
      pgxCategory
      evidenceLevel
      datasourceId
      studyId
      literature
    }
  }
}
"""

genes_to_query = [
    "ENSG00000134538",
    "ENSG00000167397",
    "ENSG00000165841",
    "ENSG00000188641",
    "ENSG00000100197",
    "ENSG00000137364",
]
variants_of_interest = [
    "rs4149056",
    "rs67376798",
    "rs4244285",
    "rs3892097",
    "rs9923231",
    "rs1800460",
]

all_results = []
for ensembl_id in genes_to_query:
    variables = {"ensemblId": ensembl_id}
    response = requests.post(url, json={"query": query, "variables": variables})
    response.raise_for_status()
    data = response.json()
    pharmacogenomics_data = data["data"]["target"]["pharmacogenomics"]
    # Keep only the variants of interest and drop the rsID from each record.
    filtered_data = [
        {k: v for k, v in entry.items() if k != "variantRsId"}
        for entry in pharmacogenomics_data
        if entry["variantRsId"] in variants_of_interest
    ]
    all_results.extend(filtered_data)

# Write one JSON object per line.
with open("pharmacogenomics_sample.json", "w") as outfile:
    for entry in all_results:
        json.dump(entry, outfile)
        outfile.write("\n")
```

🚨Something important: the API for this widget is going to change once the work in #3205 is finished

buniello commented 5 months ago

@gjmcn

Pharmacogenetics widget sample data (see here for ref)

{"genotypeId": "12_21178615_T_T,T", "isDirectTarget": false, "drugFromSource": "fluvastatin", "drugId": "CHEMBL2220442", "phenotypeFromSourceId": null, "genotypeAnnotationText": "Patients with the rs4149056 TT genotype may have decreased concentrations of fluvastatin as compared to patients with the CC or CT genotypes. However, conflicting evidence has been reported. Other genetic and clinical factors may also affect fluvastatin concentrations. This annotation only covers the pharmacokinetic relationship between rs4149056 and fluvastatin and does not include evidence about clinical outcomes.", "phenotypeText": "decreased concentrations of fluvastatin", "pgxCategory": "metabolism/pk", "evidenceLevel": "1A", "datasourceId": "pharmgkb", "studyId": "1451244700", "literature": ["17015053", "30989645"]}
{"genotypeId": "12_21178615_T_C,T", "isDirectTarget": false, "drugFromSource": "lovastatin", "drugId": "CHEMBL503", "phenotypeFromSourceId": null, "genotypeAnnotationText": "TPatients with the rs4149056 CT genotype may have an increased risk of lovastatin-related myopathy when treated with lovastatin as compared to patients with the TT genotype. Other genetic and clinical factors may also influence risk of toxicity to lovastatin.", "phenotypeText": "increased risk of lovastatin-related myopathy", "pgxCategory": "toxicity", "evidenceLevel": "1A", "datasourceId": "pharmgkb", "studyId": "1451465324", "literature": ["34114646"]}
{"genotypeId": "12_21178615_T_T,T", "isDirectTarget": false, "drugFromSource": "lopinavir", "drugId": "CHEMBL729", "phenotypeFromSourceId": null, "genotypeAnnotationText": "Patients with HIV and the TT genotype may have decreased plasma levels of lopinavir as compared to patients with the CC genotype. However, one study failed to find this association. Other genetic and clinical factors may also influence lopinavir concentrations in a patients. This annotation only covers the pharmacokinetic relationship between rs4149056 and lopinavir and does not include evidence about clinical outcomes.", "phenotypeText": "decreased plasma levels of lopinavir", "pgxCategory": "metabolism/pk", "evidenceLevel": "3", "datasourceId": "pharmgkb", "studyId": "1444704359", "literature": ["20051929", "20078617", "21743379", "23503447", "32022294", "27142945", "28718515"]}

Column 1: genotypeId e.g. 12_21178615_T_T,T -- COLUMN HEADER: Genotype ID -- Tooltip on header: [VCF-style (chr_pos_ref_allele1,allele2). See here for more details.]

Column 2: drugFromSource e.g. fluvastatin (hyperlink to drugId e.g. https://platform.opentargets.org/drug/drugId) -- COLUMN HEADER: Drug(s)

Column 3: phenotypeText [with tooltip: genotypeAnnotationText] e.g. decreased concentrations of fluvastatin -- COLUMN HEADER: Drug Response Phenotype

Column 4: pgxCategory e.g. metabolism/pk -- COLUMN HEADER: Drug Response Category

Column 5: isDirectTarget e.g. false -- COLUMN HEADER: Direct Drug Target -- see visualisation for this column in current widget

Column 6: evidenceLevel e.g. 1A -- COLUMN HEADER: Confidence Level (colour coded) -- Tooltip: As defined by PharmGKB ClinAnn Levels [column with sorting arrow]

Column 7: datasourceId e.g. pharmgkb (hyperlinked to studyId e.g. https://www.pharmgkb.org/clinicalAnnotation/1451244700) -- COLUMN HEADER: Source

Column 8: Literature e.g. [17015053, 30989645] -- COLUMN HEADER: Literature
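As a rough illustration of the column mapping above, the sketch below turns one evidence record into a row, building the two hyperlinks from the URL patterns given in the column descriptions. `link` and `pgx_row` are hypothetical helpers, the annotation text is shortened, and rendering a missing `drugId` as a plain label is an assumption.

```python
from typing import Optional

PLATFORM_DRUG_URL = "https://platform.opentargets.org/drug/"
PHARMGKB_CLINANN_URL = "https://www.pharmgkb.org/clinicalAnnotation/"


def link(label: str, url: Optional[str]) -> dict:
    """A cell that renders as a hyperlink when url is set, else as plain text."""
    return {"label": label, "url": url}


def pgx_row(e: dict) -> dict:
    drug_id = e.get("drugId")
    return {
        "Genotype ID": e["genotypeId"],
        "Drug(s)": link(
            e["drugFromSource"],
            (PLATFORM_DRUG_URL + drug_id) if drug_id else None,
        ),
        "Drug Response Phenotype": {
            "label": e.get("phenotypeText"),
            "tooltip": e.get("genotypeAnnotationText"),
        },
        "Drug Response Category": e.get("pgxCategory"),
        "Direct Drug Target": e.get("isDirectTarget"),
        "Confidence Level": e.get("evidenceLevel"),
        "Source": link(e["datasourceId"], PHARMGKB_CLINANN_URL + e["studyId"]),
        "Literature": e.get("literature") or [],
    }


record = {
    "genotypeId": "12_21178615_T_T,T",
    "isDirectTarget": False,
    "drugFromSource": "fluvastatin",
    "drugId": "CHEMBL2220442",
    "genotypeAnnotationText": "(shortened for this example)",
    "phenotypeText": "decreased concentrations of fluvastatin",
    "pgxCategory": "metabolism/pk",
    "evidenceLevel": "1A",
    "datasourceId": "pharmgkb",
    "studyId": "1451244700",
    "literature": ["17015053", "30989645"],
}

row = pgx_row(record)
print(row["Drug(s)"]["url"])
print(row["Source"]["url"])
```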

buniello commented 5 months ago

@gjmcn - let me know if there are questions on this!

Credible sets Widget

Sample dataset used for the table:

{
  "variantId": "10_100315722_G_A",
  "study": {
    "id": "GCST001217",
    "traitFromSource": "Metabolic traits",
    "disease": {
      "id": "EFO_0004725",
      "name": "Metabolic traits"
    }
  },
  "pValueMantissa": 3.0,
  "pValueExponent": -57,
  "beta": 0.124,
  "ldPopulationStructure": [
    {
      "ldPopulation": "nfe",
      "relativeSampleSize": 1.0
    }
  ],
  "finemappingMethod": "pics",
  "l2g": {
    "score": 0.36516955494880676,
    "target": {
      "id": "ENSG00000107593",
      "approvedSymbol": "PKD2L1"
    }
  },
  "locus": [
    {
      "variantId": "10_100315722_G_A",
      "r2Overall": 1.0000000000000049,
      "posteriorProbability": 1.0,
      "standardError": 0.9999989208874888,
      "is95CredibleSet": true,
      "is99CredibleSet": true
    }
  ]
}

Credible Sets

Column 1: variantId e.g. 10_100315722_G_A — Column Header: Lead Variant NOTE:

Column 2: From “disease”: name e.g. Metabolic traits hyperlinked to Id e.g. https://platform.opentargets.org/disease/`EFO_0004725` — Column Header: Trait

Column 3: From “study”: id e.g. GCST001217 hyperlinked to https://www.ebi.ac.uk/gwas/studies/id — Column Header: Study NOTE: this row will also open a study metadata drawer in future iteration [metadata drawer including PMID, ancestry, sample size, author name etc tbd]

Column 4: pValueMantissa & pValueExponent e.g. 3.0e-57 — Column Header: P-Value (sorting arrow) NOTE: table will be sorted by this value

Column 5: beta e.g. 0.124 — Column header: Beta — Tooltip: Beta with respect to the ALT allele

Column 6: From “locus”: r2Overall e.g. 1.00 (two decimal places) — Column Header: LD (r2) — Tooltip: Linkage disequilibrium with the queried variant

Column 7: finemappingMethod e.g. pics — Column Header: Finemapping method

Column 8: From “l2g - target”: approvedSymbol e.g. PKD2L1 hyperlinked to https://platform.opentargets.org/target/id — Column Header: Top L2G — Tooltip: Top gene prioritised by our locus-to-gene model

Column 9: From “l2g”: score e.g. 0.365 (three decimal places) — Column Header: L2G score (sorting arrow)

Column 10: From “locus”: number of variantId fields within the locus object, e.g. 1 for the example used in this table — Column Header: Credible Set Size NOTE: this row will also open a drawer in a future iteration [locus drawer with PIP, variants in set, LD etc. tbd]
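A sketch of how the derived cells (P-Value, LD, L2G score, Credible Set Size) could be computed from one record of the sample dataset. `credible_set_row` is a hypothetical helper, and the record is trimmed to the fields used here. Sorting on the (exponent, mantissa) pair instead of a reconstructed float is a suggestion: it assumes mantissas are normalised to the same range, and it avoids very small p-values underflowing to 0.0.

```python
def credible_set_row(cs: dict, queried_variant_id: str) -> dict:
    locus = cs.get("locus") or []
    # LD is read from the locus entry that matches the queried variant.
    match = next((v for v in locus if v["variantId"] == queried_variant_id), None)
    r2 = match.get("r2Overall") if match else None
    l2g = cs.get("l2g") or {}
    score = l2g.get("score")
    return {
        "Lead Variant": cs["variantId"],
        "P-Value": f'{cs["pValueMantissa"]}e{cs["pValueExponent"]}',
        # Smaller tuples mean smaller p-values (assumes normalised mantissas).
        "pValueSortKey": (cs["pValueExponent"], cs["pValueMantissa"]),
        "Beta": cs.get("beta"),
        "LD (r2)": None if r2 is None else f"{r2:.2f}",
        "Top L2G": (l2g.get("target") or {}).get("approvedSymbol"),
        "L2G score": None if score is None else f"{score:.3f}",
        "Credible Set Size": len(locus),
    }


sample = {
    "variantId": "10_100315722_G_A",
    "pValueMantissa": 3.0,
    "pValueExponent": -57,
    "beta": 0.124,
    "l2g": {
        "score": 0.36516955494880676,
        "target": {"id": "ENSG00000107593", "approvedSymbol": "PKD2L1"},
    },
    "locus": [
        {
            "variantId": "10_100315722_G_A",
            "r2Overall": 1.0000000000000049,
            "posteriorProbability": 1.0,
        }
    ],
}

row = credible_set_row(sample, "10_100315722_G_A")
print(row)
```

If the locus entries carry no `r2Overall`, the LD cell is simply left empty rather than raising.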

Json file for sample data: test_variant_page7.json

Just adding here a screenshot from the relevant widget in current OTG variant page (for reference)

Screenshot 2024-05-30 at 10 19 44
d0choa commented 5 months ago

Looks awesome already. The two columns that have a little bit of magic, in my opinion, are Column 1 and Column 6. I will give a slightly longer explanation in case there is any confusion, but I think @buniello's description is already good.

@buniello for the next iteration we could decide if we want to collapse Column 8 and Column 9. Let's see how it looks now, but I can see some width savings there.

xyg123 commented 5 months ago

Here's the updated joining process. I thought I should post it, since the whole variant index can be annotated this way. I did it for the GWAScat curated PICS results, since those are not going to change going forward.

I start by generating the credible set + l2g dataframe from joining together the credible_set, locus_to_gene_predictions, study_index and gene_index.

Its schema looks like this:
```
from pyspark.sql.functions import col, struct

# df_filtered is the joined credible set + L2G dataframe described above.
full_credsets = df_filtered.select(
    col("variantId").alias("leadVariantId"),
    struct(
        col("studyId").alias("id"),
        col("traitFromSource"),
        struct(
            col("traitFromSourceMappedIds").alias("id"),
            col("traitFromSource"),
        ).alias("disease"),
    ).alias("study"),
    col("pValueMantissa"),
    col("pValueExponent"),
    col("beta"),
    col("ldPopulationStructure"),
    col("finemappingMethod"),
    struct(
        col("score"),
        struct(col("geneId").alias("id"), col("approvedSymbol")).alias("target"),
    ).alias("l2g"),
    col("locus"),
)
```

Then I join the variant index to the dataframe above, looking for whenever a variant is found within a locus, and if so, extract its associated pvalue + posteriorprob.

Joining process
```
import pyspark.sql.functions as f

variant_index = (
    session.spark.read.parquet(
        f"{release_path}/{release_ver}/variant_index", recursiveFileLookup=True
    )
    .limit(1000)
    .select("variantId")
)

# A variant joins to a credible set when it appears anywhere in its locus.
join_condition = f.expr(
    """
    array_contains(transform(locus, x -> x.variantId), variantId)
    """
)
joined_df = full_credsets.join(variant_index, join_condition).persist()

# Pull out the posterior probability of the joined variant from the locus.
variant_full_credset = joined_df.withColumn(
    "posteriorProbability",
    f.expr(
        """
        element_at(filter(
            transform(locus, x -> if(x.variantId = variantId, x.posteriorProbability, null)),
            x -> x is not null
        ), 1)
        """
    ),
)
```

There's a minor inconvenience at the moment: the pValueExponent, pValueMantissa and beta columns are not populated for the locus object. This makes sense for the tag SNPs in the PICS output (they didn't have them to start with), but it means there's an extra step to check whether the variant is the lead and either annotate it with the stats or fill them with null.

Final output
```
variant_full_credset.withColumn(
    "pValueExponent",
    f.when(
        f.col("variantId") == f.col("leadVariantId"), f.col("pValueExponent")
    ).otherwise(f.lit(None)),
).withColumn(
    "pValueMantissa",
    f.when(
        f.col("variantId") == f.col("leadVariantId"), f.col("pValueMantissa")
    ).otherwise(f.lit(None)),
).withColumn(
    "beta",
    f.when(
        f.col("variantId") == f.col("leadVariantId"), f.col("beta")
    ).otherwise(f.lit(None)),
).select(
    col("variantId"),
    col("study"),
    col("pValueMantissa"),
    col("pValueExponent"),
    col("beta"),
    col("posteriorProbability"),
    col("ldPopulationStructure"),
    col("finemappingMethod"),
    col("l2g"),
    col("locus"),
)
```

[test_variant_page8.json](https://github.com/user-attachments/files/15766136/test_variant_page8.json)
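For anyone who wants to check the lead/tag logic without a Spark session, here is a plain-Python sketch of the same per-row rule: take the posterior probability from the matching locus entry, and keep the p-value and beta only when the variant is the lead. `annotate_variant` is a hypothetical helper and the identifiers and values are made up for the example.

```python
from typing import Optional

STAT_FIELDS = ("pValueMantissa", "pValueExponent", "beta")


def annotate_variant(variant_id: str, credible_set: dict) -> Optional[dict]:
    """Return the credible set as seen from variant_id, or None if absent."""
    entry = next(
        (x for x in credible_set["locus"] if x["variantId"] == variant_id), None
    )
    if entry is None:
        return None  # the variant is not in this locus, so there is no join
    is_lead = variant_id == credible_set["leadVariantId"]
    out = {k: v for k, v in credible_set.items() if k != "leadVariantId"}
    out["variantId"] = variant_id
    out["posteriorProbability"] = entry.get("posteriorProbability")
    for field in STAT_FIELDS:
        # Summary statistics are only known for the lead variant.
        out[field] = credible_set.get(field) if is_lead else None
    return out


# Made-up credible set with one lead and one tag variant.
credible_set = {
    "leadVariantId": "1_100_A_G",
    "pValueMantissa": 2.5,
    "pValueExponent": -12,
    "beta": 0.3,
    "locus": [
        {"variantId": "1_100_A_G", "posteriorProbability": 0.9},
        {"variantId": "1_250_C_T", "posteriorProbability": 0.1},
    ],
}

as_lead = annotate_variant("1_100_A_G", credible_set)
as_tag = annotate_variant("1_250_C_T", credible_set)
print(as_lead["beta"], as_tag["beta"], as_tag["posteriorProbability"])
```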
xyg123 commented 5 months ago

To address this comment from @d0choa : "ideally a variant that is sometimes a lead and sometimes a tag. That would help FE consider all cases"

I tried to get SNPs matching this description in the GWAS sumstats PICS outputs, which again did not have the p-value fields populated for the tags, so I've switched to the Finngen susie outputs. I took the first SNP I saw that matched this criterion:

test_single_variant_page.json

And ~250 other SNPs in case you need something bigger:

test_variant_page9.json

buniello commented 5 months ago
xyg123 commented 5 months ago

I anticipate that we'll need to go back and forth a few times to refine this, but here's the initial version of the widget. I've made an effort to match the input SNPs to those listed in "test_variant_page9.json". This way, you can create the test variant page incorporating both the credible set widget and the QTL widget.

Also, this file only contains SNPs which are both a lead and sometimes a tag.

{
  "variantId": "2_8302417_G_A",
  "study": {
    "id": "GTEx_brain_putamen_ENST00000668369",
    "studyType": "eqtl",
    "projectId": "GTEx"
  },
  "pValueMantissa": 2.359,
  "pValueExponent": -8,
  "beta": 0.694055,
  "posteriorProbability": 0.0248072063095605,
  "tissueFromSourceId": "UBERON_0001874",
  "target": {
    "id": "ENSG00000236790",
    "approvedSymbol": "LINC00299"
  },
  "finemappingMethod": "SuSie",
  "locus": [
    {
      "variantId": "2_8300216_T_C",
      "posteriorProbability": 0.0711130529884503,
      "pValueMantissa": 1.124,
      "pValueExponent": -8,
      "logBF": 17.6040572828625,
      "beta": 0.642905,
      "standardError": 0.106521,
      "is95CredibleSet": true,
      "is99CredibleSet": true
    }, 
         ...
    }

test_qtl_widget.json

gjmcn commented 5 months ago

@xyg123 For the GWAS credible sets widget, I just switched to test_single_variant_page.json for testing the widget but we seem to have lost the r2Overall property from the locus entries?

d0choa commented 5 months ago

This is a gap in the current data we are generating. @gjmcn we need to think about this, because it's not trivial to generate this column in some contexts. You can skip the column for now, until we figure out what to do.

@xyg123, @addramir we should think about this. I can see different scenarios. We might not have the R^2 because:

xyg123 commented 5 months ago

It is because I am using the Finngen data for this. The alternative was to use the GWAS Catalog PICS output, in which case we would lose the p-value and beta fields for tag SNPs. Happy to generate that if you would prefer.

gjmcn commented 5 months ago

Also, following the GWAS credible sets data change: study.disease.name, which we used for the trait column, has gone. Can we use study.traitFromSource or study.disease.traitFromSource for the trait column now?

d0choa commented 5 months ago

The studies are expected in the same data structure as the GWAS credible sets. @xyg123 is it easy to use the same object?

gjmcn commented 5 months ago

Just to clarify, my comment about study.disease.name disappearing is about the GWAS credible sets - it is a result of @xyg123 using a new approach to process the data.

xyg123 commented 5 months ago

Sorry, it is just a matter of renaming the column from study.disease.traitFromSource to study.disease.name. Here you go (same SNPs as QTL widget): test_credible_set.json

The issue with the r2Overall is due to the different data source (Finngen instead of GWAScatalog), and Finngen doesn't provide the r2 values. I am still processing the data with the same approach.

xyg123 commented 5 months ago

Addressing @buniello 's request to have tissue labels mapped to the qtl widget test set, there were 32 entries in the test data that didn't match an uberon id:

test_qtl_widget2.json

{
  "variantId": "2_8302417_G_A",
  "study": {
    "id": "GTEx_brain_putamen_ENST00000668369",
    "projectId": "GTEx",
    "studyType": "eqtl"
  },
  "pValueMantissa": 2.359,
  "pValueExponent": -8,
  "beta": 0.694055,
  "posteriorProbability": 0.0248072063095605,
  "tissue": {
    "id": "UBERON_0001874",
    "label": "putamen",
    "organs": ["brain"],
    "anatomicalSystems": ["nervous system"]
  },
  "target": {
    "approvedSymbol": "LINC00299",
    "id": "ENSG00000236790"
  },
  "finemappingMethod": "SuSie",
  "locus": [
    {
      "variantId": "2_8300216_T_C",
      "posteriorProbability": 0.0711130529884503,
      "pValueMantissa": 1.124,
      "pValueExponent": -8,
      "logBF": 17.6040572828625,
      "beta": 0.642905,
      "standardError": 0.106521,
      "is95CredibleSet": true,
      "is99CredibleSet": true
    }, ...
}
d0choa commented 5 months ago

In case it's useful for future reference, this is the file that the Platform uses to build the tissue metadata gs://open-targets-data-releases/24.03/input/expression-inputs/tissue-translation-map.json

buniello commented 5 months ago

@gjmcn as discussed in the office, the QTLs credible set widget will almost be a clone of the GWAS credible set one. Below are the main differences (which we can discuss tomorrow):

  1. Study links for the STUDY column: discussed with Yakov. For this first implementation, we can have projectId (string) hyperlinked to this page in all cases: https://www.ebi.ac.uk/eqtl/Studies/. The study metadata card will add more context later.
  2. Additional columns:
     - After the STUDY column: studyType. Column header: TYPE
     - After the TYPE column: from the "tissue" object, label hyperlinked to id. Column header: TISSUE
  3. Replacing the TOP L2G column (for now) with a GENE column (no tooltip): from the "target" object, approvedSymbol hyperlinked to https://platform.opentargets.org/target/`id`

General notes: shall we run/display top L2G with QTLs? Shall we display logBF anywhere?

xyg123 commented 5 months ago

Not sure about displaying the logBF, but we should definitely make it accessible somewhere: users will need it to run colocalisation.

buniello commented 5 months ago

@gjmcn

Discussed changes to the current version of the QTL credible sets widget:

buniello commented 4 months ago

@gjmcn - discussed today

  1. the most severe consequence field of metadata section on variant page will display mostSevereConsequence label hyperlinked to http://purl.obolibrary.org/obo/`mostSevereConsequenceId` --- please use identifiers to build the right link (see comment below)
  2. The in silico predictors widget will display two sources for the data - VEP, gnomAD (homepages)
d0choa commented 4 months ago

Please try to use identifiers.org to build the link. You should find the same logic in Open Targets Genetics or ClinVar widgets.
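A small sketch of what this could look like, assuming `mostSevereConsequenceId` arrives in the underscore form (for example `SO_0001583`) and that identifiers.org resolves the compact `PREFIX:ID` form. `consequence_url` is a hypothetical helper; the existing widgets remain the reference for the actual logic.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"


def consequence_url(most_severe_consequence_id: str) -> str:
    """Build an identifiers.org link from an ID such as 'SO_0001583'."""
    prefix, _, local_id = most_severe_consequence_id.partition("_")
    if not prefix or not local_id:
        raise ValueError(f"Unexpected identifier: {most_severe_consequence_id!r}")
    # identifiers.org compact identifiers use 'PREFIX:LOCAL_ID'.
    return f"{IDENTIFIERS_ORG}{prefix}:{local_id}"


url = consequence_url("SO_0001583")
print(url)
```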

buniello commented 4 months ago

this actually reminds me that we could re-use the VEP chip with variant consequence from the ClinVar widget in VEP widget

buniello commented 4 months ago

@gjmcn: Please note that the new variant index API field hgvsId should be displayed in the variant page subheader together with rsIds and dbXrefs

DSuveges commented 4 months ago

Some updates on the credible set schema:

prashantuniyal02 commented 3 months ago

3405 - This has been tested in the 14/08 meeting. It looks good. @gjmcn , FE work can start on the pharmacogenomics widget.