opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Download ChEMBL 28 ES dump #1419

Closed d0choa closed 3 years ago

d0choa commented 3 years ago

ChEMBL 28 is now live. Can we:

We want to anticipate any unexpected changes and also scope the work that will be needed to accommodate the Black Box warnings.

cmalangone commented 3 years ago

Next Tuesday, @JarrodBaker and I are going to add a few features to the current PIS implementation:

  • Run PIS internally (using the EBI intranet)
  • Integrate the script to compare the schema (Jarrod already did it)
  • Copy the files using rclone (already implemented for copying from gs to ftp)

JarrodBaker commented 3 years ago

I've added a repository with some helper scripts to make it easier to compare ChEMBL releases.

I've compared the differences for the 27 -> 28 chembl release cycle, and it looks like there are no breaking changes for us.
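For anyone who wants to repeat this kind of check, here is a minimal sketch of how two simplified mappings could be compared. It is not the code in the helper repository. The function names `flatten` and `diff_schemas` and the toy `old`/`new` mappings are made up for illustration, and it assumes each mapping is a nested dict of field name to type string (like the simplified schema further down this thread).

```python
from typing import Any, Dict, List, Mapping


def flatten(mapping: Mapping[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten a nested {field: type-or-object} mapping into dotted paths."""
    flat: Dict[str, str] = {}
    for name, value in mapping.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, Mapping):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)
    return flat


def diff_schemas(old: Mapping[str, Any], new: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Report fields that were added, removed, or changed type between releases."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        "added": sorted(new_flat.keys() - old_flat.keys()),
        "removed": sorted(old_flat.keys() - new_flat.keys()),
        "changed": sorted(
            path
            for path in old_flat.keys() & new_flat.keys()
            if old_flat[path] != new_flat[path]
        ),
    }


# Toy mappings for illustration only; these are NOT the real ChEMBL 27/28 mappings.
old = {"molecule_chembl_id": "keyword", "withdrawn_year": "short"}
new = {
    "molecule_chembl_id": "keyword",
    "withdrawn_year": "short",
    "drug_warnings": {"warning_type": "keyword", "warning_year": "short"},
}
result = diff_schemas(old, new)
print(result)
```

With a report like this, "removed" and "changed" entries are the ones that could break a downstream pipeline, while "added" entries are usually safe to ignore until we want to use them.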

d0choa commented 3 years ago

We are interested in the black box warnings data ChEMBL added in the last release (see their blog post).

Could you please paste the schema here, so @andrewhercules and I can scope the work? It's part of the molecule index and should be under drug warnings (or something similar). We were already using part of this information, but now it's richer.

JarrodBaker commented 3 years ago

@andrewhercules @d0choa

Here is a simplified* representation of the schema:

{
  "_metadata": {
    "activity_count": "long",
    "atc_classifications": {
      "level1": "keyword",
      "level1_description": "keyword",
      "level2": "keyword",
      "level2_description": "keyword",
      "level3": "keyword",
      "level3_description": "keyword",
      "level4": "keyword",
      "level4_description": "keyword",
      "level5": "keyword",
      "who_name": "keyword"
    },
    "compound_generated": {
      "availability_type_label": "keyword",
      "chirality_label": "keyword",
      "image_file": "keyword"
    },
    "compound_records": {
      "compound_key": "keyword",
      "compound_name": "keyword",
      "src_description": "keyword",
      "src_id": "keyword",
      "src_short_name": "keyword"
    },
    "compound_structural_alerts": {
      "alert_count": "long",
      "alerts": {
        "alert": {
          "alert_id": "keyword",
          "alert_name": "keyword",
          "alert_set": { "priority": "integer", "set_name": "keyword" },
          "smarts": "keyword"
        },
        "cpd_str_alert_id": "keyword",
        "molecule_chembl_id": "keyword"
      }
    },
    "disease_name": "keyword",
    "drug": {
      "drug_data": {
        "applicants": "keyword",
        "atc_classification": { "code": "keyword", "description": "keyword" },
        "availability_type": "short",
        "biotherapeutic": {
          "biocomponents": {
            "component_id": "keyword",
            "component_type": "keyword",
            "description": "keyword",
            "organism": "keyword",
            "sequence": "keyword",
            "tax_id": "keyword"
          },
          "description": "keyword",
          "helm_notation": "keyword",
          "molecule_chembl_id": "keyword"
        },
        "black_box": "boolean",
        "chirality": "short",
        "development_phase": "short",
        "drug_type": "short",
        "drug_warnings": {
          "warning_class": "keyword",
          "warning_country": "keyword",
          "warning_description": "keyword",
          "warning_id": "keyword",
          "warning_refs": {
            "ref_id": "keyword",
            "ref_type": "keyword",
            "ref_url": "keyword"
          },
          "warning_type": "keyword",
          "warning_year": "short"
        },
        "first_approval": "short",
        "first_in_class": "boolean",
        "helm_notation": "keyword",
        "indication_class": "keyword",
        "molecule_chembl_id": "keyword",
        "molecule_properties": {
          "alogp": "double",
          "aromatic_rings": "integer",
          "cx_logd": "double",
          "cx_logp": "double",
          "cx_most_apka": "double",
          "cx_most_bpka": "double",
          "full_molformula": "keyword",
          "full_mwt": "double",
          "hba": "integer",
          "hba_lipinski": "integer",
          "hbd": "integer",
          "hbd_lipinski": "integer",
          "heavy_atoms": "integer",
          "molecular_species": "keyword",
          "mw_freebase": "double",
          "mw_monoisotopic": "double",
          "num_lipinski_ro5_violations": "short",
          "num_ro5_violations": "short",
          "psa": "double",
          "qed_weighted": "double",
          "ro3_pass": "keyword",
          "rtb": "integer"
        },
        "molecule_structures": {
          "canonical_smiles": "keyword",
          "molfile": "text",
          "standard_inchi": "keyword",
          "standard_inchi_key": "keyword"
        },
        "molecule_synonyms": {
          "molecule_synonym": "keyword",
          "syn_type": "keyword",
          "synonyms": "keyword"
        },
        "ob_patent": "keyword",
        "oral": "boolean",
        "parenteral": "boolean",
        "prodrug": "boolean",
        "research_codes": "keyword",
        "rule_of_five": "boolean",
        "sc_patent": "keyword",
        "synonyms": "keyword",
        "topical": "boolean",
        "usan_stem": "keyword",
        "usan_stem_definition": "keyword",
        "usan_stem_substem": "keyword",
        "usan_year": "short",
        "withdrawn_class": "keyword",
        "withdrawn_country": "keyword",
        "withdrawn_reason": "keyword",
        "withdrawn_year": "short"
      },
      "is_drug": "boolean"
    },
    "drug_indications": {
      "_metadata": { "all_molecule_chembl_ids": "text" },
      "drugind_id": "keyword",
      "efo_id": "keyword",
      "efo_term": "keyword",
      "indication_refs": {
        "ref_id": "keyword",
        "ref_type": "keyword",
        "ref_url": "keyword"
      },
      "max_phase_for_ind": "short",
      "mesh_heading": "keyword",
      "mesh_id": "keyword",
      "molecule_chembl_id": "keyword",
      "parent_molecule_chembl_id": "keyword"
    },
    "es_completion": "completion",
    "hierarchy": {
      "all_family": {
        "chembl_id": "keyword",
        "inchi": "keyword",
        "inchi_connectivity_layer": "keyword",
        "inchi_key": "keyword"
      },
      "children": {
        "chembl_id": "keyword",
        "sources": {
          "src_description": "keyword",
          "src_id": "short",
          "src_short_name": "keyword"
        },
        "synonyms": {
          "molecule_synonym": "keyword",
          "syn_type": "keyword",
          "synonyms": "keyword"
        }
      },
      "family_inchi_connectivity_layer": "keyword",
      "is_approved_drug": "boolean",
      "is_usan": "boolean",
      "parent": {
        "chembl_id": "keyword",
        "sources": {
          "src_description": "keyword",
          "src_id": "short",
          "src_short_name": "keyword"
        },
        "synonyms": {
          "molecule_synonym": "keyword",
          "syn_type": "keyword",
          "synonyms": "keyword"
        }
      }
    },
    "related_activities": { "count": "integer" },
    "related_assays": { "all_chembl_ids": "text", "count": "integer" },
    "related_cell_lines": { "all_chembl_ids": "text", "count": "integer" },
    "related_documents": { "all_chembl_ids": "text", "count": "integer" },
    "related_targets": {
      "all_chembl_ids": "text",
      "chembl_ids": "object",
      "count": "long"
    },
    "related_tissues": { "all_chembl_ids": "text", "count": "integer" },
    "tags": "keyword",
    "unichem": {
      "id": "keyword",
      "link": "keyword",
      "src_name": "keyword",
      "src_url": "keyword"
    }
  },
  "atc_classifications": "keyword",
  "availability_type": "short",
  "biotherapeutic": {
    "biocomponents": {
      "component_id": "keyword",
      "component_type": "keyword",
      "description": "keyword",
      "organism": "keyword",
      "sequence": "keyword",
      "tax_id": "keyword"
    },
    "description": "keyword",
    "helm_notation": "keyword",
    "molecule_chembl_id": "keyword"
  },
  "black_box_warning": "keyword",
  "chebi_par_id": "keyword",
  "chirality": "integer",
  "cross_references": {
    "xref_id": "keyword",
    "xref_name": "keyword",
    "xref_src": "keyword",
    "xref_src_url": "text",
    "xref_url": "text"
  },
  "dosed_ingredient": "boolean",
  "drug_warnings": {
    "warning_class": "keyword",
    "warning_country": "keyword",
    "warning_description": "keyword",
    "warning_id": "keyword",
    "warning_refs": {
      "ref_id": "keyword",
      "ref_type": "keyword",
      "ref_url": "keyword"
    },
    "warning_type": "keyword",
    "warning_year": "short"
  },
  "first_approval": "short",
  "first_in_class": "integer",
  "helm_notation": "keyword",
  "indication_class": "keyword",
  "inorganic_flag": "short",
  "max_phase": "short",
  "molecule_chembl_id": "keyword",
  "molecule_hierarchy": {
    "molecule_chembl_id": "keyword",
    "parent_chembl_id": "keyword"
  },
  "molecule_properties": {
    "alogp": "double",
    "aromatic_rings": "integer",
    "cx_logd": "double",
    "cx_logp": "double",
    "cx_most_apka": "double",
    "cx_most_bpka": "double",
    "full_molformula": "keyword",
    "full_mwt": "double",
    "hba": "integer",
    "hba_lipinski": "integer",
    "hbd": "integer",
    "hbd_lipinski": "integer",
    "heavy_atoms": "integer",
    "molecular_species": "keyword",
    "mw_freebase": "double",
    "mw_monoisotopic": "double",
    "num_lipinski_ro5_violations": "short",
    "num_ro5_violations": "short",
    "psa": "double",
    "qed_weighted": "double",
    "ro3_pass": "keyword",
    "rtb": "integer"
  },
  "molecule_structures": {
    "canonical_smiles": "keyword",
    "molfile": "text",
    "standard_inchi": "keyword",
    "standard_inchi_key": "keyword"
  },
  "molecule_synonyms": {
    "molecule_synonym": "keyword",
    "syn_type": "keyword",
    "synonyms": "keyword"
  },
  "molecule_type": "keyword",
  "natural_product": "short",
  "oral": "boolean",
  "parenteral": "boolean",
  "polymer_flag": "boolean",
  "pref_name": "keyword",
  "prodrug": "short",
  "structure_type": "keyword",
  "therapeutic_flag": "boolean",
  "topical": "boolean",
  "usan_stem": "keyword",
  "usan_stem_definition": "keyword",
  "usan_substem": "keyword",
  "usan_year": "short",
  "withdrawn_class": "keyword",
  "withdrawn_country": "keyword",
  "withdrawn_flag": "boolean",
  "withdrawn_reason": "keyword",
  "withdrawn_year": "short"
}

If you need to explore any other indices I've uploaded them to gs://ot-team/jarrod/chembl28.
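In case it helps with scoping, below is a rough sketch of how the top-level `drug_warnings` object could be read into typed records. The class and function names (`WarningRef`, `DrugWarning`, `as_list`, `parse_warnings`) and the example document are hypothetical. One assumption is worth flagging: an Elasticsearch mapping does not say whether a field holds a single value or an array, so the sketch accepts both.

```python
from dataclasses import dataclass, field
from typing import Any, List, Mapping, Optional


@dataclass
class WarningRef:
    ref_id: Optional[str] = None
    ref_type: Optional[str] = None
    ref_url: Optional[str] = None


@dataclass
class DrugWarning:
    warning_id: Optional[str] = None
    warning_type: Optional[str] = None
    warning_class: Optional[str] = None
    warning_country: Optional[str] = None
    warning_description: Optional[str] = None
    warning_year: Optional[int] = None
    warning_refs: List[WarningRef] = field(default_factory=list)


def as_list(value: Any) -> List[Any]:
    """Normalise a missing value, a single object, or a list into a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def parse_warnings(molecule: Mapping[str, Any]) -> List[DrugWarning]:
    """Read the drug_warnings object of one molecule document."""
    parsed: List[DrugWarning] = []
    for raw in as_list(molecule.get("drug_warnings")):
        refs = [
            WarningRef(r.get("ref_id"), r.get("ref_type"), r.get("ref_url"))
            for r in as_list(raw.get("warning_refs"))
        ]
        parsed.append(
            DrugWarning(
                warning_id=raw.get("warning_id"),
                warning_type=raw.get("warning_type"),
                warning_class=raw.get("warning_class"),
                warning_country=raw.get("warning_country"),
                warning_description=raw.get("warning_description"),
                warning_year=raw.get("warning_year"),
                warning_refs=refs,
            )
        )
    return parsed


# Hypothetical document: the ID and all values are placeholders, not real ChEMBL data.
example_doc = {
    "molecule_chembl_id": "CHEMBL_EXAMPLE",
    "drug_warnings": [
        {
            "warning_id": "1",
            "warning_type": "Withdrawn",
            "warning_refs": {"ref_id": "ref-1", "ref_type": "placeholder"},
        },
        {"warning_id": "2", "warning_type": "Black Box Warning"},
    ],
}
warnings = parse_warnings(example_doc)
print(warnings)
```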

andrewhercules commented 3 years ago

Thank you @JarrodBaker!

I've done a review of the schema and example data and have some points for @d0choa to consider:

  1. It appears as though the withdrawn information is still available from the /molecule endpoint. It is present in the existing withdrawn_class, withdrawn_country, withdrawn_flag, withdrawn_reason, and withdrawn_year fields. It is also now available in the new drug_warnings object.

For example, withdrawn information for Tegaserod (CHEMBL76370) is available from ChEMBL's /molecule endpoint - see https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL76370.json.

It is also available in the drug_warning endpoint - see https://www.ebi.ac.uk/chembl/api/data/drug_warning/?format=json&limit=500&offset=500.

However, it would be good to confirm with ChEMBL that all withdrawn warning data previously available from the /molecule endpoint is also available in the drug_warnings object.

In theory, this means we could run the existing drug ETL pipeline with ChEMBL 28 and it should not fail, as it would bypass the drug_warnings object and use the pre-existing withdrawn fields available from the /molecule endpoint.

  2. Within the drug_warnings object, ChEMBL refers to both the parent and child molecule. Will we propagate both black box and withdrawn warnings between the parent <-> child molecules?

  3. In the warning_type field, the only values I saw were "Withdrawn" and "Black Box Warning", but it would be good to confirm with ChEMBL as we will want to string-match to know which type of warning to show.

  4. It is possible to have molecules with more than one black box warning and more than one withdrawn warning. For example:

    • CHEMBL651 has multiple black box warnings
    • CHEMBL122 has multiple withdrawn warnings

You can also have molecules with both types of warnings - for example CHEMBL121.

Based on my assessment, I would recommend we adjust the drug ETL pipeline to use the drug_warnings data as it is much richer, containing a reference link and/or PubMed ID. Also, there are separate entries in drug_warnings for different types of blackbox warnings (e.g. cardiotoxicity, neurotoxicity) and each type might have a different reference or year. However, this would mean that we have to iterate through the entire drug_warnings index with each ChEMBL ID as there could be separate entries that we would want to aggregate and include as a list in our drug index.
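As a rough sketch of that aggregation step: instead of scanning the whole index once per ChEMBL ID, the records could be grouped in a single pass. This is an alternative to the per-ID iteration described above, not what the ETL currently does. The function name `aggregate_warnings`, the placeholder IDs, and the sample records are made up, and it assumes each record carries `molecule_chembl_id` and `warning_type`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def aggregate_warnings(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, Dict[str, List[Mapping[str, Any]]]]:
    """Group flat drug_warning records by molecule, then by warning type."""
    grouped: Dict[str, Dict[str, List[Mapping[str, Any]]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for record in records:
        molecule_id = record["molecule_chembl_id"]
        warning_type = record["warning_type"]
        grouped[molecule_id][warning_type].append(record)
    # Convert the nested defaultdicts into plain dicts for the caller.
    return {mol: dict(by_type) for mol, by_type in grouped.items()}


# Placeholder records: IDs and classes are illustrative, not real ChEMBL entries.
records = [
    {"molecule_chembl_id": "CHEMBL_A", "warning_type": "Black Box Warning", "warning_class": "Cardiotoxicity"},
    {"molecule_chembl_id": "CHEMBL_A", "warning_type": "Black Box Warning", "warning_class": "Neurotoxicity"},
    {"molecule_chembl_id": "CHEMBL_A", "warning_type": "Withdrawn", "warning_class": "Cardiotoxicity"},
    {"molecule_chembl_id": "CHEMBL_B", "warning_type": "Withdrawn", "warning_class": "Neurotoxicity"},
]
aggregated = aggregate_warnings(records)
for molecule_id, by_type in aggregated.items():
    print(molecule_id, {warning_type: len(items) for warning_type, items in by_type.items()})
```

The result maps each ChEMBL ID to a list of warnings per type, which covers the three cases above: several black box warnings, several withdrawn warnings, or both on one molecule.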

d0choa commented 3 years ago

Great analysis @andrewhercules. We should go through these points and agree on some actions.

As another point to consider, Fiona confirmed the black box warnings were originally MedDRA terms, so we should be able to map them to their IDs, similarly to the pharmacovigilance pipeline.

cc @ireneisdoomed (to keep you in the loop)

ireneisdoomed commented 3 years ago

I've contacted ChEMBL to ask about all the points raised by @andrewhercules and the potential data bug spotted by @d0choa. I will let you know as soon as they get back to me.

ireneisdoomed commented 3 years ago

Fiona has replied to me with the differences between the drug_warnings and the molecule endpoints:

  1. Although /molecule includes withdrawn information, this is expected to change for ChEMBL 29 (but the high-level summary fields withdrawn_flag and black_box_warning (yes/no flags) will remain within the molecule API endpoint).
  2. The warning information is annotated on a per-salt basis, but /drug_warning aggregates data on the parent molecule. We have to take into account that this endpoint uses a surrogate ID, meaning that the queries differ based on the nature of the molecule. So, for an example like TEGASEROD MALEATE, where this salt is withdrawn:
    • the query to /drug_warning for the salt is built like this: https://www.ebi.ac.uk/chembl/api/data/drug_warning.json?molecule_chembl_id=CHEMBL1516474
    • the query to /drug_warning for the parent is built like this: https://www.ebi.ac.uk/chembl/api/data/drug_warning.json?parent_molecule_chembl_id=CHEMBL76370
    • the equivalent information for the salt call is present in /molecule: https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL1516474.json

In this case, all of them return the same result.

  3. One last point to consider is that they display the information in the same way as they do for MoAs and indications: a parent molecule aggregates the information of its children, grouped per warning_class.
  4. She has confirmed that there are currently only two warning types: withdrawn (manually curated) and Black_box_warning (previously manually curated, plus Fiona's new work). Note that the automated black box warning annotation, which assigns a black_box_warning flag and a toxicity class, only applies to FDA drugs in phase 4. We may have cases where the boxed warning comes from manual curation and will not have a toxicity_class or an FDA reference.
  5. On the potential data issue of two different warning_classes sharing the same warning_refs and warning_description: this is correct and is due to the MedDRA mapping process described in Table 1 of their publication.

So TEGASEROD is a withdrawn drug that has been manually curated with a description covering multiple toxicities: “Risk for heart attack, stroke, and unstable angina”. Each of these terms has been checked against MedDRA to see which high-level categories it falls into. Heart attack and angina are categorised within MedDRA ‘Cardiac disorders’ and stroke is classed within MedDRA ‘Nervous System Disorders’. The high-level MedDRA categories are then called Cardiotoxicity and Neurotoxicity for the purposes of the tox_classification.
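To make the two-step mapping concrete, here is a small sketch of how one description can end up under two warning classes. The lookup tables only contain the TEGASEROD terms mentioned above; the names `TERM_TO_MEDDRA_CATEGORY`, `CATEGORY_TO_TOX_CLASS`, and `tox_classes` are mine, and this is only my reading of the process, not ChEMBL's actual implementation.

```python
from typing import Dict, Iterable, List

# Step 1: curated term -> high-level MedDRA category (TEGASEROD example only).
TERM_TO_MEDDRA_CATEGORY: Dict[str, str] = {
    "heart attack": "Cardiac disorders",
    "unstable angina": "Cardiac disorders",
    "stroke": "Nervous System Disorders",
}

# Step 2: high-level MedDRA category -> toxicity class name.
CATEGORY_TO_TOX_CLASS: Dict[str, str] = {
    "Cardiac disorders": "Cardiotoxicity",
    "Nervous System Disorders": "Neurotoxicity",
}


def tox_classes(terms: Iterable[str]) -> List[str]:
    """Return the distinct toxicity classes for a list of terms, in first-seen order."""
    classes: List[str] = []
    for term in terms:
        category = TERM_TO_MEDDRA_CATEGORY.get(term.lower())
        if category is None:
            continue  # term not covered by this toy lookup
        tox_class = CATEGORY_TO_TOX_CLASS.get(category)
        if tox_class is not None and tox_class not in classes:
            classes.append(tox_class)
    return classes


classes = tox_classes(["heart attack", "stroke", "unstable angina"])
print(classes)  # ['Cardiotoxicity', 'Neurotoxicity']
```

Because each class becomes its own entry, both entries naturally share the same warning_description and warning_refs, which matches what we saw in the data.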

cmalangone commented 3 years ago

ChEMBL 28 is downloaded and available here: gs://open-targets-data-releases/21.04/input/annotation-files