opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Updating project score parser to accommodate new schema #1826

Closed DSuveges closed 2 years ago

DSuveges commented 2 years ago

The disease/target json schema is being updated to capture cell line information better (diseaseCellLines field). To reflect this advancement, the parser of the project score data needs to be updated too.

For details of the json schema change see #1825

DSuveges commented 2 years ago

To get the enriched cell dataset, we first have to create a new cell description file with the following script:

from typing import Optional

import requests
import pandas as pd

def uberon_lookup(label: str) -> Optional[str]:
    '''Retrieve the uberon identifier of a label from OLS, assuming a perfect match.'''

    if not label:
        return None

    label = label.lower()
    url = f'https://www.ebi.ac.uk/ols/api/search?q={label}&queryFields=label&ontology=uberon&exact=true'

    # Parsing:
    try:
        data = requests.get(url).json()
        uberon_id = data['response']['docs'][0]['short_form']
        return uberon_id
    except (IndexError, KeyError):
        # No exact match in the response.
        return None
    except requests.exceptions.RequestException:
        # Network errors raised by requests are not caught by the builtin ConnectionError.
        return None

# Loading cell description from project score:
crispr_cell_description = (
    pd.read_csv('crispr_cell_lines.tsv', sep='\t')
    .rename(columns={
        'Name': 'name',
        'Tissue': 'tissue',
        'Cancer Type': 'diseaseFromSource'
    })
)

print(f'Number of cell lines in the crispr cell lines: {len(crispr_cell_description)}')
print(f'Number of cell lines with no tissue: {len(crispr_cell_description.loc[crispr_cell_description.tissue.isna()])}')

# Extract unique list of tissues and map to uberon:
annotated_tissues = (
    crispr_cell_description
    [['tissue']]
    .drop_duplicates()
    .assign(tissueId = lambda df: df.tissue.apply(uberon_lookup))
)

print(f'Number of unique tissues: {len(annotated_tissues)}')
print(f'Number of tissues with no uberon mapping: {len(annotated_tissues.loc[annotated_tissues.tissueId.isna()])}')

# Fetching cell model data from Sanger:
cell_models = (
    pd.read_csv('https://cog.sanger.ac.uk/cmp/download/model_list_20210719.csv')
    [['model_id', 'model_name']]
    .rename(columns={
        'model_id': 'id',
        'model_name': 'name'
    })
    .drop_duplicates()
)
print(f'Number of cell models: {len(cell_models)}')

# Finalising and saving data:
crispr_cell_description = (
    crispr_cell_description

    # Merging uberon annotation with cell lines:
    .merge(annotated_tissues, on='tissue', how='left')

    # Merging cell line identifiers with cell lines:
    .merge(cell_models, on='name', how='left')

)

print(f'Number of cell lines at the end: {len(crispr_cell_description)}')
print(f'Number of cell lines with no identifier: {len(crispr_cell_description.loc[crispr_cell_description.id.isna()])}')
print(f'Number of cell lines with no uberon: {len(crispr_cell_description.loc[crispr_cell_description.tissueId.isna()])}')

# Saving enriched data file:
(
    crispr_cell_description
    .to_csv('crispr_cell_lines_enriched_2021-10-22.tsv', sep='\t', index=False)
)

What the output shows:

Number of cell lines in the crispr cell lines: 336
Number of cell lines with no tissue: 0
Number of unique tissues: 19
Number of tissues with no uberon mapping: 7
Number of cell models: 2007
Number of cell lines at the end: 336
Number of cell lines with no identifier: 0
Number of cell lines with no uberon: 69

The file is now a TSV with the following header (first rows shown):

name    tissue           diseaseFromSource     tissueId        id
A375    Skin             Melanoma                              SIDM00795
HCT-15  Large Intestine  Colorectal Carcinoma  UBERON_0000059  SIDM00789
HT-29   Large Intestine  Colorectal Carcinoma  UBERON_0000059  SIDM00136
HCC-78  Lung             Lung Adenocarcinoma   UBERON_0002048  SIDM01068
SW620   Large Intestine  Colorectal Carcinoma  UBERON_0000059  SIDM00841
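
The A375 row has an empty tissueId because "Skin" had no exact label match in the lookup. A minimal sketch of how the unmapped tissues could be listed from the enriched file, using only the standard library. The helper name `unmapped_tissue_counts` is made up for illustration, and the sample text is a three-row excerpt of the table above, not the full file:

```python
import csv
import io
from collections import Counter

def unmapped_tissue_counts(tsv_text: str) -> Counter:
    """Count cell lines per tissue for rows where tissueId is empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    return Counter(row['tissue'] for row in reader if not row['tissueId'])

# Excerpt of the enriched file (tab-separated); A375 has no tissueId.
sample = (
    'name\ttissue\tdiseaseFromSource\ttissueId\tid\n'
    'A375\tSkin\tMelanoma\t\tSIDM00795\n'
    'HCT-15\tLarge Intestine\tColorectal Carcinoma\tUBERON_0000059\tSIDM00789\n'
    'HCC-78\tLung\tLung Adenocarcinoma\tUBERON_0002048\tSIDM01068\n'
)

unmapped = unmapped_tissue_counts(sample)
print(unmapped.most_common())
```

Run against the whole file, the keys of the counter would be the tissues that still need a manual mapping.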

The new file has been uploaded to the usual google bucket: gs://otar000-evidence_input/CRISPR/data_files

DSuveges commented 2 years ago

The update is done. The schema changed from:

root
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- targetFromSourceId: string (nullable = true)

To:

root
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- tissue: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
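
A minimal sketch of the reshaping this schema change implies, in plain Python rather than in the parser's own code. It assumes the old string elements were cell line names and that the enriched cell description is available as a name-keyed lookup; `enrich_cell_lines` and `cell_info` are hypothetical names:

```python
from typing import Dict, List, Optional

CellLine = Dict[str, Optional[str]]

def enrich_cell_lines(names: List[str], cell_info: Dict[str, CellLine]) -> List[CellLine]:
    """Turn a list of cell line names into a list of structured records."""
    enriched = []
    for name in names:
        # Unknown cell lines keep their name; the other fields stay null.
        info = cell_info.get(name, {})
        enriched.append({
            'id': info.get('id'),
            'name': name,
            'tissue': info.get('tissue'),
            'tissueId': info.get('tissueId'),
        })
    return enriched

# Lookup built from the enriched cell description (two entries shown).
cell_info = {
    'MCF7': {'id': 'SIDM00148', 'tissue': 'Breast', 'tissueId': 'UBERON_0000310'},
    'T47D': {'id': 'SIDM00097', 'tissue': 'Breast', 'tissueId': 'UBERON_0000310'},
}

old_evidence = {'diseaseCellLines': ['MCF7', 'T47D']}
new_evidence = {
    **old_evidence,
    'diseaseCellLines': enrich_cell_lines(old_evidence['diseaseCellLines'], cell_info),
}
print(new_evidence)
```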

The counts in the updated dataset:

             old (21.09)                new (21.11)
file         crispr-2021-09-07.json.gz  crispr-2021-10-22
evidence     1846                       1846
target       624                        624
disease      19                         19
association  1846                       1846
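
A sketch of how counts like these could be computed from a JSON-lines evidence file. It assumes that "target" and "disease" are distinct values of targetFromSourceId and diseaseFromSourceMappedId, and that an "association" is a distinct target/disease pair; the function name is made up, and the second sample record uses a placeholder disease id:

```python
import json
from typing import Dict, Iterable

def summarise_evidence(lines: Iterable[str]) -> Dict[str, int]:
    """Count evidence strings, unique targets, diseases and target/disease pairs."""
    targets, diseases, associations = set(), set(), set()
    evidence = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        target = record['targetFromSourceId']
        disease = record['diseaseFromSourceMappedId']
        evidence += 1
        targets.add(target)
        diseases.add(disease)
        associations.add((target, disease))
    return {
        'evidence': evidence,
        'target': len(targets),
        'disease': len(diseases),
        'association': len(associations),
    }

sample_lines = [
    '{"targetFromSourceId": "ENSG00000121879", "diseaseFromSourceMappedId": "EFO_0000305"}',
    # Placeholder disease id, for illustration only:
    '{"targetFromSourceId": "ENSG00000121879", "diseaseFromSourceMappedId": "EFO_PLACEHOLDER"}',
]
counts = summarise_evidence(sample_lines)
print(counts)
```

Any iterable of lines works as input, for example a file handle opened in text mode with `gzip.open`.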

The resulting dataset is uploaded to: gs://otar000-evidence_input/CRISPR/json/crispr-2021-10-22

This is what an example evidence looks like:

{
  "targetFromSourceId": "ENSG00000121879",
  "resourceScore": 73.75,
  "diseaseFromSource": "Breast Carcinoma",
  "diseaseCellLines": [
    {
      "name": "OCUB-M",
      "id": "SIDM00241",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1395",
      "id": "SIDM00884",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1143",
      "id": "SIDM00866",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-468",
      "id": "SIDM00628",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "COLO-824",
      "id": "SIDM00954",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-436",
      "id": "SIDM00629",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-415",
      "id": "SIDM00630",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MCF7",
      "id": "SIDM00148",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC70",
      "id": "SIDM00673",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "T47D",
      "id": "SIDM00097",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MFM-223",
      "id": "SIDM00332",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "EVSA-T",
      "id": "SIDM01042",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1954",
      "id": "SIDM00872",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "UACC-893",
      "id": "SIDM01186",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "JIMT-1",
      "id": "SIDM01037",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "Hs-578-T",
      "id": "SIDM00135",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-453",
      "id": "SIDM00272",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "AU565",
      "id": "SIDM00898",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-361",
      "id": "SIDM00528",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-231",
      "id": "SIDM00146",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC38",
      "id": "SIDM00675",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1937",
      "id": "SIDM00874",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1187",
      "id": "SIDM00885",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1806",
      "id": "SIDM00875",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "CAL-51",
      "id": "SIDM00933",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    }
  ],
  "diseaseFromSourceMappedId": "EFO_0000305",
  "datasourceId": "crispr",
  "datatypeId": "affected_pathway"
}

This evidence validates against the most recent schema.
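
As an illustration only, and not the validator actually used, a structural check of the diseaseCellLines field against the new schema could look like the sketch below. The field names come from the schema printed above; the function name and the problem messages are made up:

```python
from typing import Any, Dict, List

# Fields of each diseaseCellLines element in the new schema.
CELL_LINE_FIELDS = ('id', 'name', 'tissue', 'tissueId')

def cell_line_problems(evidence: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in diseaseCellLines (empty if none)."""
    cell_lines = evidence.get('diseaseCellLines')
    if cell_lines is None:
        return []  # the field is nullable
    if not isinstance(cell_lines, list):
        return ['diseaseCellLines is not an array']

    problems = []
    for i, cell_line in enumerate(cell_lines):
        if not isinstance(cell_line, dict):
            problems.append(f'element {i} is not a struct')
            continue
        for key, value in cell_line.items():
            if key not in CELL_LINE_FIELDS:
                problems.append(f'element {i}: unexpected field {key!r}')
            elif value is not None and not isinstance(value, str):
                problems.append(f'element {i}: {key} is not a string')
    return problems

new_style = {'diseaseCellLines': [
    {'name': 'MCF7', 'id': 'SIDM00148', 'tissue': 'Breast', 'tissueId': 'UBERON_0000310'},
]}
old_style = {'diseaseCellLines': ['MCF7']}

print(cell_line_problems(new_style))
print(cell_line_problems(old_style))
```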

ireneisdoomed commented 2 years ago

Regarding the enriched cell dataset: given the low number of tissues to map to UBERON, I propose to manually curate those that weren't mapped programmatically. What do you think, @DSuveges?

tissue                       tissueId        uberonLabel
Skin                         UBERON_0002097  skin of body
Prostate                     UBERON_0002367  prostate gland
Head and Neck                UBERON_0000033  head
Head and Neck                UBERON_0000974  neck
Bone                         UBERON_0002481  bone tissue
Biliary Tract                UBERON_0001173  biliary tree
Haematopoietic and Lymphoid  UBERON_0002390  hematopoietic system
Haematopoietic and Lymphoid  UBERON_0001744  lymphoid tissue
Soft Tissue                  UBERON_0002385  muscle tissue
Soft Tissue                  UBERON_0000043  tendon
Soft Tissue                  UBERON_0000211  ligament
Soft Tissue                  UBERON_0001013  adipose tissue
Soft Tissue                  UBERON_0011824  fibrous connective tissue
Soft Tissue                  UBERON_0002391  lymph
Soft Tissue                  UBERON_0001981  blood vessel
Soft Tissue                  UBERON_0008982  fascia

DSuveges commented 2 years ago

@ireneisdoomed You are right, I'm updating the cell info .tsv file. Will generate a new file with the same name, so no update is required for the snakefile.

ireneisdoomed commented 2 years ago

@DSuveges Great! I hope the multiple UBERONs do not cause an illegible explosion in the diseaseCellLines object...
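
To make the concern concrete: if each curated UBERON term becomes its own diseaseCellLines element, a cell line from a tissue with several terms is repeated once per term. A small sketch under that assumption; how the parser actually handles multiple mappings is not stated in this thread, and `EXAMPLE-HN` is a made-up cell line name:

```python
from typing import Dict, List

# Subset of the manually curated mappings proposed above.
curated: Dict[str, List[str]] = {
    'Skin': ['UBERON_0002097'],
    'Head and Neck': ['UBERON_0000033', 'UBERON_0000974'],
}

def expand_cell_line(cell_line: Dict[str, str], mapping: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Emit one record per curated tissueId for the cell line's tissue."""
    tissue_ids = mapping.get(cell_line['tissue'], [])
    return [{**cell_line, 'tissueId': tissue_id} for tissue_id in tissue_ids]

skin_line = {'name': 'A375', 'id': 'SIDM00795', 'tissue': 'Skin'}
# Hypothetical cell line, used only to show the one-to-many case:
head_neck_line = {'name': 'EXAMPLE-HN', 'id': 'SIDM_EXAMPLE', 'tissue': 'Head and Neck'}

print(len(expand_cell_line(skin_line, curated)))
print(len(expand_cell_line(head_neck_line, curated)))
```

With the eight terms proposed for "Soft Tissue", each soft tissue cell line would appear eight times under this design.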