Updating project score parser to accommodate new schema

DSuveges commented 2 years ago

The disease/target JSON schema is being updated to capture cell line information better (diseaseCellLines field). To reflect this change, the parser of the project score data needs to be updated too.

For details of the JSON schema change see #1825

DSuveges commented 2 years ago

To get the enriched cell dataset, we first have to create a new cell description file following these steps:

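from typing import Optional

import requests
import pandas as pd

def uberon_lookup(label: str) -> Optional[str]:
    '''Retrieving uberon identifier of a label from OLS, assuming perfect match.'''

    if not label:
        return None

    # Passing the query as params lets requests URL-encode labels with spaces:
    url = 'https://www.ebi.ac.uk/ols/api/search'
    params = {
        'q': label.lower(),
        'queryFields': 'label',
        'ontology': 'uberon',
        'exact': 'true',
    }

    # Parsing: take the first hit; any failure is treated as "no mapping".
    # requests raises its own exception classes, not the builtin ConnectionError.
    try:
        data = requests.get(url, params=params).json()
        return data['response']['docs'][0]['short_form']
    except (IndexError, KeyError, ValueError, requests.exceptions.RequestException):
        return None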

# Loading cell description from project score:
crispr_cell_description = (
    pd.read_csv('crispr_cell_lines.tsv', sep='\t')
    .rename(columns={
        'Name': 'name',
        'Tissue': 'tissue',
        'Cancer Type': 'diseaseFromSource'
    })
)

print(f'Number of cell lines in the crispr cell lines: {len(crispr_cell_description)}')
print(f'Number of cell lines with no tissue: {len(crispr_cell_description.loc[crispr_cell_description.tissue.isna()])}')

# Extract unique list of tissues and map to uberon:
annotated_tissues = (
    crispr_cell_description
    [['tissue']]
    .drop_duplicates()
    .assign(tissueId = lambda df: df.tissue.apply(uberon_lookup))
)

print(f'Number of unique tissues: {len(annotated_tissues)}')
print(f'Number of tissues with no uberon mapping: {len(annotated_tissues.loc[annotated_tissues.tissueId.isna()])}')

# Fetching cell model data from Sanger:
cell_models = (
    pd.read_csv('https://cog.sanger.ac.uk/cmp/download/model_list_20210719.csv')
    [['model_id', 'model_name']]
    .rename(columns={
        'model_id': 'id',
        'model_name': 'name'
    })
    .drop_duplicates()
)
print(f'Number of cell models: {len(cell_models)}')

# Finalising and saving data:
crispr_cell_description = (
    crispr_cell_description

    # Merging uberon annotation with cell lines:
    .merge(annotated_tissues, on='tissue', how='left')

    # Merging cell line identifiers with cell lines:
    .merge(cell_models, on='name', how='left')
)

print(f'Number of cell lines at the end: {len(crispr_cell_description)}')
print(f'Number of cell lines with no identifier: {len(crispr_cell_description.loc[crispr_cell_description.id.isna()])}')
print(f'Number of cell lines with no uberon: {len(crispr_cell_description.loc[crispr_cell_description.tissueId.isna()])}')

# Saving enriched data file:
(
    crispr_cell_description
    .to_csv('crispr_cell_lines_enriched_2021-10-22.tsv', sep='\t', index=False)
)

What the output shows:

Number of cell lines in the crispr cell lines: 336
Number of cell lines with no tissue: 0
Number of unique tissues: 19
Number of tissues with no uberon mapping: 7
Number of cell models: 2007
Number of cell lines at the end: 336
Number of cell lines with no identifier: 0
Number of cell lines with no uberon: 69

The file is now a TSV with the following header and first rows:

name	tissue	diseaseFromSource	tissueId	id
A375	Skin	Melanoma		SIDM00795
HCT-15	Large Intestine	Colorectal Carcinoma	UBERON_0000059	SIDM00789
HT-29	Large Intestine	Colorectal Carcinoma	UBERON_0000059	SIDM00136
HCC-78	Lung	Lung Adenocarcinoma	UBERON_0002048	SIDM01068
SW620	Large Intestine	Colorectal Carcinoma	UBERON_0000059	SIDM00841
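
A375 in the rows above has an empty tissueId, so it is one of the cell lines without a UBERON mapping. To see which tissues are affected, something like the following could be run against the enriched file. This is only a sketch: the sample rows are copied from the table above, and `unmapped_tissues` is a hypothetical helper, not part of the parser.

```python
import csv
import io

# Sample rows copied from the enriched file shown above.
SAMPLE_TSV = (
    "name\ttissue\tdiseaseFromSource\ttissueId\tid\n"
    "A375\tSkin\tMelanoma\t\tSIDM00795\n"
    "HCT-15\tLarge Intestine\tColorectal Carcinoma\tUBERON_0000059\tSIDM00789\n"
    "HCC-78\tLung\tLung Adenocarcinoma\tUBERON_0002048\tSIDM01068\n"
)

def unmapped_tissues(tsv_text: str) -> dict:
    """Count cell lines per tissue for rows whose tissueId is empty."""
    counts = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if not row["tissueId"]:
            counts[row["tissue"]] = counts.get(row["tissue"], 0) + 1
    return counts

print(unmapped_tissues(SAMPLE_TSV))
```

For the real file, the text would be read from crispr_cell_lines_enriched_2021-10-22.tsv instead of the inline sample.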

The new file has been uploaded to the usual Google bucket: gs://otar000-evidence_input/CRISPR/data_files

DSuveges commented 2 years ago

The update is done. The schema changed from:

root
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- targetFromSourceId: string (nullable = true)

To:

root
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseCellLines: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- tissue: string (nullable = true)
 |    |    |-- tissueId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- resourceScore: double (nullable = true)
 |-- targetFromSourceId: string (nullable = true)

The counts in the updated dataset:

	old (21.09)	new (21.11)
file	crispr-2021-09-07.json.gz	crispr-2021-10-22
evidence	1846	1846
target	624	624
disease	19	19
association	1846	1846
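
Counts like these could be reproduced from a JSON-lines evidence file along the following lines. The definitions are assumptions, not taken from the parser: evidence is taken to mean records, target and disease to mean distinct identifiers, and association to mean distinct target/disease pairs. The second sample record uses a placeholder target identifier.

```python
import json

def evidence_counts(json_lines):
    """Count records, distinct targets, distinct diseases and distinct pairs."""
    records = [json.loads(line) for line in json_lines if line.strip()]
    targets = {r["targetFromSourceId"] for r in records}
    diseases = {r["diseaseFromSourceMappedId"] for r in records}
    pairs = {(r["targetFromSourceId"], r["diseaseFromSourceMappedId"]) for r in records}
    return {
        "evidence": len(records),
        "target": len(targets),
        "disease": len(diseases),
        "association": len(pairs),
    }

sample_lines = [
    '{"targetFromSourceId": "ENSG00000121879", "diseaseFromSourceMappedId": "EFO_0000305"}',
    # Placeholder target identifier, for illustration only:
    '{"targetFromSourceId": "ENSG_PLACEHOLDER", "diseaseFromSourceMappedId": "EFO_0000305"}',
]
counts = evidence_counts(sample_lines)
print(counts)
```

For a gzipped file, the lines could be read with `gzip.open(path, "rt")` from the standard library.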

The resulting dataset is uploaded to: gs://otar000-evidence_input/CRISPR/json/crispr-2021-10-22

This is what an example evidence looks like:

{
  "targetFromSourceId": "ENSG00000121879",
  "resourceScore": 73.75,
  "diseaseFromSource": "Breast Carcinoma",
  "diseaseCellLines": [
    {
      "name": "OCUB-M",
      "id": "SIDM00241",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1395",
      "id": "SIDM00884",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1143",
      "id": "SIDM00866",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-468",
      "id": "SIDM00628",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "COLO-824",
      "id": "SIDM00954",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-436",
      "id": "SIDM00629",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-415",
      "id": "SIDM00630",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MCF7",
      "id": "SIDM00148",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC70",
      "id": "SIDM00673",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "T47D",
      "id": "SIDM00097",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MFM-223",
      "id": "SIDM00332",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "EVSA-T",
      "id": "SIDM01042",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1954",
      "id": "SIDM00872",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "UACC-893",
      "id": "SIDM01186",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "JIMT-1",
      "id": "SIDM01037",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "Hs-578-T",
      "id": "SIDM00135",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-453",
      "id": "SIDM00272",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "AU565",
      "id": "SIDM00898",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-361",
      "id": "SIDM00528",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "MDA-MB-231",
      "id": "SIDM00146",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC38",
      "id": "SIDM00675",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1937",
      "id": "SIDM00874",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1187",
      "id": "SIDM00885",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "HCC1806",
      "id": "SIDM00875",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    },
    {
      "name": "CAL-51",
      "id": "SIDM00933",
      "tissue": "Breast",
      "tissueId": "UBERON_0000310"
    }
  ],
  "diseaseFromSourceMappedId": "EFO_0000305",
  "datasourceId": "crispr",
  "datatypeId": "affected_pathway"
}

This evidence validates against the most recent schema.

ireneisdoomed commented 2 years ago

Regarding the enriched cell dataset: given the low number of tissues to map to UBERON, I propose to manually curate those which weren't mapped programmatically. What do you think? @DSuveges

tissue	tissueId	label
Skin	UBERON_0002097	skin of body
Prostate	UBERON_0002367	prostate gland
Head and Neck	UBERON_0000033	head
Head and Neck	UBERON_0000974	neck
Bone	UBERON_0002481	bone tissue
Biliary Tract	UBERON_0001173	biliary tree
Haematopoietic and Lymphoid	UBERON_0002390	hematopoietic system
Haematopoietic and Lymphoid	UBERON_0001744	lymphoid tissue
Soft Tissue	UBERON_0002385	muscle tissue
Soft Tissue	UBERON_0000043	tendon
Soft Tissue	UBERON_0000211	ligament
Soft Tissue	UBERON_0001013	adipose tissue
Soft Tissue	UBERON_0011824	fibrous connective tissue
Soft Tissue	UBERON_0002391	lymph
Soft Tissue	UBERON_0001981	blood vessel
Soft Tissue	UBERON_0008982	fascia

DSuveges commented 2 years ago

@ireneisdoomed You are right, I'm updating the cell info .tsv file. Will generate a new file with the same name, so no update is required for the snakefile.

ireneisdoomed commented 2 years ago

@DSuveges Great! I hope the multiple UBERONs do not cause an illegible explosion in the diseaseCellLines object...
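
To get a feel for the size of that effect: if the curated table is joined as-is, each cell line would get one diseaseCellLines entry per UBERON term of its tissue. The sketch below assumes a plain one-to-many join, which may not be how the final file is built. It uses a few rows from the curated table above and a hypothetical cell line record.

```python
from collections import defaultdict

# A few (tissue, tissueId) pairs copied from the curated table above.
CURATED = [
    ("Skin", "UBERON_0002097"),
    ("Head and Neck", "UBERON_0000033"),
    ("Head and Neck", "UBERON_0000974"),
]

def expand_cell_line(cell_line, curated):
    """Return one copy of the cell line per UBERON term mapped to its tissue."""
    ids_by_tissue = defaultdict(list)
    for tissue, tissue_id in curated:
        ids_by_tissue[tissue].append(tissue_id)
    # Tissues without a curated term still yield one entry with a null tissueId.
    tissue_ids = ids_by_tissue.get(cell_line["tissue"]) or [None]
    return [{**cell_line, "tissueId": tissue_id} for tissue_id in tissue_ids]

# Hypothetical cell line record, for illustration only.
example = {"name": "EXAMPLE-1", "id": "SIDM_PLACEHOLDER", "tissue": "Head and Neck"}
expanded = expand_cell_line(example, CURATED)
print(len(expanded))
```

Under this assumption a Soft Tissue cell line would appear eight times, once per term in the table.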

opentargets / issues

Updating project score parser to accommodate new schema #1826