Update TEP in PIS/ETL/API

ireneisdoomed commented 3 years ago

TEPs are no longer manually curated; since 21.02 they are the result of scraping the table at thesgc.org/tep.

Thanks to #1738, I believe we are not picking up the right TEP input file in the ETL: the one in production (gs://open-targets-pre-data-releases/21.08.2/input/annotation-files/tep-2021-08-09.json) is based on the manual curation we used to do, which was surely the source of the duplication mentioned in that issue.

When we ran it back then, the process was to provide a multiline JSON that the FE would ingest directly. It is my fault that I did not communicate the change of input file to the BE.

In any case, that duplication does not exist in the new file. I have changed the output to a JSONL format so that it will be easily readable in the ETL.

Here's an example record:

-RECORD 0-----------------------------------------------------------------------------
 TEP_url     | https://www.thesgc.org/tep/slc12a4slc12a6
 description | Potassium/Chloride Co-transporter 1 and 3 (KCC1/KCC3; SLC12A4/SLC12A6)
 disease     | Sickle cell disease (SCD), Neurological
 gene_id     | ENSG00000124067
 symbol      | SLC12A4
 uniprot_id  | Q9UP95

I've created a bucket under otar001-core to store these annotation files. Here's the latest file: gs://otar001-core/TEPs/data_files/tep_output_2021-09-03.json.gz
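For anyone consuming the file outside Spark, here is a minimal sketch of reading it, assuming it is gzip-compressed JSON Lines (one record per line) and that a local copy is available. The function name `read_tep_records` is hypothetical and not part of PIS or the ETL.

```python
import gzip
import json
from typing import Iterator


def read_tep_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Iterating over `read_tep_records("tep_output_2021-09-03.json.gz")` would then give dictionaries with the keys shown in the example record (`TEP_url`, `description`, `disease`, `gene_id`, `symbol`, `uniprot_id`).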

I want to open the question of making this data more useful, as right now we are only linking out to the resource (e.g. ALAS2). With the dataset we currently produce, we could mimic SGC's page.

ireneisdoomed commented 3 years ago

We've touched on this in today's data team stand up and we've decided that:

we will be changing the schema
we want to expose the fields in the API

Proposed schema

root
 |-- gene_id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- therapeutic_area: string (nullable = true)
 |-- url: string (nullable = true)

Changes to the schema:

Given the TEP list for NUDT7, this is what we currently have:

v18.filter(~col('tep').isNull()).select('id', 'tep').first()

Row(id='ENSG00000140876', tep=Row(name='NUDT7', uri='https://www.thesgc.org/tep/nudt7'))

And this is what it should look like after the new changes are applied:

Row(id='ENSG00000140876', tep=Row(description='Human Peroxisomal Coenzyme A Diphosphatase NUDT7 (NUDT7)', therapeutic_area='Metabolic diseases', url='https://www.thesgc.org/tep/NUDT7'))

We'll leave the FE changes for later depending on our capacity.

JarrodBaker commented 3 years ago

[x] PIS
[x] ETL
[x] API

JarrodBaker commented 3 years ago

Thanks for the new file. I have renamed it to gs://otar001-core/TEPs/data_files/tep-2021-09-03.json.gz to keep the formatting consistent with other entries read by PIS. @ireneisdoomed, if necessary, can you please update your end so that the files that get deposited in that bucket look like:

tep-{date}.json.gz

Instead of

tep_output_{date}.json.gz
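A small sketch of the naming convention, in case it helps on the scraper side. The helper names are hypothetical, and the regular expression for the old name is an assumption based only on the single example above.

```python
import re
from datetime import date

# Assumed shape of the old name, inferred from the one example in this thread.
_OLD_NAME = re.compile(r"^tep_output_(\d{4}-\d{2}-\d{2})\.json\.gz$")


def tep_filename(day: date) -> str:
    """Build the name PIS expects: tep-{date}.json.gz with an ISO date."""
    return f"tep-{day.isoformat()}.json.gz"


def normalise_tep_filename(name: str) -> str:
    """Convert tep_output_{date}.json.gz to tep-{date}.json.gz; leave other names untouched."""
    match = _OLD_NAME.match(name)
    if match is None:
        return name
    return tep_filename(date.fromisoformat(match.group(1)))
```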

JarrodBaker commented 3 years ago

In the Target object schema we have TEP listed as a struct, implying that there should only be one entry per ENSG ID. In the new input file there are multiple entries for a single ENSG ID:

-RECORD 2-----------------------------------------
 TEP_url     | https://www.thesgc.org/tep/KDM3B   
 description | Lysine Demethylase JMJD1B (KDM3B)  
 disease     | Cancer                             
 gene_id     | ENSG00000168356                    
 symbol      | SCN11A                             
 uniprot_id  | null                               
-RECORD 3-----------------------------------------
 TEP_url     | https://www.thesgc.org/tep/KDM4D   
 description | Lysine Demethylase JMJD2D (KDM4D)  
 disease     | Cancer                             
 gene_id     | ENSG00000168356                    
 symbol      | SCN11A                             
 uniprot_id  | null                               
-RECORD 4-----------------------------------------
 TEP_url     | https://www.thesgc.org/tep/PfBDP4  
 description | Plasmodium bormodomain PfBDP4      
 disease     | Malaria                            
 gene_id     | ENSG00000168356                    
 symbol      | SCN11A                             
 uniprot_id  | null

Is this intended, @ireneisdoomed?
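For reference, a minimal sketch of how the repeated IDs can be found without Spark. It uses only the three records above (trimmed to the relevant fields); the function name `teps_per_gene` is hypothetical.

```python
from collections import defaultdict

# The three records shown above, reduced to the fields needed for the check.
records = [
    {"TEP_url": "https://www.thesgc.org/tep/KDM3B", "gene_id": "ENSG00000168356"},
    {"TEP_url": "https://www.thesgc.org/tep/KDM4D", "gene_id": "ENSG00000168356"},
    {"TEP_url": "https://www.thesgc.org/tep/PfBDP4", "gene_id": "ENSG00000168356"},
]


def teps_per_gene(rows):
    """Return {gene_id: [TEP_url, ...]} for gene IDs that have more than one TEP."""
    urls_by_gene = defaultdict(list)
    for row in rows:
        urls_by_gene[row["gene_id"]].append(row["TEP_url"])
    return {gene: urls for gene, urls in urls_by_gene.items() if len(urls) > 1}


repeated = teps_per_gene(records)
```

Any non-empty result means a single struct per ENSG ID cannot hold the data as delivered.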

JarrodBaker commented 3 years ago

@ireneisdoomed: the agreed solution (in consultation with @DSuveges) is to change the Target schema to support multiple TEP entries per ENSG ID.

Depending on FE/BE capacity this might not be included until release 21.12.

mkarmona commented 3 years ago

After a brief conversation about this issue, trying to boil it down to the root of the problem, we came up with a set of actions and a new schema revision. The new schema looks like this:

Raw TEP input schema for the ETL

 TEP_url: String
 disease: Option[Seq[String]]
 description: String                             
 targetFromSource: String

There should be one TEP per gene, so the field symbol is expected to be unique across the dataset. The combination of the fields (symbol, TEP_url) is also unique, although in some cases the same TEP_url is expected for more than one symbol.

After the ETL process, the TEP should look like this:

 TEP_url: String
 disease: Option[Seq[String]]                             
 targetFromSource: String
 description: String
 geneId: String

Actions:

[x] @DSuveges, a sanity check is expected before delivering the file, for example checking that the count of each combined key (symbol, TEP_url) is not greater than 1. If it is, further iterations should occur.
[ ] @JarrodBaker, with the schema predefined, it is a matter of resolving the symbols against our target index to bring the geneId into the dataset. There are a few exceptions to manage here; please discuss further below.
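The sanity check in the first action could be sketched as follows. This is only an illustration: the function name `duplicated_keys` is hypothetical, and the key fields default to the names used in this comment (`symbol`, `TEP_url`), which may need to be swapped for whatever the delivered file actually calls them.

```python
from collections import Counter


def duplicated_keys(rows, key_fields=("symbol", "TEP_url")):
    """Return {key_tuple: count} for every combined key that occurs more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means the file passes; anything else lists the offending keys with their counts, which is what would trigger a further iteration.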

Regarding the exceptions, one that comes to mind is having more than one gene assigned to the same captured symbol. Using the approvedSymbol and also the HGNC-type synonyms to map the symbol coming from the TEP dataset looks sensible. The data comes from a scraping process, so applying some minor whitespace trimming and case-insensitive comparison looks sensible too.
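A rough sketch of that matching, to make the exceptions concrete. Everything about the target index here is an assumption: it is modelled as a list of dicts with `id`, `approvedSymbol` and a flat `symbolSynonyms` list of strings, which is not necessarily how the real index is laid out. The function names and the toy synonym are hypothetical.

```python
from collections import defaultdict


def normalise(label: str) -> str:
    """Trim surrounding whitespace and compare case-insensitively."""
    return label.strip().casefold()


def build_symbol_index(targets):
    """Map each normalised approved symbol or synonym to the set of gene IDs using it."""
    index = defaultdict(set)
    for target in targets:
        labels = [target["approvedSymbol"], *target.get("symbolSynonyms", [])]
        for label in labels:
            index[normalise(label)].add(target["id"])
    return index


def resolve_symbol(symbol: str, index) -> list:
    """Return sorted gene IDs: empty means unmapped, more than one means ambiguous."""
    return sorted(index.get(normalise(symbol), set()))


# Toy target index; IDs are the ones quoted earlier in this thread, the synonym is illustrative.
targets = [
    {"id": "ENSG00000140876", "approvedSymbol": "NUDT7"},
    {"id": "ENSG00000124067", "approvedSymbol": "SLC12A4", "symbolSynonyms": ["KCC1"]},
]
index = build_symbol_index(targets)
```

Returning a list rather than a single ID keeps the "more than one gene for the same symbol" case visible, so it can be handled explicitly instead of being resolved silently.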

ireneisdoomed commented 3 years ago

This dataset now validates against the proposed new schema and can be found in the newly created bucket, gs://otar001-core/TEPs/data_files, with filenames following the structure tep-YYYY-MM-DD.json.gz

@mkarmona We've slightly changed the names of the fields proposed above to make them more consistent with the names in other schemas in the platform. Could you please make the necessary changes to update the following terms?

TEP_url --> url
disease --> therapeuticArea
targetFromSource --> targetFromSourceId
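To illustrate the rename, a minimal sketch that applies the mapping above to a record in the earlier field naming. The function name `rename_fields` is hypothetical, and the sample record is assembled from the NUDT7 values quoted earlier in this thread.

```python
# Old name -> new name, exactly as listed above.
FIELD_RENAMES = {
    "TEP_url": "url",
    "disease": "therapeuticArea",
    "targetFromSource": "targetFromSourceId",
}


def rename_fields(record: dict, renames: dict = FIELD_RENAMES) -> dict:
    """Return a copy of the record with renamed keys; other keys are kept as they are."""
    return {renames.get(key, key): value for key, value in record.items()}


old_record = {
    "TEP_url": "https://www.thesgc.org/tep/NUDT7",
    "disease": ["Metabolic diseases"],
    "description": "Human Peroxisomal Coenzyme A Diphosphatase NUDT7 (NUDT7)",
    "targetFromSource": "NUDT7",
}
new_record = rename_fields(old_record)
```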

mkarmona commented 3 years ago

@ireneisdoomed the strings coming into the field targetFromSourceId are not ids but rather labels. Can we change it back to targetFromSource?

ireneisdoomed commented 3 years ago

I see the rationale, but so far we have been treating target acronyms as an id, even if it is not unique (which I guess should be the difference between a name and an id).

We put the symbols in targetFromSourceId to differentiate them from any other name designating the target (which goes in targetFromSource).

Since this change would have to be propagated to other data models, @mkarmona and I have agreed on leaving the schema as it is.

Update TEP in PIS/ETL/API #1742