Closed ireneisdoomed closed 2 years ago
We've touched on this in today's data team stand up and we've decided that:
root
|-- gene_id: string (nullable = true)
|-- description: string (nullable = true)
|-- therapeutic_area: string (nullable = true)
|-- url: string (nullable = true)
Given the TEP list for NUDT7, this is what we currently have:
v18.filter(~col('tep').isNull()).select('id', 'tep').first()
Row(id='ENSG00000140876', tep=Row(name='NUDT7', uri='https://www.thesgc.org/tep/nudt7'))
And this is how it should look after the new changes are applied:
Row(id='ENSG00000140876', tep=Row(description='Human Peroxisomal Coenzyme A Diphosphatase NUDT7 (NUDT7)', therapeutic_area='Metabolic diseases', url='https://www.thesgc.org/tep/NUDT7'))
We'll leave the FE changes for later depending on our capacity.
Thanks for the new file. I have renamed it to gs://otar001-core/TEPs/data_files/tep-2021-09-03.json.gz to keep the formatting consistent with other entries read by PIS. @ireneisdoomed, if necessary, could you please update your end so that the files deposited in that bucket are named:
tep-{date}.json.gz
Instead of
tep_output_{date}.json.gz
In the Target object schema we have TEP listed as a struct, implying that there should only be one entry per ENSG ID. In the new input file there are multiple entries for a single ENSG ID:
-RECORD 2-----------------------------------------
TEP_url | https://www.thesgc.org/tep/KDM3B
description | Lysine Demethylase JMJD1B (KDM3B)
disease | Cancer
gene_id | ENSG00000168356
symbol | SCN11A
uniprot_id | null
-RECORD 3-----------------------------------------
TEP_url | https://www.thesgc.org/tep/KDM4D
description | Lysine Demethylase JMJD2D (KDM4D)
disease | Cancer
gene_id | ENSG00000168356
symbol | SCN11A
uniprot_id | null
-RECORD 4-----------------------------------------
TEP_url | https://www.thesgc.org/tep/PfBDP4
description | Plasmodium bormodomain PfBDP4
disease | Malaria
gene_id | ENSG00000168356
symbol | SCN11A
uniprot_id | null
Is this intended @ireneisdoomed ?
@ireneisdoomed: the agreed solution (in consultation with @DSuveges ) is to change the Target schema to support multiple TEP entries per ENSG ID.
Depending on FE/BE capacity this might not be included until release 21.12.
After a brief conversation about this issue, in which we tried to boil it down to the root of the problem, we came up with a set of actions and a new schema revision. The new schema looks like this:
Raw TEP input schema for the ETL
TEP_url: String
disease: Option[Seq[String]]
description: String
targetFromSource: String
There should be one TEP per gene, so the field symbol is expected to be unique across the dataset. The combination of the fields (symbol, TEP_url) is also unique, although in some cases the same TEP_url is expected for more than one symbol.
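A minimal sketch of how these uniqueness expectations could be checked before the ETL runs. It uses plain Python rather than Spark, the helper name `find_duplicates` is hypothetical, and the sample records are made up for illustration using the raw input field names above:

```python
from collections import Counter


def find_duplicates(records, fields):
    """Return the key tuples (built from `fields`) that occur more than once."""
    counts = Counter(tuple(rec.get(f) for f in fields) for rec in records)
    # Counter keeps first-seen order, so duplicates are reported in input order.
    return [key for key, n in counts.items() if n > 1]


# Illustrative records only; not taken from the real input file.
records = [
    {"symbol": "NUDT7", "TEP_url": "https://www.thesgc.org/tep/nudt7"},
    {"symbol": "KDM3B", "TEP_url": "https://www.thesgc.org/tep/KDM3B"},
    {"symbol": "KDM3B", "TEP_url": "https://www.thesgc.org/tep/KDM3B"},
]

duplicated_symbols = find_duplicates(records, ["symbol"])
duplicated_pairs = find_duplicates(records, ["symbol", "TEP_url"])
```

An empty list for both checks would mean the file meets the expectations. A TEP_url shared by several symbols is allowed, so it is deliberately not checked on its own.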
After the ETL process, the TEP should look like this:
TEP_url: String
disease: Option[Seq[String]]
targetFromSource: String
description: String
geneId: String
Actions:
(symbol, TEP_url)
is not greater than 1. If that happens, then further iterations should occur. Regarding the exceptions, one that comes to mind is having more than one gene assigned to the same captured symbol. Using the approvedSymbol and the HGNC-type synonyms to map to the symbol coming from the TEP dataset looks sensible. The data comes from a scraping process, so applying some minor whitespace trimming and case-insensitive comparison looks sensible too.
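The mapping idea could be sketched roughly as follows. This is plain Python, not the actual ETL code. The target record layout (`id`, `approvedSymbol`, `synonyms`) and all function names are assumptions, and the two targets sharing the synonym `ABC1` are invented to show the ambiguous case:

```python
def normalise(label):
    """Trim surrounding whitespace and fold case for comparison."""
    return label.strip().casefold()


def build_symbol_index(targets):
    """Map each normalised approved symbol or synonym to the gene IDs using it."""
    index = {}
    for target in targets:
        labels = [target["approvedSymbol"], *target.get("synonyms", [])]
        for label in labels:
            index.setdefault(normalise(label), set()).add(target["id"])
    return index


def map_symbol(symbol, index):
    """Return the gene ID for a scraped symbol, or None if unmapped or ambiguous."""
    gene_ids = index.get(normalise(symbol), set())
    return next(iter(gene_ids)) if len(gene_ids) == 1 else None


# Hypothetical target records; only the NUDT7 pairing comes from this thread.
targets = [
    {"id": "ENSG00000140876", "approvedSymbol": "NUDT7", "synonyms": []},
    {"id": "ENSG_EXAMPLE_A", "approvedSymbol": "GENEA", "synonyms": ["ABC1"]},
    {"id": "ENSG_EXAMPLE_B", "approvedSymbol": "GENEB", "synonyms": ["ABC1"]},
]
index = build_symbol_index(targets)
```

Returning `None` for a symbol that resolves to more than one gene keeps the exception visible, so it can be reviewed rather than silently assigned to one of the genes.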
This dataset now validates with the proposed new schema and can be found in the newly created bucket:
gs://otar001-core/TEPs/data_files
with the filename following the structure: tep-YYYY-MM-DD.json.gz
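For whoever deposits files in that bucket, a small helper along these lines could build and check names against the agreed pattern. It is only a sketch; the function names are hypothetical and nothing here touches the bucket itself:

```python
import re
from datetime import date

# Pattern for the agreed naming convention: tep-YYYY-MM-DD.json.gz
TEP_FILENAME = re.compile(r"^tep-(\d{4})-(\d{2})-(\d{2})\.json\.gz$")


def tep_filename(day):
    """Build the file name for a given date."""
    return f"tep-{day.isoformat()}.json.gz"


def is_valid_tep_filename(name):
    """Check that the name matches the pattern and contains a real calendar date."""
    match = TEP_FILENAME.match(name)
    if not match:
        return False
    try:
        date(*map(int, match.groups()))
    except ValueError:
        return False
    return True
```

With this, `tep_filename(date(2021, 9, 3))` gives `tep-2021-09-03.json.gz`, while the older `tep_output_2021-09-03.json.gz` style is rejected by `is_valid_tep_filename`.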
@mkarmona We've slightly changed the names of the fields proposed above to make them more consistent with the names used in other schemas in the platform. Could you please make the necessary changes to update the following terms?
TEP_url --> url
disease --> therapeuticArea
targetFromSource --> targetFromSourceId
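To make the renaming concrete, here is a minimal sketch of the mapping applied to a single record. It is plain Python, not the ETL implementation; `rename_fields` is a hypothetical helper, and the sample record reuses the NUDT7 values quoted earlier in this thread:

```python
# Old field name -> new field name, as listed above.
FIELD_RENAMES = {
    "TEP_url": "url",
    "disease": "therapeuticArea",
    "targetFromSource": "targetFromSourceId",
}


def rename_fields(record):
    """Return a copy of the record with the renamed keys; other keys are kept."""
    return {FIELD_RENAMES.get(key, key): value for key, value in record.items()}


raw = {
    "TEP_url": "https://www.thesgc.org/tep/NUDT7",
    "disease": ["Metabolic diseases"],
    "description": "Human Peroxisomal Coenzyme A Diphosphatase NUDT7 (NUDT7)",
    "targetFromSource": "NUDT7",
}
renamed = rename_fields(raw)
```

Fields that are not in the mapping, such as `description`, pass through unchanged.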
@ireneisdoomed the strings coming into the field targetFromSourceId are not ids but rather labels. Can we change it back to targetFromSource?
I see the rationale, but so far we are treating target acronyms as ids, even if they are not unique (which I guess should be the difference between a name and an id).
We put the symbols in targetFromSourceId to differentiate them from any other name designating the target (which goes in targetFromSource).
Since this change would have to be propagated to other data models, @mkarmona and I have agreed on leaving the schema as it is.
TEPs are no longer manually curated. Since 21.02 they are the result of scraping the table in thesgc.org/tep.
Thanks to #1738, I believe we are not picking up the right TEP input file in the ETL, and the one in production (gs://open-targets-pre-data-releases/21.08.2/input/annotation-files/tep-2021-08-09.json) is based on the manual curation we used to do (which was surely the source of the duplication mentioned in the issue). When we ran it back then, the process was to provide a multiline JSON that the FE would ingest directly. It is my fault that I did not communicate the change of input file to BE.
In any case, that duplication does not exist in the new file. I have changed the output to a JSONL format so that it will be easily readable in the ETL.
Here's an example record:
I've created a bucket under otar001-core to store these annotation files. Here's the latest file: gs://otar001-core/TEPs/data_files/tep_output_2021-09-03.json.gz
I want to open the question of making this data more useful, as right now we are only linking out to the resource (e.g. ALAS2). In the dataset we currently produce, we could mimic SGC's page: