opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Column names included as data row in HPO Phenotypes dataset #3035

Open mbdebian opened 11 months ago

mbdebian commented 11 months ago

Describe the bug
The HPO Phenotypes dataset that PIS collects from here contains a first data row holding the column names instead of data.

Observed behaviour
PIS collects the HPO Phenotype ontology data from here, converts it into JSON format, and filters it using jq to produce the final output file in the ontology-inputs folder.

This file is in JSON lines format, and its schema looks like this:

root
 |-- HPOId: string (nullable = true)
 |-- aspect: string (nullable = true)
 |-- biocuration: string (nullable = true)
 |-- databaseId: string (nullable = true)
 |-- diseaseName: string (nullable = true)
 |-- evidenceType: string (nullable = true)
 |-- frequency: string (nullable = true)
 |-- modifiers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- onset: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- qualifier: string (nullable = true)
 |-- references: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- resource: string (nullable = true)
 |-- sex: string (nullable = true)

Looking at the content of the file, we can see that the first data row has been contaminated with the column/attribute names of the source data:

@ diseasehpo.show 
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+
|     HPOId|aspect|         biocuration| databaseId|         diseaseName|evidenceType|frequency| modifiers|  onset|qualifier|     references|resource| sex|
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+
|    hpo_id|aspect|         biocuration|database_id|        disease_name|    evidence|frequency|[modifier]|[onset]|qualifier|    [reference]|     HPO| sex|
|HP:0011097|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002187|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/1|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001518|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0032792|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0011451|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0010851|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001789|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0200134|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001522|     C|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0000006|     I|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|     null|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002643|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002378|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0003324|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0002153|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0003394|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0001878|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|      NOT| [PMID:2766660]|     HPO|null|
|HP:0003768|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0000006|     I|HPO:skoehler[2017...|OMIM:609153|Pseudohyperkalemi...|         TAS|     null|      null|   null|     null|  [OMIM:609153]|     HPO|null|
|HP:0003621|     C| HPO:iea[2009-02-17]|OMIM:224550|Dystonia with rin...|         IEA|     null|      null|   null|     null|  [OMIM:224550]|     HPO|null|
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+

Expected behaviour
The first row in the dataset, which holds the column names in a different format, is not a data row and should not be part of the dataset.

To Reproduce
Running the PIS disease step will produce this file in the ontology-inputs folder.

Additional context
Although this doesn't seem to affect the ETL logic, since the content of that row is unlikely to pair with anything else, this data bug should still be fixed.
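For anyone consuming the current output in the meantime, a minimal sketch of a downstream workaround could look like the following. It is not the PIS fix itself: the function name `drop_header_records` is hypothetical, and the only thing it relies on is what the table above shows, namely that the contaminated record has the literal value `hpo_id` in its `HPOId` field.

```python
import io
import json
from typing import Dict, Iterable, Iterator

# Literal value seen in the HPOId field of the contaminated first row.
HEADER_MARKER = "hpo_id"


def drop_header_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield parsed JSON-lines records, skipping the header-like record.

    Hypothetical helper: assumes one JSON object per line and that a
    record is a stray header when its HPOId equals "hpo_id".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get("HPOId") == HEADER_MARKER:
            continue  # column names leaked in as data
        yield record


# Two records reduced to a couple of fields from the table above.
sample = io.StringIO(
    '{"HPOId": "hpo_id", "databaseId": "database_id"}\n'
    '{"HPOId": "HP:0011097", "databaseId": "OMIM:619340"}\n'
)
clean = list(drop_header_records(sample))
```

With this sample, `clean` keeps only the `HP:0011097` record.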

jdhayhurst commented 10 months ago

This appears to be a requested feature here after they had recently added the '#', but I can't see when it was removed again.

Here are a few options to handle it:
1) skip the first line (i.e. assume there is a header row and ignore it)
2) skip lines starting with "database_id"
3) refactor to make use of the header row, i.e. get the column labels from the header rather than use the column indices

@mbdebian do you have a preference or an alternative? If not, I think (2) is the best approach: it still works if they remove the header row or add the '#' (unlike 1 and 3), and it's unlikely that we'd ever want to keep a row that starts with "database_id".
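As a rough illustration of option (2), here is a minimal sketch in Python. The function name `iter_data_lines` is hypothetical, and it assumes the upstream file is line-oriented text in which comment lines start with '#'. The three-column sample is made up for the example and does not reflect the real column layout.

```python
import io
from typing import Iterable, Iterator

HEADER_PREFIX = "database_id"  # first label of the header row
COMMENT_PREFIX = "#"           # assumption: comment lines start with '#'


def iter_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield data lines only, dropping blanks, comments and the header row.

    Works whether the header is present, absent, or prefixed with '#',
    because a '#'-prefixed header is already dropped as a comment.
    """
    for line in lines:
        if not line.strip():
            continue  # blank line
        if line.startswith(COMMENT_PREFIX):
            continue  # comment, or a header prefixed with '#'
        if line.startswith(HEADER_PREFIX):
            continue  # bare header row
        yield line.rstrip("\n")


# Made-up sample: a comment, a bare header row, and two data rows.
sample = io.StringIO(
    "#description: example\n"
    "database_id\tdisease_name\thpo_id\n"
    "OMIM:619340\tdisease A\tHP:0011097\n"
    "\n"
    "OMIM:609153\tdisease B\tHP:0002378\n"
)
rows = list(iter_data_lines(sample))
```

Only the two `OMIM:` rows end up in `rows`, and the same holds if the header line is removed from the sample or written as `#database_id...`.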