Column names included as data row in HPO Phenotypes dataset

Describe the bug: The HPO Phenotypes dataset that PIS collects from here contains a first data row holding column names instead of data.

Observed behaviour: PIS collects the HPO Phenotype ontology data from here, converts it to JSON, and filters it with jq to produce the final output file in the ontology-inputs folder.

This file is in JSON Lines format, and its schema looks like this:

root
 |-- HPOId: string (nullable = true)
 |-- aspect: string (nullable = true)
 |-- biocuration: string (nullable = true)
 |-- databaseId: string (nullable = true)
 |-- diseaseName: string (nullable = true)
 |-- evidenceType: string (nullable = true)
 |-- frequency: string (nullable = true)
 |-- modifiers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- onset: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- qualifier: string (nullable = true)
 |-- references: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- resource: string (nullable = true)
 |-- sex: string (nullable = true)

Looking at the content of the file, we can see that the first data row has been contaminated with the column/attribute names of the source data:

@ diseasehpo.show 
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+
|     HPOId|aspect|         biocuration| databaseId|         diseaseName|evidenceType|frequency| modifiers|  onset|qualifier|     references|resource| sex|
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+
|    hpo_id|aspect|         biocuration|database_id|        disease_name|    evidence|frequency|[modifier]|[onset]|qualifier|    [reference]|     HPO| sex|
|HP:0011097|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002187|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/1|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001518|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0032792|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0011451|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0010851|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001789|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0200134|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0001522|     C|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      1/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0000006|     I|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|     null|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002643|     P|HPO:probinson[202...|OMIM:619340|Developmental and...|         PCS|      2/2|      null|   null|     null|[PMID:31675180]|     HPO|null|
|HP:0002378|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0003324|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0002153|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0003394|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0001878|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|      NOT| [PMID:2766660]|     HPO|null|
|HP:0003768|     P|HPO:lccarmody[201...|OMIM:609153|Pseudohyperkalemi...|         PCS|     null|      null|   null|     null| [PMID:2766660]|     HPO|null|
|HP:0000006|     I|HPO:skoehler[2017...|OMIM:609153|Pseudohyperkalemi...|         TAS|     null|      null|   null|     null|  [OMIM:609153]|     HPO|null|
|HP:0003621|     C| HPO:iea[2009-02-17]|OMIM:224550|Dystonia with rin...|         IEA|     null|      null|   null|     null|  [OMIM:224550]|     HPO|null|
+----------+------+--------------------+-----------+--------------------+------------+---------+----------+-------+---------+---------------+--------+----+
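The contamination can also be confirmed without Spark by scanning the JSON Lines file for records whose HPOId is not a well-formed HPO term identifier. The following is a minimal sketch, assuming term identifiers follow the `HP:` plus seven digits pattern seen in the rows above. The function name `find_malformed_rows` and the inline sample records (fields trimmed for brevity) are hypothetical and not part of PIS.

```python
import json
import re

# Assumption: valid HPO term IDs look like "HP:0011097" (prefix plus seven digits),
# as in the rows shown above.
HPO_ID_PATTERN = re.compile(r"^HP:\d{7}$")


def find_malformed_rows(lines):
    """Return (line_number, record) pairs whose HPOId is not a valid HPO term ID."""
    malformed = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        hpo_id = record.get("HPOId")
        if not isinstance(hpo_id, str) or not HPO_ID_PATTERN.match(hpo_id):
            malformed.append((line_number, record))
    return malformed


# Hypothetical sample mirroring the first two rows of the output above.
sample_lines = [
    '{"HPOId": "hpo_id", "aspect": "aspect", "databaseId": "database_id", "resource": "HPO"}',
    '{"HPOId": "HP:0011097", "aspect": "P", "databaseId": "OMIM:619340", "resource": "HPO"}',
]

for line_number, record in find_malformed_rows(sample_lines):
    print(f"line {line_number}: unexpected HPOId {record.get('HPOId')!r}")
```

Against the real file, the same function could be fed an open file handle, since it only iterates over lines.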

Expected behaviour: The first row in the dataset, which holds the column names in a different format, is not a data row and should not be part of the dataset.
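One possible shape for a fix is to drop the header record before the output file is written. This is only a sketch, assuming the stray record can be recognised by its HPOId value `hpo_id`, as in the output above. The function `drop_header_records` is hypothetical, and where the actual fix belongs in PIS (the jq filter or the upstream conversion) is not decided here.

```python
import json
from typing import Iterable, Iterator

# Assumption: the stray header record carries the raw column name in HPOId,
# as seen in the first row of the output above.
HEADER_HPO_ID = "hpo_id"


def drop_header_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield JSON lines unchanged, skipping blank lines and header records."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if json.loads(stripped).get("HPOId") == HEADER_HPO_ID:
            continue  # column names, not data
        yield stripped


# Hypothetical two-record input: the header row followed by one data row.
raw = [
    '{"HPOId": "hpo_id", "databaseId": "database_id"}',
    '{"HPOId": "HP:0002187", "databaseId": "OMIM:619340"}',
]
cleaned = list(drop_header_records(raw))
print(cleaned)
```

Because the function is a generator, it can stream a large file line by line without loading it into memory.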

To Reproduce: Running the PIS disease step produces this file in the ontology-inputs folder.

Additional context: Although this doesn't seem to affect the ETL logic, as the content of that row is unlikely to match anything else, this data bug should still be fixed.

Column names included as data row in HPO Phenotypes dataset #3035