Open mbdebian opened 11 months ago
This appears to be a requested feature here, after they had recently added the '#' to the header row, but I can't see when the '#' was removed again.
Here are a few options to handle it:
1) skip the first line (i.e. assume there is a header row and ignore it)
2) skip lines starting with "database_id"
3) refactor to make use of the header row, i.e. get the column labels from the header rather than use the column indices
@mbdebian do you have a preference/alternative? If not, I think (2) is the best approach: it still works if they remove the header row or add the '#' back (unlike 1 and 3), and it's unlikely that we'd ever want to keep a row that starts with "database_id".
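For reference, option (2) could look roughly like the sketch below. This is only an illustration, not the existing PIS code: the function name `iter_data_lines` and the two constants are hypothetical, and it assumes the input is read line by line as text. Skipping '#' lines and blank lines is an extra assumption added here so the same filter also copes with a commented header.

```python
from typing import Iterable, Iterator

# Hypothetical constants for this sketch (not taken from PIS).
HEADER_PREFIX = "database_id"  # first column label of the header row
COMMENT_PREFIX = "#"           # assumed marker for comment/metadata lines


def iter_data_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only data rows, dropping blank lines, comments and the header row."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(COMMENT_PREFIX):
            continue  # comment line, including a '#'-prefixed header
        if stripped.startswith(HEADER_PREFIX):
            continue  # un-commented header row
        yield line.rstrip("\n")
```

Because the check is on the line prefix rather than the line position, the filter behaves the same whether the header is present, absent, or commented out.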
Describe the bug
The HPO Phenotypes dataset collected by PIS from here contains a first data row with column names instead of data.
Observed behaviour
PIS collects HPO Phenotype ontology data from here, then converts the ontology into JSON format and filters it using JQ to produce the final output file within the ontology-inputs folder.
This file is in JSON lines format, and its schema looks like this
When looking at the content of the file, we can see that the first data row has been contaminated with the column/attribute names of the data schema:
Expected behaviour
That first row in the dataset, which holds the column names in a different format, is not a data row and should not be part of the dataset.
To Reproduce
Running PIS for the disease step will produce this file in the ontology-inputs folder.
Additional context
Although it doesn't seem to affect the ETL logic, as the content of that row is unlikely to match anything else, this data bug should be fixed.