opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Molecule dataset for release 24.06 has multiple Unknown values: inconsistent casing #3354

Closed: dhimmel closed this issue 1 month ago

dhimmel commented 3 months ago

@related-sciences appreciates all the great work by the OT team and noticed something small when upgrading to 24.06.

Running the following query on BigQuery, which is currently based on the 24.06 release:

SELECT
  drugType,
  COUNT(drugType) AS drugTypeCount,
FROM
  `open-targets-prod.platform.molecule`
GROUP BY drugType
ORDER BY drugType

Produces the following table of drugType counts:

drugType                 drugTypeCount
Antibody                 963
Antibody drug conjugate  119
Cell                     52
Enzyme                   91
Gene                     117
Oligonucleotide          159
Oligosaccharide          52
Protein                  741
Small molecule           14854
Unknown                  870
unknown                  23

Notice the mixed casing of the "Unknown" / "unknown" value. The issue also exists in 24.03, whereas earlier releases consistently used only "Unknown".
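For anyone wanting to check other columns for the same problem, here is a minimal Python sketch of a case-collision check. The counts are copied from the table above; the helper name find_case_collisions is hypothetical and not part of any Open Targets tooling.

```python
from collections import defaultdict

# drugType counts copied from the 24.06 query result above.
drug_type_counts = {
    "Antibody": 963,
    "Antibody drug conjugate": 119,
    "Cell": 52,
    "Enzyme": 91,
    "Gene": 117,
    "Oligonucleotide": 159,
    "Oligosaccharide": 52,
    "Protein": 741,
    "Small molecule": 14854,
    "Unknown": 870,
    "unknown": 23,
}


def find_case_collisions(values):
    """Return groups of values that differ only by letter case.

    Hypothetical helper: keys are the case-folded form, values are the
    sorted list of distinct spellings that share that form.
    """
    groups = defaultdict(list)
    for value in values:
        groups[value.casefold()].append(value)
    return {
        folded: sorted(spellings)
        for folded, spellings in groups.items()
        if len(spellings) > 1
    }


collisions = find_case_collisions(drug_type_counts)
print(collisions)
```

With the counts above, the only group reported is the "Unknown" / "unknown" pair.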

Versioned GCS path: gs://open-targets-data-releases/24.06/output/etl/parquet/molecule

There is a narrow fix (normalizing the casing of this one value) and possibly a broader fix: selecting values from an enum or applying a schema, which would prevent this kind of issue from occurring at all.
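To illustrate the broader fix, here is a rough Python sketch of enum-based validation. It is not how the Open Targets pipeline is implemented; the DrugType enum and parse_drug_type function are hypothetical names, and the allowed values are assumed to be exactly the drugType values listed in this issue.

```python
from enum import Enum


class DrugType(Enum):
    """Hypothetical enum; members mirror the drugType values in this issue."""

    ANTIBODY = "Antibody"
    ANTIBODY_DRUG_CONJUGATE = "Antibody drug conjugate"
    CELL = "Cell"
    ENZYME = "Enzyme"
    GENE = "Gene"
    OLIGONUCLEOTIDE = "Oligonucleotide"
    OLIGOSACCHARIDE = "Oligosaccharide"
    PROTEIN = "Protein"
    SMALL_MOLECULE = "Small molecule"
    UNKNOWN = "Unknown"


# Lookup from case-folded spelling to the canonical enum member.
_BY_FOLDED = {member.value.casefold(): member for member in DrugType}


def parse_drug_type(raw: str) -> DrugType:
    """Map a raw string to a DrugType, ignoring case and outer whitespace.

    Raises ValueError for anything outside the enum, so a new or misspelled
    value fails loudly instead of silently becoming a new category.
    """
    try:
        return _BY_FOLDED[raw.strip().casefold()]
    except KeyError:
        raise ValueError(f"unexpected drugType: {raw!r}") from None


# Both spellings collapse to one canonical value.
print(parse_drug_type("unknown").value)
print(parse_drug_type("Unknown").value)
```

The narrow fix corresponds to the case-insensitive lookup alone; raising on unrecognized values is the schema-like part that would catch similar problems in future.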

prashantuniyal02 commented 1 month ago

Hi @dhimmel, this issue has been resolved. The changes will be released in the upcoming 24.09 platform release, with the following drugType values:

drugType
Antibody
Antibody drug conjugate
Cell
Enzyme
Gene
Oligonucleotide
Oligosaccharide
Protein
Small molecule
Unknown