opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Targets on FTP are sometimes annotated with fewer GO terms than on the production Platform #3177

Status: Closed (fedde-s closed this issue 9 months ago)

fedde-s commented 9 months ago

Describe the bug

The arrays of Gene Ontology terms I found in the "go" field of https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.12/output/etl/json/targets/part-*.json are sometimes shorter than the ones in the Gene Ontology section of the corresponding targets' profile pages on https://platform.opentargets.org/. Is one of the two perhaps out of date?

Observed behaviour

One example I ran into: the GO section on the NOD2 profile page lists 204 associated terms, whereas the JSON files list only 200 for that Ensembl ID.

Expected behaviour

I expected the data on the FTP server to match the data in the indices used to respond to the GraphQL query for this section.
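For anyone who wants to check the API side without opening the page, here is a minimal sketch of counting the GO annotations returned by the GraphQL API. The endpoint URL, the query shape, and the field names (`target`, `geneOntology`, `term`) are assumptions that are not stated in this issue and should be checked against the API schema; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: public GraphQL endpoint of the Platform (not given in this issue).
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Assumption: query shape and field names; verify them against the API schema.
QUERY = """
query GoTerms($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    id
    geneOntology {
      term { id }
    }
  }
}
"""


def build_payload(ensembl_id):
    """Build the JSON body for the GraphQL POST request."""
    return {"query": QUERY, "variables": {"ensemblId": ensembl_id}}


def count_go_annotations(response):
    """Count GO annotations in a decoded GraphQL response (0 if none)."""
    target = (response.get("data") or {}).get("target") or {}
    return len(target.get("geneOntology") or [])


def fetch_go_count(ensembl_id, url=API_URL):
    """POST the query and return the number of GO annotations for the target."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(ensembl_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return count_go_annotations(json.load(reply))
```

Calling `fetch_go_count("ENSG00000167207")` needs network access. If the assumptions hold, the result can be compared with the number shown on the profile page and with the length of the "go" array in the FTP files.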

To Reproduce

Steps to reproduce the behaviour:

  1. Go to the NOD2 profile page mentioned above.
  2. Scroll down to the Gene Ontology section.
  3. See the number 204 in the bottom-right corner of the table.
  4. Download the targets folder using a command listed on the downloads page, or with another FTP client.
  5. Run grep -Fh '"id":"ENSG00000167207"' targets/part-*.json | jq -c '.go[]' | wc -l, or use the folder along with other folders such as go/ to host another installation of the Platform.
  6. See the number 200 instead.
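If grep and jq are not available, the count in step 5 can be sketched in plain Python. This assumes the part files hold one JSON object per line with top-level "id" and "go" fields, as the command above implies; the function name is hypothetical.

```python
import glob
import json
import os


def count_go_terms_in_files(targets_dir, ensembl_id):
    """Return the length of the "go" array for one target, or None if not found."""
    for path in sorted(glob.glob(os.path.join(targets_dir, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Cheap substring check first, so most lines are never parsed.
                if ensembl_id not in line:
                    continue
                record = json.loads(line)
                # Only accept a match on the top-level target ID.
                if record.get("id") == ensembl_id:
                    return len(record.get("go") or [])
    return None
```

For example, `count_go_terms_in_files("targets", "ENSG00000167207")` should agree with the output of the shell pipeline in step 5 when run on the same download.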

Additional context

NOD2 was not the only gene I found to be affected.

DSuveges commented 9 months ago

Thanks for reporting this! Unfortunately, I could reproduce this issue. The GO term counts are consistent across the data exports:

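# `f` and `spark` were implicit in the original snippets; defined here so they run standalone.
from pyspark.sql import SparkSession, functions as f
spark = SparkSession.builder.getOrCreate()

(
    spark.read.parquet('gs://open-targets-pre-data-releases/23.12/output/etl/parquet/targets')
    .filter(f.col('id') == 'ENSG00000167207')
    .select(f.size('go')).show()
) # 200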

(
    spark.read.json('gs://open-targets-pre-data-releases/23.12/output/etl/json/targets')
    .filter(f.col('id') == 'ENSG00000167207')
    .select(f.size('go')).show()
) # 200

Interestingly:

(
    spark.read.parquet('gs://open-targets-pre-data-releases/23.09/output/etl/parquet/targets')
    .filter(f.col('id') == 'ENSG00000167207')
    .select(f.size('go')).show()
) # 204

We'll need to investigate why this could happen.
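One possible starting point for that investigation is to list which GO term IDs differ between the two releases. This is a minimal sketch in plain Python that assumes each entry of the "go" array is an object with an "id" field holding the GO term ID; the function names are hypothetical, and the entries would first have to be collected from the 23.09 and 23.12 exports.

```python
def go_term_ids(go_entries):
    """Collect the distinct GO term IDs from a list of "go" entries."""
    return {entry["id"] for entry in go_entries}


def diff_go_terms(old_entries, new_entries):
    """Report term IDs present in only one of the two releases."""
    old_ids = go_term_ids(old_entries)
    new_ids = go_term_ids(new_entries)
    return {
        "removed": sorted(old_ids - new_ids),  # in the old release only
        "added": sorted(new_ids - old_ids),    # in the new release only
    }
```

If a term ID can appear in several entries (for instance with different evidence), the set difference may be smaller than the difference in array lengths, so both numbers are worth looking at.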

DSuveges commented 9 months ago

@jdhayhurst has refreshed the API cache and now the returned data is consistent with the files.