Closed fedde-s closed 9 months ago
Thanks for reporting this! Unfortunately I could reproduce this issue: the GO term counts are consistent across the 23.12 data exports (200 in both Parquet and JSON):
# Assumes an active SparkSession named `spark`.
from pyspark.sql import functions as f

(
spark.read.parquet('gs://open-targets-pre-data-releases/23.12/output/etl/parquet/targets')
.filter(f.col('id') == 'ENSG00000167207')
.select(f.size('go')).show()
) # 200
(
spark.read.json('gs://open-targets-pre-data-releases/23.12/output/etl/json/targets')
.filter(f.col('id') == 'ENSG00000167207')
.select(f.size('go')).show()
) # 200
Interestingly, the previous release (23.09) has more:
(
spark.read.parquet('gs://open-targets-pre-data-releases/23.09/output/etl/parquet/targets')
.filter(f.col('id') == 'ENSG00000167207')
.select(f.size('go')).show()
) # 204
We'll need to investigate why this could happen.
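One way to narrow this down is to list exactly which `go` entries differ between the two releases for a given target. The following is only a minimal sketch, not part of the ETL: it assumes the JSON exports are newline-delimited records with top-level `id` and `go` fields (as the reproduction command in this issue suggests), and the helper names `go_entries` and `diff_go` are hypothetical.

```python
import json
from collections import Counter


def go_entries(lines, target_id):
    """Return the `go` array of the first record whose `id` matches, else []."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("id") == target_id:
            return record.get("go") or []
    return []


def diff_go(old_entries, new_entries):
    """Return (only_in_old, only_in_new) as lists of decoded entries."""
    # Serialise each entry canonically so that dict entries can be
    # compared as a multiset without assuming anything about their fields.
    old = Counter(json.dumps(e, sort_keys=True) for e in old_entries)
    new = Counter(json.dumps(e, sort_keys=True) for e in new_entries)
    only_old = [json.loads(k) for k in (old - new).elements()]
    only_new = [json.loads(k) for k in (new - old).elements()]
    return only_old, only_new
```

Feeding it the lines of the 23.09 and 23.12 part files that contain ENSG00000167207 should show which entries account for the difference in counts.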
@jdhayhurst has refreshed the API cache and now the returned data is consistent with the files.
Describe the bug The arrays of Gene Ontology terms I found in the "go": field of https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.12/output/etl/json/targets/part-*.json are sometimes shorter than the ones in the corresponding section of the targets' profile pages on https://platform.opentargets.org/. Is one of the two perhaps out of date?
Observed behaviour One example I ran into: the GO section on the NOD2 profile page lists 204 associated terms, whereas the JSON files only list 200 for that Ensembl ID.
Expected behaviour I expected the data on the FTP server to match the data in the indices used to respond to the GraphQL query for this section.
To Reproduce Steps to reproduce the behaviour:
grep -Fh '"id":"ENSG00000167207"' targets/part-*.json | jq -c '.go[]' | wc -l
or use the folder along with other folders such as go/ to host another installation of the Platform.
Additional context NOD2 was not the only gene I found to be affected.
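For anyone without `jq` at hand, the grep/jq pipeline in the reproduction steps can be approximated in plain Python. This is a rough sketch under the same assumption as that command (one JSON record per line, with top-level `id` and `go` fields); `count_go_terms` is a hypothetical helper name. Unlike `grep -F`, it only counts records whose top-level `id` matches, not every line that happens to contain the string.

```python
import glob
import json


def count_go_terms(pattern, target_id):
    """Count `go` entries for target_id across JSON-lines files matching pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Cheap substring pre-filter, similar in spirit to `grep -F`,
                # so that only candidate lines are parsed as JSON.
                if target_id not in line:
                    continue
                record = json.loads(line)
                if record.get("id") == target_id:
                    total += len(record.get("go") or [])
    return total
```

Example call, run from the directory that contains the downloaded `targets/` folder: `count_go_terms("targets/part-*.json", "ENSG00000167207")`.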