In fact – it never stops downloading the exact same data. I've stopped the spider and frozen the publication in the registry.
The spider generates the links correctly. The issue is that the offset parameter does not work on the record endpoint, so the endpoint always returns the first page regardless of the offset. For example, https://api.quienesquien.wiki/v3/record?limit=100&offset=0 and https://api.quienesquien.wiki/v3/record?limit=100&offset=100 return the same results.
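The comparison described above can be sketched as a small check. This is only an illustration: `build_url`, `fetch_json` and `offset_is_ignored` are hypothetical helper names, not part of the spider, and the check assumes that an ignored offset shows up as two identical response bodies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in this thread.
BASE_URL = "https://api.quienesquien.wiki/v3/record"


def build_url(base_url, limit, offset):
    """Build a paginated URL with limit and offset query parameters."""
    return f"{base_url}?{urlencode({'limit': limit, 'offset': offset})}"


def fetch_json(url):
    """Download a URL and parse the body as JSON (hypothetical helper)."""
    with urlopen(url) as response:
        return json.load(response)


def offset_is_ignored(fetch, base_url=BASE_URL, limit=100):
    """Return True if the first and second pages are identical.

    `fetch` is any callable that maps a URL to parsed data, so the check
    can be run against the live API or against canned responses.
    """
    first_page = fetch(build_url(base_url, limit, 0))
    second_page = fetch(build_url(base_url, limit, limit))
    return first_page == second_page
```

Calling `offset_is_ignored(fetch_json)` would run the check against the live API. If the response body echoes the requested offset, compare only the list of records instead of the whole body.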
Should we document this as part of the spider docs or just delete it as it is not useful?
mexico_quien_es_quien_releases is working as expected, but the releases endpoint always returns a count of 10,000 (e.g. https://api.quienesquien.wiki/v3/contracts?sort=date&sort_direction=desc&limit=1000&offset=1000), so I guess we can't get all the data with that endpoint either.
Should we delete both of them?
Let's report it to them, and then delete it (and maybe remove from registry, as it is quite incomplete).
The limitation is due to their configuration/use of Elasticsearch. We have guidance here that we can share: https://standard.open-contracting.org/latest/en/guidance/build/hosting/#completeness
Hmm, actually, for the /contracts endpoint, we can maybe do date searches to stay within the 10k limit (I think we do something similar for OpenOpps). https://qqwapi-elastic.readthedocs.io/es/latest/detalle/endpoints/#contracts
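A rough sketch of the date-search idea, assuming the API can report how many contracts fall within a date range. `split_windows` and `count_for` are hypothetical names. The actual date filter parameters are not shown here and would need to be taken from the linked endpoint documentation.

```python
from datetime import date, timedelta

# Result cap reported in this thread.
MAX_RESULTS = 10_000


def split_windows(start, end, count_for, max_results=MAX_RESULTS):
    """Yield (start, end, count) for inclusive date windows under the cap.

    `count_for(start, end)` is a caller-supplied function (hypothetical)
    that returns the number of contracts dated within the window.
    """
    count = count_for(start, end)
    if count <= max_results or start == end:
        # A single day cannot be split further, even if it exceeds the cap.
        yield start, end, count
        return
    # Halve the window and recurse into each half.
    middle = start + timedelta(days=(end - start).days // 2)
    yield from split_windows(start, middle, count_for, max_results)
    yield from split_windows(middle + timedelta(days=1), end, count_for, max_results)
```

Each yielded window can then be paginated with limit and offset as usual. A single day with more than 10,000 contracts would still be truncated, so the spider should at least log that case.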
We would still need to delete the /record endpoint, as it is irrecoverably broken.
Can check this with:
IntegrityError: duplicate key value violates unique constraint "unique_record_identifiers"
Sentry Issue: REGISTRY-KINGFISHER-PROCESS-90