open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

mexico_quien_es_quien_records: Downloads the same file over 2000 times #1054

Closed: sentry-io[bot] closed this issue 5 months ago

sentry-io[bot] commented 7 months ago

You can check this with:

md5sum /data/storage/kingfisher-collect/mexico_quien_es_quien_records/20240207_000156/**/*

IntegrityError: duplicate key value violates unique constraint "unique_record_identifiers"

Sentry Issue: REGISTRY-KINGFISHER-PROCESS-90

UniqueViolation: duplicate key value violates unique constraint "unique_record_identifiers"
DETAIL:  Key (collection_id, ocid)=(1984, ocds-0ud2q6-AA-019GYR069-E374-2017) already exists.

  File "django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)

IntegrityError: duplicate key value violates unique constraint "unique_record_identifiers"
DETAIL:  Key (collection_id, ocid)=(1984, ocds-0ud2q6-AA-019GYR069-E374-2017) already exists.

(11 additional frame(s) were not displayed)
...
  File "process/management/commands/file_worker.py", line 81, in callback
    upgraded_collection_file_id = process_file(collection_file)
  File "process/management/commands/file_worker.py", line 136, in process_file
    _store_data(collection_file, package, releases_or_records, data_type, upgrade=False)
  File "process/management/commands/file_worker.py", line 276, in _store_data
    ).save()

IntegrityError maybe caused by duplicate message "{\"collection_id\":1984,\"collection_file_id\":248670680}", skipping
jpmckinney commented 7 months ago

In fact – it never stops downloading the exact same data. I've stopped the spider and frozen the publication in the registry.

yolile commented 5 months ago

The spider is generating the links correctly; the issue is that the offset parameter is not working on the record endpoint, so the endpoint always responds with the first page regardless of the offset. For example,

https://api.quienesquien.wiki/v3/record?limit=100&offset=0 and https://api.quienesquien.wiki/v3/record?limit=100&offset=100

return the same response.
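A defensive fix on the spider side, regardless of what the API does, is to stop paginating as soon as a page repeats. This is a hypothetical sketch (the `fetch_page` callable and `limit` parameter are illustrative, not the spider's actual code), but it shows why comparing against the previous page turns an infinite crawl into a finite one:

```python
def paginate(fetch_page, limit=100):
    """Yield pages until the endpoint returns an empty or repeated page.

    fetch_page(offset) is a hypothetical callable returning the parsed page.
    If the offset parameter is broken (as on /record), every page is
    identical, so the comparison with the previous page stops the loop
    after one page instead of downloading the same file thousands of times.
    """
    offset = 0
    previous = None
    while True:
        page = fetch_page(offset)
        if not page or page == previous:
            return
        yield page
        previous = page
        offset += limit
```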

Should we document this as part of the spider docs, or just delete the spider, as it is not useful?

mexico_quien_es_quien_releases is working as expected, but I see the releases endpoint always returns a count of 10,000 (e.g. https://api.quienesquien.wiki/v3/contracts?sort=date&sort_direction=desc&limit=1000&offset=1000), so I guess we can't get all the data from that endpoint either.

Should we delete both of them?

jpmckinney commented 5 months ago

Let's report it to them, and then delete it (and maybe remove from registry, as it is quite incomplete).

The limitation is due to their configuration/use of Elasticsearch. We have guidance here that we can share: https://standard.open-contracting.org/latest/en/guidance/build/hosting/#completeness

jpmckinney commented 5 months ago

Hmm, actually, for the /contracts endpoint, we can maybe do date searches to stay within the 10k limit (I think we do something similar for OpenOpps). https://qqwapi-elastic.readthedocs.io/es/latest/detalle/endpoints/#contracts
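One way to sketch the date-search idea: partition the full date range into month-long windows and issue one query per window, so each result set stays under the 10,000-hit cap. The actual filter parameter names would need to be confirmed against the API docs linked above; this only shows the window generation.

```python
from datetime import date, timedelta

def month_windows(start, stop):
    """Yield (first_day, last_day) pairs covering [start, stop] month by month.

    Each window would back one /contracts query with a hypothetical date
    filter, keeping every response below Elasticsearch's 10,000-result cap
    (assuming no single month exceeds 10,000 contracts).
    """
    current = start.replace(day=1)
    while current <= stop:
        if current.month == 12:
            next_month = current.replace(year=current.year + 1, month=1)
        else:
            next_month = current.replace(month=current.month + 1)
        yield max(current, start), min(next_month - timedelta(days=1), stop)
        current = next_month
```

If a single month still exceeds the cap, the same idea recurses to weekly or daily windows, similar to what the OpenOpps spider does.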

We would still need to delete the /record endpoint, as it is irrecoverably broken.