open-contracting / kingfisher-process

Stores and pre-processes OCDS data in a SQL database
https://kingfisher-process.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
2 stars 8 forks source link

DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type json (1e9999) #323

Open sentry-io[bot] opened 3 years ago

sentry-io[bot] commented 3 years ago

Sentry Issue: KINGFISHER-PROCESS-7X

InvalidTextRepresentation: invalid input syntax for type json
LINE 1: ...data) VALUES ('fe8a9342ae1a38e6ef3c574e69a8d21f', '{"ocid": ...
                                                             ^
DETAIL:  Token "Infinity" is invalid.
CONTEXT:  JSON data, line 1: ...tDetails": {"maximumLotsBidPerSupplier": Infinity...

  File "sqlalchemy/engine/base.py", line 1236, in _execute_context
    cursor, statement, parameters, context
  File "sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)

DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type json
LINE 1: ...data) VALUES ('fe8a9342ae1a38e6ef3c574e69a8d21f', '{"ocid": ...
                                                             ^
DETAIL:  Token "Infinity" is invalid.
CONTEXT:  JSON data, line 1: ...tDetails": {"maximumLotsBidPerSupplier": Infinity...
 [SQL: 'INSERT INTO data (hash_md5, data) VALUES (%(hash_md5)s, %(data)s) RETURNING data.id'] [parameters: {'hash_md5': 'fe8a9342ae1a38e6ef3c574e69a8d21f', 'data': '{"oc...
(18 additional frame(s) were not displayed)
...
  File "sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "sqlalchemy/util/compat.py", line 296, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "sqlalchemy/util/compat.py", line 276, in reraise
    raise value.with_traceback(tb)
  File "sqlalchemy/engine/base.py", line 1236, in _execute_context
    cursor, statement, parameters, context
  File "sqlalchemy/engine/default.py", line 536, in do_execute
    cursor.execute(statement, parameters)

This happens with the UK spider, eg https://www.find-tender.service.gov.uk/api/1.0/ocdsReleasePackages?limit=100&updatedTo=2021-01-13T16:21:39&cursor=dXBkYXRlZFRvPTIwMjEtMDEtMTNUMTY6MjE6Mzl8bmV4dEN1cnNvcj0xNTAxMzM=

cc @aguilerapy @duncandewhurst

jpmckinney commented 3 years ago

Ah, yes, per our guidance, unlimited maximum lots is expressed as 1e9999, which gets parsed as Infinity. However, SQLAlchemy doesn't know how to serialize Infinity to JSON.

Ideally, we would update the SQL adapter. A quick fix would instead replace occurrences of 1e9999 with a very large 32-bit integer, so that PostgreSQL can store it, without changing the meaning of the data.

FYI @jakubkrafka as an edge case for the Django version.

duncandewhurst commented 3 years ago

A quick fix would instead replace occurrences of 1e9999 with a very large 32-bit integer, so that PostgreSQL can store it, without changing the meaning of the data.

@jpmckinney is that something we can do in the Kingfisher Collect spider? Or do you mean do a manual find and replace and then local load the data?

jpmckinney commented 3 years ago

It can be done in either Kingfisher Collect or Kingfisher Process. Maybe better in Process, since 1e9999 is allowed in JSON - it's just that Process currently stumbles on it. But in a pinch an analyst can also do the manual find-and-replace.

duncandewhurst commented 3 years ago

Is it worth fixing in Kingfisher Process before the rewrite? The next mini-sprint is scheduled for w/c 9th February, so it may be better to apply the fix in the scraper as we may run several scrapes before then if the publisher is responsive to feedback and it would be good to save analyst time on applying the fix manually.

jpmckinney commented 3 years ago

Sure, we can have a temporary fix in Kingfisher Collect (cc @yolile), and also open an issue in Kingfisher Collect to remove that fix once this issue is closed in the new version of Kingfisher Process.

yolile commented 3 years ago

@duncandewhurst I fixed the issue in collect and ran the spider successfully now

jpmckinney commented 4 months ago

Not a priority as it only affects https://github.com/open-contracting/kingfisher-collect/blob/main/kingfisher_scrapy/spiders/united_kingdom_fts.py#L48 and the data modification doesn’t impact correct data analysis.