internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.16k stars 1.35k forks

Import API should validate input #5833

Open cdrini opened 2 years ago

cdrini commented 2 years ago

We want the import API to validate the provided JSON and reject any invalid documents.

Plan

One of the issues: currently, the only JSON schema we have for this API resides in openlibrary-client: https://github.com/internetarchive/openlibrary-client/blob/99e600136f033a034690f947e3f09d930455c2c9/olclient/schemata/import.schema.json . We would like to avoid duplicating it if possible. Longer term, these JSON schema files should be auto-generated by the server from the infogami type, since that is the actual ground truth (e.g. https://openlibrary.org/type/work). Note that this may not apply to the import API (?).
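To illustrate the validation step, here is a minimal sketch of what server-side rejection of invalid records could look like. The field names and checks are illustrative only; the real rules would come from openlibrary-client's import.schema.json (or, longer term, from the generated schema):

```python
# Hypothetical minimal validator sketch. The real schema lives in
# openlibrary-client's import.schema.json; this only illustrates the
# "validate, then reject with actionable errors" shape of the API.

REQUIRED_FIELDS = {"title": str, "source_records": list}


def validate_import_record(record):
    """Return a list of error messages; an empty list means the record is valid."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} must be of type {expected_type.__name__}")
    return errors


ok_errors = validate_import_record(
    {"title": "Dune", "source_records": ["ia:dune0000herb"]}
)
bad_errors = validate_import_record({"title": 42})
```

The key design point is that the validator returns all errors rather than failing on the first one, so a rejected document can be fixed in a single round trip.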

Additional context

Stakeholders

@hornc @cdrini @mekarpeles @jimchamp

mekarpeles commented 2 years ago

https://github.com/internetarchive/openlibrary-client/pull/272

jimchamp commented 2 years ago

In order to consider this issue closed, the following must be true:

  1. If our import process fails for any reason, the team should be notified within an hour.
  2. If any item fails to be imported, the actual reason needs to be logged somewhere (maybe Sentry?), and a more precise error message should be included in the /admin/imports table.

Even if Pydantic is giving us validation error messages that are detailed enough to correct invalid records, the records can't be corrected if the error messages are not being surfaced in some way. Similarly, when looking at the /admin/imports table, it's impossible to tell what went wrong if nearly all of the recorded errors are internal-error or unknown-error.

We can use statuses like invalid-data or missing-data when an import item fails validation. Other statuses that reflect what actually happened can also be added (infobase_offline, for example). As a catch-all, unknown-error can be used, but we should strive to identify what the error is and use a more illuminating status for those cases.
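One way to keep the catch-all rare is to centralize the mapping from failure cause to status. A rough sketch, where the exception types and status strings are illustrative rather than the actual import pipeline's:

```python
# Hypothetical sketch of classifying import failures into the statuses
# proposed above. The exception types stand in for whatever the real
# pipeline raises; only unmapped failures fall through to unknown-error.

def classify_import_failure(exc):
    """Map an exception from the import pipeline to an /admin/imports status."""
    if isinstance(exc, ValueError):       # e.g. schema validation failure
        return "invalid-data"
    if isinstance(exc, KeyError):         # e.g. a required field was absent
        return "missing-data"
    if isinstance(exc, ConnectionError):  # e.g. infobase unreachable
        return "infobase_offline"
    return "unknown-error"                # catch-all, to be minimized
```

With a single chokepoint like this, every new failure mode we identify becomes one extra branch, and the /admin/imports table gets progressively more precise.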

One thing that we will have to figure out is how to determine when the import process needs human intervention, and how to notify said humans. Here are some ideas that Mek and I discussed today:

SvanteRichter commented 2 years ago

I happened upon a \u0000 character in a book title that broke the UI for both viewing and editing. I sent in a support request; it was fixed, and I was pointed to this issue.

In the support request I stated that this led to invalid JSON in the data dumps, but upon reading the JSON RFC I realized that it is actually valid JSON; it's just not supported by PostgreSQL (which I use for importing the data dumps). My question is: could the validator disallow null bytes (or rather their JSON-escaped representation) in text data, even though they are valid JSON? There seems to be no valid use case for null bytes, and they may cause issues both for the openlibrary.org site and for consumers of the data dumps like me. Let me know if this belongs in a separate issue instead.
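Such a check is cheap to bolt onto the validation step. A sketch, assuming the record arrives as a parsed JSON object (the traversal is illustrative, not the actual import code):

```python
# Sketch of rejecting NUL characters ("\u0000" in JSON escape form) in
# any string value of an import record. Per RFC 8259 this is valid JSON,
# but PostgreSQL text columns cannot store it, and it breaks the UI.

def contains_null_bytes(value):
    """Recursively check strings, lists, and dicts for NUL characters."""
    if isinstance(value, str):
        return "\x00" in value
    if isinstance(value, list):
        return any(contains_null_bytes(item) for item in value)
    if isinstance(value, dict):
        return any(contains_null_bytes(v) for v in value.values())
    return False


clean = {"title": "Fine Title", "authors": [{"name": "A. Author"}]}
broken = {"title": "Bad\x00Title"}
```

A record failing this check could then be rejected with an invalid-data status, keeping NUL bytes out of both the site and the data dumps.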