internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.16k stars 1.35k forks

Import API should validate input #5833

Open cdrini opened 2 years ago

cdrini commented 2 years ago

We want the import API to validate the provided JSON and reject any invalid documents.

Plan

One of the issues: currently, the only JSON schema we have for this API resides in openlibrary-client: https://github.com/internetarchive/openlibrary-client/blob/99e600136f033a034690f947e3f09d930455c2c9/olclient/schemata/import.schema.json . We would like to avoid duplicating it if possible. Longer term, these JSON schema files should be auto-generated by the server from the infogami type, since that is the actual ground truth (e.g. https://openlibrary.org/type/work). Note that this may not apply to the import API (?).
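To illustrate the validation step, here is a minimal sketch of what server-side rejection of invalid records could look like. The field names and checks are illustrative only; the real rules would come from openlibrary-client's import.schema.json (or, longer term, from the generated schema):

```python
# Hypothetical minimal validator sketch. The real schema lives in
# openlibrary-client's import.schema.json; this only illustrates the
# "validate, then reject with actionable errors" shape of the API.

REQUIRED_FIELDS = {"title": str, "source_records": list}


def validate_import_record(record):
    """Return a list of error messages; an empty list means the record is valid."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} must be of type {expected_type.__name__}")
    return errors


ok_errors = validate_import_record(
    {"title": "Dune", "source_records": ["ia:dune0000herb"]}
)
bad_errors = validate_import_record({"title": 42})
```

The key design point is that the validator returns all errors rather than failing on the first one, so a rejected document can be fixed in a single round trip.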

Additional context

Stakeholders

@hornc @cdrini @mekarpeles @jimchamp

mekarpeles commented 2 years ago

https://github.com/internetarchive/openlibrary-client/pull/272

jimchamp commented 2 years ago

In order to consider this issue closed, the following must be true:

  1. If our import process fails for any reason, the team should be notified within an hour.
  2. If any item fails to be imported, the actual reason needs to be logged somewhere (maybe Sentry?), and a more precise error message should be included in the /admin/imports table.

Even if Pydantic is giving us validation error messages that are detailed enough to correct invalid records, the records can't be corrected if the error messages are not being surfaced in some way. Similarly, when looking at the /admin/imports table, it's impossible to tell what went wrong if nearly all of the recorded errors are internal-error or unknown-error.

We can use statuses like invalid-data or missing-data when an import item fails validation. Other statuses that reflect what actually happened can also be added (infobase_offline, for example). As a catch-all, unknown-error can be used, but we should strive to identify what the error is and use a more illuminating status for those cases.
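One way to keep the catch-all rare is to centralize the mapping from failure cause to status. A rough sketch, where the exception types and status strings are illustrative rather than the actual import pipeline's:

```python
# Hypothetical sketch of classifying import failures into the statuses
# proposed above. The exception types stand in for whatever the real
# pipeline raises; only unmapped failures fall through to unknown-error.

def classify_import_failure(exc):
    """Map an exception from the import pipeline to an /admin/imports status."""
    if isinstance(exc, ValueError):       # e.g. schema validation failure
        return "invalid-data"
    if isinstance(exc, KeyError):         # e.g. a required field was absent
        return "missing-data"
    if isinstance(exc, ConnectionError):  # e.g. infobase unreachable
        return "infobase_offline"
    return "unknown-error"                # catch-all, to be minimized
```

With a single chokepoint like this, every new failure mode we identify becomes one extra branch, and the /admin/imports table gets progressively more precise.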

One thing that we will have to figure out is how to determine when the import process needs human intervention, and how to notify said humans. Here are some ideas that Mek and I discussed today:

SvanteRichter commented 2 years ago

I happened upon a \u0000 character in a book title that broke the UI for both viewing and editing. I sent in a support request; it was fixed, and I was pointed to this issue.

In the support request I stated that this led to invalid JSON in the data dumps, but upon reading the JSON RFC I realized that it is actually valid JSON; it's just not supported by PostgreSQL (which I use for importing the data dumps). My question is: could the validator disallow null bytes (or rather their JSON-escaped representation) in text data, even though they are valid JSON? There seems to be no valid use case for null bytes, and they may cause issues both for the openlibrary.org site and for consumers of the data dumps like me. Let me know if this belongs in a separate issue instead.
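Such a check is cheap to bolt onto the validation step. A sketch, assuming the record arrives as a parsed JSON object (the traversal is illustrative, not the actual import code):

```python
# Sketch of rejecting NUL characters ("\u0000" in JSON escape form) in
# any string value of an import record. Per RFC 8259 this is valid JSON,
# but PostgreSQL text columns cannot store it, and it breaks the UI.

def contains_null_bytes(value):
    """Recursively check strings, lists, and dicts for NUL characters."""
    if isinstance(value, str):
        return "\x00" in value
    if isinstance(value, list):
        return any(contains_null_bytes(item) for item in value)
    if isinstance(value, dict):
        return any(contains_null_bytes(v) for v in value.values())
    return False


clean = {"title": "Fine Title", "authors": [{"name": "A. Author"}]}
broken = {"title": "Bad\x00Title"}
```

A record failing this check could then be rejected with an invalid-data status, keeping NUL bytes out of both the site and the data dumps.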