Open cdrini opened 2 years ago
In order to consider this issue closed, the following must be true:
/admin/imports
table.Even if Pydantic is giving us validation error messages that are detailed enough to correct invalid records, the records can't be corrected if the error messages are not being surfaced in some way. Similarly, when looking at the /admin/imports
table, it's impossible to tell what went wrong if nearly all of the recorded errors are internal-error
or unknown-error
.
We can uses statuses like invalid-data
or missing-data
when an import item fails validation. Other statuses that reflect what actually happened can also be added (infobase_offline
, for example). As a catch-all unknown-error
can be used, but we should strive to identify what the error is and use a more illuminating status for those cases.
One thing that we will have to figure out is how to determine when the import process needs human intervention, and how to notify said humans. Here are some ideas that Mek and I discussed today:
unknown-error
s or internal-error
s have occurred.I happened upon a \u0000 char in a book title that broke the UI for both viewing and editing, I sent in a support request and it was fixed and I was pointed to this issue. In the support request I stated that this led to invalid JSON in the data dumps, but upon reading the JSON RFC I realized that it is actually valid JSON, it's just not supported by postgresql (which I use for importing the data dumps). My question is if perhaps the validator can (even if it is valid json) disallow null bytes (or rather the JSON escaped representation of them) in text data? To me there seems to be no valid usecase to allow null bytes, but it may cause issues with both the openarchive.org site and for consumers of the data dumps like me. Let me know if this belongs in a separate issue instead.
Want the import api to validate provided JSON, and reject any invalid documents.
Plan
One of the issues: currently the only json schema we have for this API resides in openlibrary-client: https://github.com/internetarchive/openlibrary-client/blob/99e600136f033a034690f947e3f09d930455c2c9/olclient/schemata/import.schema.json . We would like to avoid duplicating this if possible. Note longer term, these json schema files should be auto-generated by the server from the infogami type, since that is the actual ground truth. (eg https://openlibrary.org/type/work). Note this doesn't apply to the import api (?).
solr_types_generator.py
which generates aSolrDocument
classAdditional context
Stakeholders
@hornc @cdrini @mekarpeles @jimchamp