Catch all errors on upload

metasj commented 5 years ago

New uploads are currently uploaded, added to IPFS, pinned to our IPFS cluster, indexed in Elasticsearch, and classified.

At present, we catch upload failures and IPFS failures (confirm?) and show them to the uploader. If indexing fails we don't catch the Elasticsearch error, but could. If classification fails [on Google's server], they don't give us any notice, but we can see when documents fail to have associated CPC classifications.

Case in point: a week back, uploads stopped being indexed because we'd been using a free-tier version of a service that shut down after hitting its monthly quota. This failed silently, but was noticed once an uploader was unable to find their own works after upload.
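To make the "fails silently" problem concrete, here is a minimal sketch of running the pipeline stages so that every failure is recorded against the stage that caused it. The stage names, the `StageResult` type, and `run_pipeline` are hypothetical and not taken from the actual codebase; the sketch also assumes that later stages depend on earlier ones, so it stops at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StageResult:
    stage: str                   # hypothetical stage name, e.g. "index"
    ok: bool
    error: Optional[str] = None  # "ExceptionType: message" when ok is False


def run_pipeline(
    doc: dict,
    stages: List[Tuple[str, Callable[[dict], None]]],
) -> List[StageResult]:
    """Run each stage in order and record which one failed, if any."""
    results: List[StageResult] = []
    for name, stage in stages:
        try:
            stage(doc)
            results.append(StageResult(name, True))
        except Exception as exc:
            # Record the failure instead of letting it vanish.
            results.append(StageResult(name, False, f"{type(exc).__name__}: {exc}"))
            break  # assumption: later stages need this one to have succeeded
    return results
```

The returned list could be shown to the uploader, logged, or both; which of those is appropriate per stage is still an open question in this thread.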

joeltg commented 5 years ago

Not sure what actions to take here

- We should show CPC codes (or the lack thereof) on the document page, the search page, and possibly on the org page if we have some kind of rich document list/preview thing. We don't now because there are no CPC codes to show.
- Elasticsearch errors are caught and logged.

slifty commented 5 years ago

Right now the scope of this issue is a bit broad -- catching "all" errors is a noble goal, but by nature is hard to ever confidently mark as "complete." I think the intended goal here is setting up a framework / tooling for more comprehensive system health monitoring. We want to know and be made loudly aware when pieces of our architecture are not working.

In some cases this will be powered by monitoring log streams for errors; in others it might be powered by detecting a lack of expected processing (e.g. we expect X to run once a day). Doing this well is a significant undertaking and will involve a more comprehensive analysis of the possible failures for each system component in this project.
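As a rough illustration of the "lack of expected processing" idea, the sketch below flags any job whose last recorded success is older than its allowed interval, or that has never succeeded at all. The job names, the `find_stale_jobs` function, and the idea of storing last-success timestamps somewhere are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def find_stale_jobs(
    last_success: Dict[str, datetime],
    max_age: Dict[str, timedelta],
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of jobs that have not succeeded recently enough."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for job, limit in max_age.items():
        seen = last_success.get(job)
        # A job with no recorded success counts as stale too.
        if seen is None or now - seen > limit:
            stale.append(job)
    return sorted(stale)


# Example with hypothetical job names and timestamps:
now = datetime(2020, 1, 10, 12, 0, tzinfo=timezone.utc)
print(find_stale_jobs(
    last_success={"index": now - timedelta(days=7)},
    max_age={"index": timedelta(days=1), "classify": timedelta(days=1)},
    now=now,
))  # prints ['classify', 'index']
```

A check like this would have flagged the indexing outage within a day rather than waiting for an uploader to notice.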

For instance, this will include health checks to ensure the site is up, that the database is accepting connections, etc. @metasj just laid out a draft breakdown of the various pieces of the upload pipeline. The architecture diagram listed in the overview README.md provides still more system components.
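A "database is accepting connections" check can start as simply as a TCP connect with a timeout. This is only a sketch: `accepts_connections` is a hypothetical helper, and an open port shows that something is listening, not that the database can actually serve queries.

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A fuller health check would likely follow this with a trivial query against each service, but the connect test is cheap enough to run frequently.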

We will want to be sure not to capture things that aren't errors (e.g. to @joeltg's first point -- is the lack of a CPC code on an individual document an error or simply a state? What could we do to correct it?)

prior-art-archive / priorartarchive.org

Catch all errors on upload #15