sul-dlss / sul-embed

An oEmbed Service for Stanford University Libraries

Imageserver stops working on stage when a file can't be found (not always the same file) #2148

Open andrewjbtw opened 2 months ago

andrewjbtw commented 2 months ago

Occasionally, the imageserver stops serving images in the stage environment. This generally shows up in sul-embed as network "failed to fetch" errors. When this happens, the failures appear to be across the board rather than affecting only a subset of SDR items.

There are two imageserver nodes in stage and you can view their status at these links:

http://sul-imageserver-stage-a.stanford.edu/health
http://sul-imageserver-stage-b.stanford.edu/health

When the imageserver is having a problem, one or both of those health checks will have the "color" of "RED". There will also be a message like:

“/stacks/dm/057/nt/0476/asawa.jp2 (No such file or directory) (dm/057/nt/0476/asawa.jp2 -> edu.illinois.library.cantaloupe.source.FilesystemSource)”
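For reference, here's a rough sketch of how the two stage nodes could be polled so a RED status gets noticed sooner. It assumes the /health endpoint returns JSON with "color" and "message" fields, which is what the output above suggests; that format should be verified against the actual Cantaloupe response before relying on it.

```python
# Minimal sketch: poll both stage imageserver health endpoints and
# report each node's health "color" and any message.
# Assumes the /health endpoint returns JSON with "color" and "message"
# fields, as the error output above suggests.
import json
import urllib.request

HEALTH_URLS = [
    "http://sul-imageserver-stage-a.stanford.edu/health",
    "http://sul-imageserver-stage-b.stanford.edu/health",
]

def check_health(url: str) -> dict:
    """Fetch one health endpoint and return its parsed JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for url in HEALTH_URLS:
        try:
            status = check_health(url)
        except Exception as exc:  # network errors, non-200 responses, bad JSON
            print(f"{url}: UNREACHABLE ({exc})")
            continue
        color = status.get("color", "UNKNOWN")
        message = status.get("message") or ""
        print(f"{url}: {color} {message}".rstrip())
```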

In each of the instances where I've investigated this issue, the file reported missing in the error message was a file that appeared to have been deleted properly. By "deleted properly" I mean that I've checked Argo and the item history shows that someone intentionally removed the file, which happens when someone changes the file's "shelve" status to "no". These have not been cases where the file was deleted outside of SDR processes, such as someone deleting it directly on the filesystem.

To resolve this error, what I've done is put the file back at the path indicated in the message. After doing that, the health check turns back to green. The imageserver doesn't seem to care whether the file is the same file as before, only that a file exists at the indicated path. You could probably just run `touch /path/to/missing/file` to clear the check.
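As a sketch of that workaround, assuming the RED health message always begins with the missing /stacks path as in the example above (the exact message format is an assumption), something like this could extract the path and create the placeholder file:

```python
# Rough sketch of the workaround above: pull the missing path out of a
# RED health-check message and `touch` an empty placeholder file there.
# Assumes the message begins with the missing path, as in the example
# above, and that the parent directory still exists.
import re
from pathlib import Path

def missing_path_from_message(message: str) -> str | None:
    """Return the leading filesystem path from a RED health message, if any."""
    match = re.match(r"^(/\S+)\s+\(No such file or directory\)", message)
    return match.group(1) if match else None

def touch_placeholder(message: str) -> Path | None:
    """Create an empty file at the reported path (equivalent to `touch`)."""
    path_str = missing_path_from_message(message)
    if path_str is None:
        return None
    path = Path(path_str)
    path.touch(exist_ok=True)
    return path

if __name__ == "__main__":
    example = ("/stacks/dm/057/nt/0476/asawa.jp2 (No such file or directory) "
               "(dm/057/nt/0476/asawa.jp2 -> "
               "edu.illinois.library.cantaloupe.source.FilesystemSource)")
    print(missing_path_from_message(example))
```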

The odd thing is that once the error is cleared, we've found that you can then delete the same file and the error will not come back.

It is not clear what specifically generates this error, but since it apparently takes the whole server down, we would benefit from figuring out what's going on. I should note that I have never seen this issue in production, only in stage.

Frequency of occurrence

The first time I remember being aware enough of this issue to monitor it was 2023-09 (see related Slack thread).

This happened again on 2024-05-01. The imageserver reported a file missing from an item that I had made dark. I reaccessioned the item to shelve the file and then the healthcheck turned green. Deleting the file again later did not trigger a recurrence of the error.

I had not been tracking occurrences, so those are the only two that I can identify with specific time frames.

Based on standup discussion this morning (2024-05-02) we decided we should create an issue, if only to have a place to track recurrences.

andrewjbtw commented 2 months ago

Also, this issue is starting out in the embed repo because it's not clear whether we have a more specific repo for it.