octue / octue-sdk-python

The python SDK for @Octue services and digital twins.
https://octue.com
Other
10 stars 4 forks source link

Make local/temporary file/directory deletion thread-safe #666

Open cortadocodes opened 4 months ago

cortadocodes commented 4 months ago

Bug report

What is the current behavior?

The Runner class and service configuration have a delete_local_files argument which deletes any datafiles downloaded and any registered temporary directories at the end of an analysis run. We know (and warn) that this isn't thread-safe. However, we've recently seen the same consequences of thread unsafety in what should be a thread-safe environment (separate containers running separate processes with supposedly separate filesystems in Cloud Run).

To get our containers to run reliably, we've recently had to disable local file deletion (delete_local_files=False is now the default). We now have a potential problem of running out of memory when running our containers in Cloud Run where, by default, local storage is actually in-memory. We need to find a way to clean up files from one analysis before the next is run.

The actual error we saw was a FileNotFoundError which would occur when using functions from the os.path module e.g. when trying to get the relative path of a datafile in a dataset so it could be downloaded/uploaded or when trying to get the absolute path of a metadata file.

We triggered this error when asking 24 questions (6 in parallel at a time) to one of our services with concurrency set to 1 (1 request (question) at a time per container). The first several questions would work ok (between 4 and 8), then the rest would fail. We think what happened was this:

  1. The first few questions would get new containers
  2. These questions would finish and clean up successfully
  3. The rest of the questions would be taken up by the containers that were already running and would attempt to use temporary directories. This would somehow fail because of the previous cleanup.

So, despite concurrency being 1, each container would attempt to answer multiple questions in serial. This should have been fine if each container had its own filesystem, but a new shared filesystem (between containers in a Cloud Run service instance) is being trialled in Cloud Run that we think makes it effectively thread-unsafe to clean up like this. I think it's possible that separate containers are interfering with each others list of registered temporary directories; it's also possible it's a problem within each container but caused by serial analyses.

What is the expected behavior?

We can automatically delete downloaded and temporary files/directories at the end of each analysis run.

Proposed Solution

Some ideas: