golemcloud / golem

Golem is an open source durable computing platform that makes it easy to build and deploy highly reliable distributed systems.
https://learn.golem.cloud/
Apache License 2.0
530 stars 59 forks source link

Initial file system followup: Fix race condition in file-loader #1058

Closed mschuwalow closed 1 week ago

mschuwalow commented 1 week ago

I saw the test failure on master and looked into the cause.

The problem is that the file-loader uses the key of the intial file as the path of the file in it's local cache. This can run into the following sequence when multiple workers are using the same file:

  1. Worker A requires file a
  2. file loader downloads the file and stores a copy in ${cache_dir}/a. It returns an Arc to the worker that will delete the file when it's dropped
  3. Worker A is shutting down and drops the last Arc for file a. This removes it from the cache as weak.upgrade is now going to fail. The file is not yet deleted.
  4. Worker B requires file a
  5. file loader checks the cache and doesn't find it. It creates a new cache entry and starts downloading the file again to ${cache_dir}/a
  6. Drop for the entry owned by Worker A runs, the underlying file ${cache_dir}/a is deleted while Worker B is still using it
  7. Boom

I resolved this in this pr by additionally having a counter in the file_loader and just giving each locally created file a unique id, instead of using the key as the name. This way worker B will just create a new file in the above scenario while worker A deletes the old one.

Ran the integration_test suite 10 times locally to check that it works fine, hopefully the tests are not flaky now.

vigoo commented 1 week ago

Ran the integration_test suite 10 times locally to check that it works fine, hopefully the tests are not flaky now.

Just FYI, test-r supports zio-test style #[nonflaky(10)] annotation to verify that :)