Hi Olga,
This is a known issue with S3 that we're working to address. The basic issue is that Reflow assumes that an operation (such as `dir("s3:///.../")`) produces the same results every time. This is true of, e.g., execs, where the operation is defined in terms of the command and its (concrete) dependencies; but it's trickier for S3, because all we have to go on is the URL itself.
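To make the contrast concrete, here is a minimal sketch (the bucket name and command are hypothetical): an exec's cache key is derived from its image, its command text, and the digests of its inputs, so changing any of those forces recomputation; `dir`'s key is derived from the URL string alone, which stays the same even when the bucket's contents change.
```
// Keyed by the URL only: adding or replacing objects under this
// prefix does not change the cache key, so a stale listing can
// be returned from cache.
val inputs = dir("s3://example-bucket/data/")

// Keyed by the image, the command, and the digests of its
// dependencies: any change to these produces a new cache key.
val count = exec(image := "ubuntu") (out file) {"
    ls {{inputs}} | wc -l > {{out}}
"}
```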
The plan is to record, alongside the cache key, a set of "assertions" about that key, which will include the contents of the directories as well as their S3 e-tags (which aren't necessarily usable from Reflow directly, but can be used to tell if a file's content changed); this is very high on our list of things to fix. (The general "assertion" mechanism is useful for other purposes as well.)
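To illustrate what an e-tag buys us (the bucket and key below are made up), S3 reports an ETag for every object, and it changes whenever the object's content is replaced, even though the s3:// URL stays identical; that is exactly the change signal a cached `dir` result is missing today:
```
# Inspect the ETag of an object; re-uploading different bytes under
# the same key changes the ETag while the s3:// URL stays the same.
$ aws s3api head-object --bucket example-bucket --key data/sample.csv --query ETag
```
(For multipart uploads the ETag isn't a plain MD5, which is why it isn't directly usable as a Reflow digest, only as a change detector.)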
In the meantime, you can invalidate single operations by using the `-invalidate` flag. The flag takes a regular expression of identifiers (`module.ident`) to invalidate. If you use this with `bottomup` evaluation, it should do the trick.
For example, for the following module:
```
aeon $ cat /tmp/test.rf
val files = dir("s3://grail-marius/example")
@requires(cpu := 1)
val Main = files
```
I can run the following command to re-download the files, without invalidating or skipping caching for other computations:
```
$ reflow run -eval=bottomup -invalidate=test.files /tmp/test.rf
...
ident       n  ncache  transfer  runtime(m)  cpu  mem(GiB)  disk(GiB)  tmp(GiB)
test.files  1  0       0B
```
(Note the `ncache=0`: test.files was not retrieved from cache.)
I know this is a kludge. It will get a lot better.
This is now being handled by assertions: https://github.com/grailbio/reflow/commit/e205ec2df52882f63041a302d20c40aecd117255#diff-275f8ed39b866d9c0f357447a6346800
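For readers arriving later: with a build that includes the change above, opting into assertion checking at run time should look roughly like the following (the `-assert=exact` flag name is my recollection of the released interface, so treat it as an assumption and check `reflow run -help` on your version):
```
# Verify recorded assertions (e.g. S3 ETags) before reusing cached
# results, so a dir() whose bucket contents changed is recomputed.
$ reflow run -assert=exact /tmp/test.rf
```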
Hello, I re-ran a workflow that operated on a directory after new files were added, but the cache was somehow not updated to use the new files. I could tell because the output is a square CSV of distances between cells, and this distance matrix was only 718x718 when there should be almost 1000 samples in the folder. If I added `-cache=off`, then the program would fetch ALL the files in the directory, but I'm wondering: why did the additional files not get detected? Did I need to specify something else? Here is a gist with the log file of the execution.
Thank you! Warmest, Olga