grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

New files in directories not detected #67

Closed olgabot closed 5 years ago

olgabot commented 6 years ago

Hello, I re-ran a workflow that operates on a directory after new files were added, but the cache was somehow not updated to include the new files. I could tell because the output is a square CSV of distances between cells, and this distance matrix was only 718x718 when there should be almost 1000 samples in the folder. If I added -cache=off, the program would fetch ALL the files in the directory. So I'm wondering: why were the additional files not detected? Did I need to specify something else?

Here is a gist with the log file of the execution.

Thank you! Warmest, Olga

mariusae commented 6 years ago

Hi Olga,

This is a known issue with S3 that we're working to address. The basic issue is that Reflow assumes that an operation (such as dir("s3:///.../")) produces the same results every time. This is true of, e.g., execs, where the operation is defined in terms of the command and its (concrete) dependencies; but it's trickier for S3, because all we have to go on is the URL itself.

The plan is to record, alongside the cache key, a set of "assertions" about that key, which will include the contents of the directories as well as their S3 ETags (which aren't necessarily usable from Reflow directly, but can be used to tell whether a file's content changed); this is very high on our list of things to fix. (The general assertion mechanism is useful for other purposes as well.)
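The assertion idea above can be sketched in Go as follows. This is a minimal, hypothetical model for illustration only (the type and function names are invented, not Reflow's actual API): at cache-write time, record each object's key and ETag alongside the cache key; on a cache hit, re-list the directory and reuse the cached result only if nothing was added, removed, or changed.

```go
package main

import "fmt"

// Assertion records the observed state of one S3 object at the time a
// result was cached. (Hypothetical names; Reflow's implementation differs.)
type Assertion struct {
	Key  string // object key within the directory
	ETag string // S3 ETag observed when the result was cached
}

// Valid reports whether a cached result may be reused: every asserted
// object must still exist with the same ETag, and the current directory
// listing must contain no objects that weren't asserted.
func Valid(assertions []Assertion, listing map[string]string) bool {
	if len(listing) != len(assertions) {
		return false // a file was added or removed
	}
	for _, a := range assertions {
		etag, ok := listing[a.Key]
		if !ok || etag != a.ETag {
			return false // a file was renamed or its content changed
		}
	}
	return true
}

func main() {
	cached := []Assertion{{"a.csv", "etag-1"}, {"b.csv", "etag-2"}}

	// Same listing as when the result was cached: still valid.
	fmt.Println(Valid(cached, map[string]string{
		"a.csv": "etag-1", "b.csv": "etag-2",
	}))

	// A new file appeared in the directory: the cached result is stale,
	// which is exactly the situation reported in this issue.
	fmt.Println(Valid(cached, map[string]string{
		"a.csv": "etag-1", "b.csv": "etag-2", "c.csv": "etag-3",
	}))
}
```

Under this model, the reported bug corresponds to a cache lookup that consulted only the URL-derived key and never re-checked the listing; the assertion check makes the new file force a miss.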

In the meantime, you can invalidate individual operations with the -invalidate flag. The flag takes a regular expression matching the identifiers (module.ident) to invalidate. If you use this together with bottom-up evaluation (-eval=bottomup), it should do the trick.

For example, given the following module:

aeon $ cat /tmp/test.rf
val files = dir("s3://grail-marius/example")

@requires(cpu := 1)
val Main = files

I can run the following command to re-download the files without invalidating or skipping caching for other computations:

$ reflow run -eval=bottomup -invalidate=test.files /tmp/test.rf 
...
    ident      n   ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
    test.files 1   0      0B                                         

(Note the ncache=0: test.files was not retrieved from cache.)

I know this is a kludge. It will get a lot better.

mariusae commented 5 years ago

This is now being handled by assertions: https://github.com/grailbio/reflow/commit/e205ec2df52882f63041a302d20c40aecd117255#diff-275f8ed39b866d9c0f357447a6346800