insitro / redun

Yet another redundant workflow engine
https://insitro.github.io/redun/
Apache License 2.0
517 stars 43 forks source link

More elaborate mechanism to detect code changes? #99

Open codethief opened 3 months ago

codethief commented 3 months ago

Consider this slightly modified version of 01_hello_world:

from redun import task

redun_namespace = "redun.examples.hello_world"

planet = "World"  # assume this to be a constant

@task()
def get_planet() -> str:
    return planet

@task()
def greeter(greet: str, thing: str) -> str:
    return "{}, {}!".format(greet, thing)

@task()
def main(greet: str="Hello") -> str:
    return greeter(greet, get_planet())

If I change the constant planet to another value like "Mars", redun will not bust the cache and instead still yield the output "Hello, World!" for the main task. This is of course not entirely surprising given how redun detects code changes, but I'd still say it represents a rather big footgun for people who are used to writing "normal" Python code.[0]

Similar situations arise when people use regular (non-@task) Python functions in their code.

In light of this, would it be feasible to offer an alternative caching mechanism which is less aggressive, i.e. more careful about what code changes could cause a task to change? Two ideas that come to mind:

  1. File changes to a given Python module could always bust the cache of all tasks in the module. Of course this would invalidate the cache much more frequently but, personally, I think I would still prefer this over accidentally running into incorrect cache hits time and again.

  2. Use Python's trace module to track calls to other functions and, more generally, which lines of code get executed. Use the content of those lines to compute a hash.

[0]: I am looking into using redun to define build & CI/CD pipelines for a (largely) non-Python project. Since the CI/CD code will not be something people will work on all the time, I would like to reduce footguns like erroneous cache hits as far as possible to make it easier for my coworkers to contribute to the pipelines.

mattrasmus commented 3 months ago

Hi @codethief thanks for asking this.

redun does use a conservative way of hashing code by default, which is to just hash the string of the source code within the task definition. That does mean that changes to constants defined outside the task will not get detected as changes that would bust the cache.

For common constants like that, one idiom is to make sure they are passed through as arguments to the task as default values.

MY_CONSTANT = "value"

@task
def foo(x, my_constant=MY_CONSTANT):
   # Now you can use x and my_constant

Changing "value" to something else like "value2" will cause an argument hash change and a proper re-execution of task foo.

More complex tracing or static analysis is not done because the results can often be surprising/unpredictable to users, such as a change to code imported from another module or library causing a cache bust. There might be a to define it wher eyou can have your cake and eat too, but thus far we've stuck to a conservative approach. We discuss some of the related details in the docs: https://insitro.github.io/redun/tasks.html#task-hashing

However, if you would like to implement a more complex hashing approach, especially for a build tool system, perhaps you might offer your users a task decorator that wraps redun's @task.

from redun import task

def build_step(**options):
    def deco(func):
        version = calc_complex_hash(func). # Here, you implement something more sophisticated.
        return task(version=version, **options)(func)  # Here we use redun's manual versioning to influence the task hash.
    return deco

@build_step()
def my_build_step(x, y):
    # ...

Let me know if that helps. We have implemented several wrapper task decorators internally for situations where we want to customize the task definition process.

Similar situations arise when people use regular (non-@task) Python functions in their code.

For cases like this, we do support an opt-in approach like this:

def double(x):
    # A normal function
    return 2 * x

@task(hash_includes=[double])
def my_task(x):
    return double(x) + 2

my_task's hash now will also depend on the source code of the double function.

codethief commented 3 months ago

Hi @mattrasmus, thanks so much for the elaborate response!

More complex tracing or static analysis is not done because the results can often be surprising/unpredictable to users, such as a change to code imported from another module or library causing a cache bust.

To be honest, I find such complex tracing to be the more conservative option, given that it matches at least my expectations of how caching should behave more closely. Then again, you seem to have put a lot of thought into this and you surely have more experience with what users expect and don't expect.

In fact, now that I'm thinking about it, I could imagine one situation where some "complex static analysis" could lead to pitfalls like the one below (maybe you had something similar in mind?):

If task A depends on task B and B is changed by the user, naive static analysis would likely invalidate the cache for A, too. However, it's probably desirable for the cache for task A to not get invalidated automatically – at least as long as B still returns the same value. Conceptionally, the return value of B should just be another input parameter to A whose hash gets incorporated into the cache key.

Anyway, the idea to use a custom decorator is very neat, as this will give me the option to experiment a bit. I'll report back!

mattrasmus commented 2 months ago

If your interested in understanding more the subtlety of caching for tasks that call other tasks, check out what we discuss in the implementation notes related to single and ultimate reductions: https://insitro.github.io/redun/implementation/graph_reduction_caching.html

I'm also interested to hear how the build system goes! I've always imagined redun might help in such a case.