scientific-python / spatch


Spatch requirements for reuse in other libraries #1

Open vyasr opened 3 months ago

vyasr commented 3 months ago

I was discussing with @jookuma, @rlratzel, and @lagru how the NetworkX dispatching could be generalized to support reuse inside scikit-image for dispatching to cucim. @jookuma put together a prototype in scikit-image/scikit-image#7466. Some of the thoughts that came up in this discussion (some of this functionality already seems to be supported, some of it is new; just collecting all of it here for completeness):

lagru commented 3 months ago

To add a few other aspects that we discussed:

stefanv commented 3 months ago

Some general principles I'd personally like to see implemented:

betatim commented 3 months ago

There has to be a way to make a backend selection decision that takes into account more than just the types of the arguments. For example, when passing a large image as a numpy array it can be worth transferring it to the GPU, doing the work there, and transferring the result back. Whether or not the conversion is worth it may depend on the particular operation/function/algorithm, though.

Similarly, it should be possible to pass "small arguments" as Python lists, instead of having to make them arrays of the same type as "the thing" you want to operate on. This is mostly a UX/convenience thing for interactive use, so you need a way to pick the backend while ignoring those arguments.

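Roughly, a size-aware selection hook could look like the sketch below; the `can_handle` name, the threshold, and the treatment of small list arguments are all made up for illustration, not an existing API:

```python
import numpy as np

# Illustrative threshold; a real backend would tune or measure this.
MIN_PIXELS_FOR_GPU = 1_000_000


class HypotheticalGPUBackend:
    def can_handle(self, func_name, *args, **kwargs):
        """Decide whether dispatching to the GPU is likely to pay off."""
        arrays = [a for a in args if isinstance(a, np.ndarray)]
        if not arrays:
            return False
        # Small "convenience" arguments passed as plain Python lists are simply
        # ignored when picking the backend; only the large array matters.
        return max(a.size for a in arrays) >= MIN_PIXELS_FOR_GPU
```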
For something like scikit-learn you need to be able to do dispatching for stateful classes. The call to fit might select the backend, and then other methods (that are dispatched) need to use the backend selected during fit.

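A minimal sketch of what pinning the backend at `fit` time could look like; `can_handle`, `DefaultBackend`, and the method layout are hypothetical, not scikit-learn API:

```python
class DefaultBackend:
    """Stands in for the implementation that ships with the library itself."""

    def fit(self, X, y=None):
        return {"fitted": True}  # placeholder state

    def predict(self, state, X):
        return X  # placeholder result


class DispatchedEstimator:
    def __init__(self, backends):
        self._backends = list(backends)  # candidate backends, in preference order
        self._backend = None             # pinned during fit()

    def fit(self, X, y=None):
        # Select a backend once, based on the training data.
        self._backend = next(
            (b for b in self._backends if b.can_handle("fit", X, y)),
            DefaultBackend(),
        )
        self._state = self._backend.fit(X, y)
        return self

    def predict(self, X):
        # No re-selection: reuse whatever backend fit() picked.
        return self._backend.predict(self._state, X)
```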
Another interesting question is how to make sure that input validation is the same for different backends: if you pass some invalid combination of arguments (or arguments plus constructor arguments, in the case of a class), you should get the same exception regardless of backend. Having to duplicate the validation code is probably not great, so we need some ideas on how to do this.

You can probably take this one step further and say that no matter what the backend, the possible exceptions and when they are raised should always be the same. Enforce this via testing?!

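One way to avoid duplicating that code is to keep validation in the dispatching layer, so every backend sees the same checks and raises the same exceptions; the stronger cross-backend exception guarantee could then be enforced with a parametrized test that runs the same invalid inputs against every backend. A sketch with made-up names (`_validate_args`, `_select_backend`, and the wrapped `threshold_otsu` signature are illustrative):

```python
def _validate_args(image, nbins):
    # Backend-independent checks live in one place.
    if nbins <= 0:
        raise ValueError(f"nbins must be positive, got {nbins}")
    if getattr(image, "ndim", 0) < 1:
        raise ValueError("expected an array with at least one dimension")


def _select_backend(image, backend):
    """Hypothetical selection step; details are out of scope for this sketch."""
    raise NotImplementedError


def threshold_otsu(image, nbins=256, *, backend=None):
    _validate_args(image, nbins)           # same exceptions for every backend
    impl = _select_backend(image, backend)
    return impl.threshold_otsu(image, nbins=nbins)  # backend only does the math
```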
vyasr commented 3 months ago

> I'd prefer for this "logging" to be on by default, with the option to suppress it (via env variable e.g.).

I agree with logging, but I think I'd prefer off by default. In my experience users typically ignore loud outputs unless they're requested.

> A snippet of code, executed on different machines / in different environments, should, within practical limits, yield the same numerical results

Just want to note that due to floating-point non-associativity, GPU backends are unlikely to ever produce numerically identical results, but they should agree within some tolerance. More generally, there may be APIs in the frontend library that have historically provided results with some implicit property (e.g. consistent/stable ordering) that was never actually promised as part of the API but that users have come to rely on. Dispatching will potentially force frontends to become more explicit about what exactly is being promised about results.

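In practice that usually means cross-backend tests compare against a reference result with a tolerance rather than exact equality, e.g. (the tolerance here is only illustrative):

```python
import numpy as np

# A different summation order (as a GPU reduction might use) can change the
# low-order bits; compare with a tolerance instead of exact equality.
values = np.float32([0.1, 0.2, 0.3])
reference = (values[0] + values[1]) + values[2]
reordered = values[0] + (values[1] + values[2])
np.testing.assert_allclose(reordered, reference, rtol=1e-6)
```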
I think most of what @betatim suggested sounds good and is consistent with the other proposals discussed above.

seberg commented 3 months ago

> There has to be a way to make a backend selection decision that takes into account more than just the types of the arguments.

To some degree, I think this is the elephant in the room, together with the point of reproducibility. For now, I would be happy to defer the decision of how this happens to scikit-learn; it could be any or all of:

All of the above make sense. What currently bothers me most is that there are two possible "cucim" backend meanings:

But when you think of `function(numpy_array, backend="cucim")`, both would behave differently. The first might error or return a cupy array, while the second would still return a NumPy array. I am not sure how to resolve this. Maybe a single backend can understand both, or we split it into a cucim and a cucim[cupy] backend with some layer to make that convenient.

> I agree with logging, but I think I'd prefer off by default.

I see logging as a very solvable issue. For example, we could keep counters about which backend was called for which function, so that the issue template can say: run your code and then give us the information from `skimage.backends.gather_callstats()`.

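A minimal sketch of such counters; `gather_callstats` is the hypothetical name floated above, not an existing scikit-image function:

```python
from collections import Counter

# Incremented by the dispatching layer every time a backend implementation runs.
_call_counts = Counter()


def _record_dispatch(func_name, backend_name):
    _call_counts[(func_name, backend_name)] += 1


def gather_callstats():
    """Return something like {('gaussian', 'cucim'): 12, ('gaussian', 'numpy'): 3}."""
    return dict(_call_counts)
```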
Verbose print-out on dispatching by default seems strange to me. But I think it can make sense in a scenario where `pip install cucim` or `pip install cucim[autouse]` activates the cucim (for numpy array) backend automatically (last bullet above). The user might never realize what is going on there, so printing out:

> Auto-activating "cucim" backend for scikit-image; to hide this warning set env variable XY or `import cucim.skimage_backend` before `skimage`.

could make sense to me and should be the choice of scikit-image (even if it is probably hard to enforce that choice).

> Separation: Preferably, there will be no code in the library that deals with specific details/workarounds for any given backend.

The only reason I can see for this would be to help users by having a stub backend that says: "Oh, you passed a cupy array, maybe you want to install a backend for that?" But you can probably hook that into the failure of the array-conversion step if wanted.

betatim commented 3 months ago

> For now, I would be happy to defer the decision of how this happens to scikit-learn; it could be any or all of:

What do you think of an additional option of "if cucim is installed it is used as a backend"? Maybe at first the user would have to activate it (or dispatching in general) via something like `skimage.set_config(dispatching="cucim")`. Or is that covered by one of your options?

In scikit-learn the current thinking is that users will likely install only one additional backend, though possibly more than one. For now users have to opt in to dispatching. As part of opting in they can specify their preference for the order in which backends are tried. Each backend is asked "do you want to handle this?". The first one to say "yes" gets it. If none say yes, the default backend handles it (aka the code that is currently in scikit-learn).

This way a backend can say yes/no to work based on the input type, input size and algorithm hyper-parameters.

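That loop might look roughly like the sketch below; `can_handle`, `call`, and the explicit backend list are placeholders for whatever the opt-in mechanism ends up looking like:

```python
def dispatch(func_name, args, kwargs, backends, default_impl):
    """Ask each backend, in the user's preferred order, whether it wants the call."""
    for backend in backends:
        if backend.can_handle(func_name, *args, **kwargs):
            return backend.call(func_name, *args, **kwargs)
    # Nobody volunteered: run the implementation that currently ships with the library.
    return default_impl(*args, **kwargs)
```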
An open question is how to determine the order in which backends are tried in the case where the user opts in to dispatching without specifying one or in the future when it is on by default.

> what currently bothers me most is that there are two possible "cucim" backend meanings:

I think it would be confusing for a user if there were two cucim backends. Why do you need that? One backend could handle both cases you described, no? But it would have to keep track of the input type so that it can do the "back to numpy" conversion if needed.

grlee77 commented 3 months ago

> But when you think of `function(numpy_array, backend="cucim")`, both would behave differently. The first might error or return a cupy array, while the second would still return a NumPy array.

cuCIM follows CuPy's policy of not automatically casting from NumPy arrays. An error will be raised on NumPy inputs to cuCIM functions.

seberg commented 3 months ago

"if cucim is installed it is used as a backend" [...] Or is that covered by one of your options?

That was supposed to be the last point, but it was not too clear :). I would see it as scikit-image's job to set guidelines, and I think it could go either way based on how much confidence we have in the mechanism (and confidence can grow over time).

Note that I think these worries are unnecessary when we are dealing with, e.g., CuPy inputs, which scikit-image currently doesn't support (or at most in niche places). I do think such a backend should just auto-activate always (via entry points).

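Auto-activation via entry points could be as simple as the sketch below; the group name `"skimage_backends"` is made up for illustration, and selecting `entry_points` by group needs Python 3.10+ (or the `importlib_metadata` backport):

```python
from importlib.metadata import entry_points


def load_backends():
    """Discover and load every backend that installed packages have registered."""
    backends = []
    for ep in entry_points(group="skimage_backends"):  # hypothetical group name
        backends.append(ep.load())  # the entry point exposes the backend object
    return backends
```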
> This way a backend can say yes/no to work based on the input type, input size and algorithm hyper-parameters.

In an ideal world (but happy to not aim too high if it starts getting in the way), I might make it a two-stage process (a rough sketch follows after the list):

  1. Just use types (and maybe hyper params?). Types could be hardcoded into the backend system (by listing types you want to/can handle) or just a "match/can-do" function. That has some potential advantages:
    • You can cache which backend is used, speeding up later dispatching.
    • If you know the types, you ensure that another backend won't think it should handle your type, no matter the "order" of backends.
      • This may just make it easier to not be sloppy about type matching.
      • When you think about type-dispatching, type information could be used to infer a backend order automatically. I am not sure how important it is, since it might be easier to just hardcode such order (i.e. my "cucim was slow for XY" backend knows it wants to have higher priority than "cucim" if it ever comes to that).
  2. A backend should be allowed to defer even if it matches in the first step:
    • Allow deferring for small problem sizes
    • Allow deferring when an implementation is incomplete.
    • (A recursive approach might work here as well, but deferring seems pretty nice to me and probably even more straight forward.)

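A sketch of what that two-stage protocol could look like on the backend side; `matches_types`, `wants`, the `implemented` set, and the size cutoff are all hypothetical:

```python
import numpy as np


class ExampleBackend:
    # Stage 1: declared types make matching unambiguous (regardless of backend
    # order) and let the dispatcher cache its decision per type combination.
    handled_types = (np.ndarray,)          # a GPU backend would list cupy.ndarray

    implemented = {"gaussian", "sobel"}    # illustrative subset of functions

    def matches_types(self, *args):
        return any(isinstance(a, self.handled_types) for a in args)

    # Stage 2: a backend that matches on types may still defer.
    def wants(self, func_name, *args, **kwargs):
        if func_name not in self.implemented:
            return False                   # incomplete implementation -> defer
        sizes = [a.size for a in args if isinstance(a, self.handled_types)]
        if sizes and max(sizes) < 10_000:
            return False                   # too small to be worth dispatching
        return True
```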
I don't think starting with a type-based approach makes implementation significantly harder. The question below remains identical.

> cuCIM follows CuPy's policy of not automatically casting from NumPy arrays. An error will be raised on NumPy inputs to cuCIM functions.

> I think it would be confusing for a user if there were two cucim backends. Why do you need that? One backend could handle both cases you described, no? But it would have to keep track of the input type so that it can do the "back to numpy" conversion if needed.

A backend can keep track of the input type, i.e. I could write a cucim backend that returns NumPy arrays for NumPy array inputs. We may even help cucim a bit to do that with the backend system.

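A sketch of such a "convert in, convert back out" wrapper, using CuPy's real `cupy.asarray`/`cupy.asnumpy` conversions around a hypothetical `_cucim_call` that does the actual GPU work:

```python
import numpy as np
import cupy as cp


def numpy_roundtrip_call(_cucim_call, image, **kwargs):
    """Run a cuCIM-style function but hand NumPy back for NumPy inputs."""
    came_from_numpy = isinstance(image, np.ndarray)
    result = _cucim_call(cp.asarray(image), **kwargs)  # move to GPU, compute there
    if came_from_numpy:
        return cp.asnumpy(result)  # transfer back so the caller sees NumPy again
    return result
```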
So it is not much of a problem to have a cuCIM backend (and maybe that is enough):

But if you look at cuGraph, for example, `func(nx_graph, backend="cugraph")` will return a cugraph object. (This is less problematic in NetworkX, because they duck-type pretty well I think, and not a lot of functions return graphs.)

But the point is, you can't do both without some user choice, and that choice could look very different. If you go a bit "all-in" on type dispatching, you could even do:

```
func(in_type, backend="cucim") -> in_type
func.invoke(out_type)(any_type, backend="any backend") -> out_type or error
```

using a separate mechanism in case you ever care about types rather than backends (the latter is what some type-dispatching libraries do, IIRC).