shikokuchuo / mirai

mirai - Minimalist Async Evaluation Framework for R
https://shikokuchuo.net/mirai/
GNU General Public License v3.0
193 stars 10 forks source link

globals: How to use a function with a free variable that lives in the global environment? #45

Closed HenrikBengtsson closed 1 year ago

HenrikBengtsson commented 1 year ago

Consider:

f <- function(x) a + x
a <- 1

This can be called as:

> f(2)
[1] 3

How can this be called via mirai()? The following attempt doesn't work:

> library(mirai)
> f <- function(x) a + x
> a <- 42
> m <- mirai(f(3), f = f, a = a)
> m$data
'miraiError' chr Error in f(3): object 'a' not found

Session info

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /home/henrik/shared/software/CBI/R-4.2.3-gcc11/lib/R/lib/libRblas.so
LAPACK: /home/henrik/shared/software/CBI/R-4.2.3-gcc11/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mirai_0.8.1.9005

loaded via a namespace (and not attached):
[1] compiler_4.2.3 nanonext_0.8.1
shikokuchuo commented 1 year ago

Henrik, good to see you here!

Very interesting example. This of course concerns R's lexical scoping rules.

The issue here is that the function f is not self-defined but relies on R to find a either in a parent environment or the .Globalenv. In this case, as f lives in the global environment, that is the enclosing environment - it can only find a in .Globalenv.

mirai currently constructs an environment consisting of '.expr' and the passed variables, sends this off and evaluates the expression in that environment, not the Global Environment of the remote process. And this is what causes the subtlety.

If I were to assign the variables to the Global Environment and then evaluate the expression in that context, then your example above works 'as expected'. I say this advisedly as you have told me what you wanted. But in other cases it could well be a bug in the code. It would be great to hear your thoughts on this.

It is a bit of an oddity that evaluating in the context of the global environment is different to evaluating in a 'clean' environment, which is all I guarantee at the moment (as documented).

But I guess the expectation should be that your example works? If so, it would be a simple change. But the fact that it hasn't come up yet suggests such examples are rare.

wlandau commented 1 year ago

For what it's worth, the current policy in crew is to take manual control using crew_eval(), which accepts both local and global objects, temporarily sets global state, and evaluates the expression in a new temporary environment which inherits from .GlobalEnv. The return value is a monad with the result and metadata. controller$push(command = f(1), data = list(f = f, a = a)) hits the same issue for the same reason, but controller$push(command = f(1), data = list(f = f), globals = list(a = a)) works.

shikokuchuo commented 1 year ago

Thanks! I haven't thought about the problem long enough to know what the correct default should be.

The current implementation obviously breaks the code above, but that is arguably 'bad' code. Does evaluating everything in the Global Environment just work, or could it introduce other subtleties?

It is just as easy in terms of implementation - I simply send across the arglist, and do a list2env(envir = .GLobalenv) in the server process rather than beforehand.

wlandau commented 1 year ago

It is just as easy in terms of implementation - I simply send across the arglist, and do a list2env(envir = .GLobalenv) in the server process rather than beforehand.

Not to make a judgement call here, but if I remember correctly, this is what clustermq currently does.

shikokuchuo commented 1 year ago

For what it's worth, the current policy in crew is to take manual control using crew_eval(), which accepts both local and global objects, temporarily sets global state, and evaluates the expression in a new temporary environment which inherits from .GlobalEnv.

I see in your solution you make a distinction between data and globals. This would seem to allow for example specifying a as argument to a function g(a) and a different a as a global which would be captured by a function such as f. Is this intentional?

Because if I put everything in the global environment then you would only have one a.

wlandau commented 1 year ago

I see in your solution you make a distinction between data and globals. This would seem to allow for example specifying a as argument to a function g(a) and a different a as a global which would be captured by a function such as f. Is this intentional?

Yes, this is technically allowed. But as you said regarding the original example, maybe that would be considered sloppy code. I would hope the scoping would choose the correct a: the local one for the function argument and the global one for the global variable hard-coded in the function body.

shikokuchuo commented 1 year ago

From mirai() docs:

The expression '.expr' will be evaluated in a separate R process in a clean environment consisting only of the named objects passed as '...' and/or the list supplied to '.args'.

The precise nature of the problem is that f is a closure, not just any object. Passing in a closure f, which is enclosed by the Global Environment and not specifying the variable there.

If you want to pass in f then you need to change its closure:

m <- mirai({environment(f) <- environment(); f(3)}, f = f, a = a)
m$data
[1] 45

or else assign the 'free' variable to its closure.

m <- mirai({environment(f)[["a"]] <- a; f(3)}, f = f, a = a)
m$data
[1] 45

The 'solution' that makes everything all work by assigning to .GlobalEnv just masks the 'sloppy' code. I'm glad I was hesitant there.

shikokuchuo commented 1 year ago

@HenrikBengtsson in summary, if an interface for specifying globals is required then I can suggest the method used by crew (essentially an inner evaluation wrapper).

HenrikBengtsson commented 1 year ago

mirai currently constructs an environment consisting of '.expr' and the passed variables, sends this off and evaluates the expression in that environment, not the Global Environment of the remote process. And this is what causes the subtlety.

Thanks for these details. This was what I suspected and the answer I was after (wanted to be confirmed).

But I guess the expectation should be that your example works? If so, it would be a simple change. But the fact that it hasn't come up yet suggests such examples are rare.

I'd say, in general, it depends on what one wish the API contract should be. It's perfectly fine to hand off the task to support this to whomever uses the API, as long as there is a solution. In your case, doing what @wlandau's suggestion works fine.

The background are common use cases that stem from code that does things like:

X <- some_vector(5)
y <- lapply(1:5, FUN = function(kk) sqrt(X[kk]))

The best-practices approach to the above is:

X <- some_vector(5)
y <- lapply(1:5, FUN = function(kk, x) sqrt(x[kk]), X)

but, the first form is still very common. From my perspective in the Futureverse, where I want to support simple future alternatives to the above, I'm always assuming there could be globals (here X) passed that needs to be available on the parallel workers. Because of this, the Future API has a built-in concept of "globals" and an explicit argument for it.

shikokuchuo commented 1 year ago

Yes, I agree both forms above would indeed be common. And both would work in a mirai, if X is passed as an argument so that it is available in the evaluation environment.

Your initial case consisted of passing in a function - which is in the Global Environment - which also relies on a value being in the Global Environment, and not passing that value to the Global Environment where it is required to be.

I would say that such functions mostly exist in packages, or within functions, where the 'free' variable is defined in a parent environment. I still think it is rare for a variable to need to be explicitly in the Global Environment.

I can appreciate that the 'globals' concept can be user-friendly. But I tend to err on the side of making things as general as possible. One of the consequences seems to be that it is incredibly flexible to overlay a specific interface if desired as @wlandau has done. Similarly it is possible to create an interface for future's 'globals' (outside of the package).

If on the other hand there are other reasons for 'globals' such that having an interface is handy, I would be very interested to know.

HenrikBengtsson commented 1 year ago

All this sounds good to me.

Just sharing more of my thoughts on this: I wouldn't mind if R tightened the gap and would not search the global environment in these examples and still give an error. That would make things more robust and make it easier to compute on the code. A eval_without_globalenv() that handles all scenarios would be a dream. But I don't think such a change to R would happen any time soon.

FWIW, this issue is closely related to the discussion around .args start with https://github.com/shikokuchuo/mirai/issues/49#issuecomment-1497815283.

shikokuchuo commented 1 year ago

A personal favourite would be something that was suggested in the R devel mailing list - an assignment operator that assigns to the parent environment i.e. precisely one level up, so we don't get mis-assignments to the global environment.

The equivalent of:

Rf_defineVar(sym, SEXP, ENCLOS(R_GetCurrentEnv()));

Except that the above doesn't actually work as R_GetCurrentEnv() is currently broken, and doesn't look like it is getting any attention either, despite having raised it on R-devel and Lionel Henry having posted a Bugzilla fix for it.