klmr / box

Write reusable, composable and modular R code
https://klmr.me/box/
MIT License
829 stars 47 forks source link

Search for modules after packages has been exhausted #307

Open jtlandis opened 1 year ago

jtlandis commented 1 year ago

Please describe your feature request

Lets assume you have set the systems R_BOX_PATH environment variable to something like /opt/box. It would be convenient to take advantage of the __init__.r usage pattern with modules within /opt/box.

For example, suppose the following directory structure in /opt/box

foo/__init__.r

#' @export
box::use(./mod1[...], ./mod2[...])

foo/mod1.r

#' @export
bar <- 1L

foo/mod2.r

#' @export
baz <- 2L

Then, the expected result may look something like this:

box::use(foo)

foo$bar # 1L
foo$baz # 2L

Instead, we currently get an error dictating that there is no package named foo. Thus back to my point, to there being some convenience if box would then start searching the paths.

I am aware that if R_BOX_PATH had /opt set, then the following usage may work

box::use(box/foo)

But in my case, there may be other things within /opt that I do not want the box package to walk over.

klmr commented 1 year ago

This is a thoughtful feature request but it is at odds with the current, intentional design. There are two classes of problems with this design, which I’ll explain in the following. But first, here’s how globally installed (as opposed to project-local) modules are intended to be used:

Rather than have your module foo stored directly in the module search path, nest it inside a folder that represents an organisation, product or user name — exactly like a GitHub project. That is, you could use the path /opt/box/jtlandis/foo and you would import it via box::use(jtlandis/foo). This is commonly known as “namespaced packages” or “scoped packages”. All global modules are supposed to be scoped.


R packages are not scoped, and this has several problems. Foremost, it leads to name clashes between different sources. In the case of R, CRAN packages always take precedence. This isn’t just a theoretical problem: for years, ‘box’ used to be only hosted on GitHub, and it had to be renamed three times (‘modules’ → ‘mod’ → ‘pod’ → ‘box’) because successive packages on CRAN “stole” its name. There are other, arguably more serious problems (e.g. it facilitates security vulnerabilities via typosquatting of module names).

On balance, I feel that the pros of requiring scoped module names outweigh the cons. But I am open to counter-arguments. (For a broader perspective, the issue of scoped vs. flat module names is a extremely widely debated issue across module systems in many different languages — e.g. here — and while the debate is far from settled most people seem to agree that having scope would generally be better than not having them, and that retrofitting them into a flat system is problematic.)


The other issue with your proposal is that it unfortunately introduces an error-prone API. Once again the culprit is name clashes: consider the case where, in the future, you install an R package ‘bar’ on your computer. Installing ‘bar’ automatically installs its dependencies, and one of the dependencies is called … ‘foo’.

And suddenly your code breaks, without any changes to it. Simply because (indirectly) installing the package ‘foo’ has changed what code gets loaded by box::use(foo). Even worse, if you are really unlucky the code will seem to continue to work, because both the module and the package might define a function of the same name that is called in your code, but which does something slightly different. The code still runs, but it produces wrong results.

This kind of silent errors is something that ‘box’ aims to avoid at any cost. In other words, if there will ever be the possibility of using flat module names, they won’t be usable simply via box::use(modname); there needs to be some syntactic distinction from package names.

jtlandis commented 1 year ago

Would you be open to expanding the box syntax to indicate a global module vs a package name? the formula operator ~ is currently unused and may be helpful in this regard that we are searching the global search paths (and not a package).

Maybe something like:

box::use(~foo, foo/mod1)

mod1$bar
# 1
foo$bar
# 2

I understand the intention behind wanting some projects namespaced, I just enjoy being able to use the __init__.r script and it feels odd to me that we cannot use these on scoped projects.

I'll leave the final decision in your hands, but if you like the idea, I can revise the pull request to try and implement this feature.

klmr commented 1 year ago

I just enjoy being able to use the init.r script and it feels odd to me that we cannot use these on scoped projects.

I’m not sure what you mean by that: you can use __init__.r scripts, they’re just regular modules (where the module name is the parent folder’s name). The only requirement is that they are nested inside the a parent namespace; but the same is true for all modules, regardless whether they are implemented in a file called __init__.r or something else. Using the example from above, you could have a module implemented in the file /opt/box/jtlandis/foo/__init__.r or /opt/box/jtlandis/foo.r.


That being said, I am not dogmatically opposed to “flat” module names. Maybe I need to revisit my objection to them. As for syntax, I’m unsure what would be best. ~mod works but feels arbitrary. Ideally I would like to have a syntax that is descriptive, in that even somebody not familiar with it can make an educated guess at the meaning (or at the very least to facilitate remembering the correct syntax).

I have no definitive answer but here are a few other ideas that feel like they might make more sense (in no particular order). Apologies, unfiltered stream of consciousness ahead:

… of course this is non-exhaustive; one of the dangers of using NSE is that you can go wild with the syntax.

One more consideration: I’ve used global in the above to denote globally installed modules, in contrast to project-local modules with a relative path. Alternatively, it might more sense to use verbatim-mod to denote that this loads a module rather than a package.

… it’s fair to say that I’ve painted myself a bit into a corner, syntactically, by making box::use(pkg) load a package.

jtlandis commented 1 year ago

My logic behind using ~ is that for file paths it implies relative to the users home directory, so to me it doesn't seem like too far of a leap for it to imply "relative to global paths", but it may feel arbitrary to others.


I’m not sure what you mean by that: you can use init.r scripts, they’re just regular modules (where the module name is the parent folder’s name). The only requirement is that they are nested inside the a parent namespace; but the same is true for all modules, regardless whether they are implemented in a file called init.r or something else. Using the example from above, you could have a module implemented in the file /opt/box/jtlandis/foo/init.r or /opt/box/jtlandis/foo.r.

I understand how the __init__.r scripts work. I mean to express that the namespace itself can hold a __init__.r script but cannot be called by just the parent folder's name, hence the motivation behind me opening this issue or my last suggestion with ~. It seems that you want users to structure their global modules a certain way. I'm here to tell you that I would like the option to call a global module's __init__.r script without the need of the following syntax...

box::use(jtlandis/`__init__`)

I hope this clears things up.

If this feature request does not make it into the package I am also okay with that. I can always restructure my code.

klmr commented 1 year ago

I hope this clears things up.

It does, thanks.

I still haven't decided how to best tackle this feature request but I will add some way of using unscoped modules.

klmr commented 1 year ago

Apologies for the radio silence, it was a combination of life happening and me having issues with my computer which lead to my dev environment crashing constantly.

Anyway, please have a look at the work in progress PR #316 (branch feature/unscoped-mod-names) which implements a preliminary version of this feature.

For now (and I need to stress that this is temporary!) the syntax for loading an unscoped module global_module is as follows:

box::use(mod(global_module))

Here, mod() works as a disambiguator between module and package name. So it does not indicate that the module is unscoped/global, it indicates that what follows is, in fact, a module. The same syntax will also work with regular, scoped module names (in which case it would be redundant, however).

Before settling on a final syntax and merging this feature I would appreciate test-driving this feature.

jtlandis commented 7 months ago

I understand life - I too have been very busy. I will give this feature a go in my day to day. Cheers!