Namespace Specific Reproducibility Tools

MilesMcBain commented 7 years ago

There's been a bit of chatter recently about function masking, something especially difficult for new users or users transitioning between, say, older packages and the tidyverse. This is most certainly a reproducibility issue, since name collisions can lead to insidious bugs that are difficult to detect.

See: https://twitter.com/hadleywickham/status/843655581667803136 https://twitter.com/hrbrmstr/status/843642674259382274

One direction toward a solution is Python-style modules, which apparently is basically solved. But there are many things that make me uncertain about this as a way forward (happy to discuss).

From a reproducibility perspective it would be useful to be able to make testable assertions about the namespaces present when code is being executed. I like @hrbrmstr's suggestion of a single function à la needs(pack1, pack2, pack3), although I might just call it namespace or something so it sounds more declarative. Given a namespace assertion, there is the potential to create a tool that grants users various reproducibility powers:

Ensuring the final execution environment contains only declared package namespaces, in the declared masking order.
Detecting potentially masked function calls and returning instances of those, with line numbers etc.
Detecting library()/loadNamespace() calls and prompting to remove.
Detecting conflicting namespace assertions in sourced files.
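
To make the first item concrete, here is a minimal sketch of what a namespace assertion could check, using only base R. The function name check_attached, its arguments, and the default baseline of packages are all hypothetical; nothing like this exists in rrtools or import as far as this thread shows. It assumes the declaration is written in search-path order, so the first declared package is the one that masks the others.

```r
# Hypothetical helper, not part of any existing package.
# Compares the packages attached on the search path (beyond a baseline)
# against a declared set, including the order in which they mask each other.
check_attached <- function(declared,
                           baseline = c("stats", "graphics", "grDevices",
                                        "utils", "datasets", "methods", "base"),
                           attached = search()) {
  # Keep only "package:xyz" entries and strip the prefix.
  pkgs <- sub("^package:", "", grep("^package:", attached, value = TRUE))
  # Packages beyond the baseline, still in search-path order.
  extra <- pkgs[!pkgs %in% baseline]
  list(
    ok         = identical(extra, declared),
    undeclared = extra[!extra %in% declared],
    missing    = declared[!declared %in% extra],
    # Do the packages present in both appear in the same relative order?
    order_ok   = identical(extra[extra %in% declared],
                           declared[declared %in% extra])
  )
}

# Check the live session against a declaration; the result depends on
# what happens to be attached when this runs.
check_attached(c("dplyr"))
```

A real tool would probably stop with an error instead of returning a list, but returning the pieces makes it easier to see which part of the assertion failed.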

There are probably many more ways we can help that I haven't thought of.

This idea seems to fit under the umbrella of the rrtools package proposed in #5.

karthik commented 7 years ago

@MilesMcBain Have you seen the work by Stefan Bache on this? It's less comprehensive than what you are proposing but it is a start. import does something similar to the Python style.

import::from(parallel, makeCluster, parLapply)

I see code where base R's filter (stats::filter) masks the tidyverse's filter (dplyr::filter), and in those cases I suggest:

import::from(dplyr, filter)

MilesMcBain commented 7 years ago

I haven't until now. Thanks, this is great. I like the way it is designed to be used without being imported itself!

So if we could generate a list of conflicts this could be a simple approach to resolving them.
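
As a rough sketch of how such a list of conflicts might be generated: base R already ships conflicts(detail = TRUE), which reports names that appear more than once on the search path. The helper below, masking_table, is a hypothetical name and only illustrates the idea of mapping each clashing name to the packages that provide it, listed in lookup order so the first entry is the one that wins.

```r
# Hypothetical helper: given a named list (search-path entry -> object names),
# ordered as on the search path, return each name that occurs more than once
# together with the entries that define it.
masking_table <- function(exports) {
  all_names <- unlist(exports, use.names = FALSE)
  dup <- unique(all_names[duplicated(all_names)])
  owners <- lapply(dup, function(nm) {
    names(exports)[vapply(exports, function(x) nm %in% x, logical(1))]
  })
  names(owners) <- dup
  owners
}

# Build the input from the packages attached in the current session.
attached <- grep("^package:", search(), value = TRUE)
exports <- lapply(attached, function(e) ls(as.environment(e)))
names(exports) <- attached

masked <- masking_table(exports)
# If dplyr were attached, masked[["filter"]] would list package:dplyr and
# package:stats; each such name is a candidate for an explicit import::from().
```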

karthik commented 7 years ago

Tagging @smbache in case he's interested in weighing in.

sfirke commented 7 years ago

Thought of this thread when I got this message today from a colleague:

I was having problems using filter() and apparently it's because R thought I was referring to a different filter (not the one from dplyr). I fixed it by writing "dplyr::filter", instead of just "filter". Does anyone know what this could be due to? Is it about the order in which I include libraries? How can I go back to only writing "filter"?

My dream solution for this would be an add-in baked into an IDE where, if there are two packages loaded with a function filter() and you type filter(), it gets highlighted with a note saying, "hey, this is ambiguous; specify which package you want with package::". Essentially what @MilesMcBain proposes in his second bullet, but integrated into an IDE and constantly running.
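
The ambiguity check such an add-in would run can be approximated at the console with base R. The sketch below uses a stand-in list called fake_pkg (hypothetical, it plays the role of a package like dplyr that exports its own filter) so it runs without any extra packages installed. It also shows why attach order matters: whatever was attached last sits earlier on the search path and wins.

```r
# A stand-in for a package that exports its own filter(); purely illustrative.
fake_pkg <- list(filter = function(...) "filter from fake_pkg")

# attach() puts it at position 2 on the search path, ahead of package:stats,
# just as library() does with a real package.
attach(fake_pkg, name = "fake_pkg")

# utils::find() lists every search-path entry that defines the name,
# in lookup order. More than one entry means a bare filter() is ambiguous.
where_found <- find("filter")
is_ambiguous <- length(where_found) > 1

# The bare call resolves to the first entry; the qualified call is explicit.
result <- filter()
explicit <- stats::filter(1:5, rep(1, 2))

# Clean up so the stand-in does not linger on the search path.
detach("fake_pkg", character.only = TRUE)
```

Here where_found starts with "fake_pkg" and also contains "package:stats", which is the situation the colleague hit, with the roles of the two packages reversed.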

smbache commented 7 years ago

I guess it would be possible to make a utility function in import to list conflicts. But it would be nicer with an IDE solution.

But yes, the idea behind import was to avoid conflicts in the first place, by being explicit about which functions are imported from which package (this helps both future readers of the code and yourself by avoiding these kinds of errors, which can even arise after the scripts are written, as packages may later export more names). In general I avoid using library in scripts and Depends in packages.
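
For anyone wanting to move a script toward that style, a first step might be finding where it still attaches packages. This is a minimal sketch using base R's parser; find_attach_calls is a hypothetical name, and including require alongside library and loadNamespace is my own assumption about which calls are worth flagging.

```r
# Hypothetical helper: report the lines of R code that call library(),
# require(), or loadNamespace(), based on the parsed tokens rather than
# a text search, so mentions in comments or strings are not reported.
find_attach_calls <- function(code,
                              fns = c("library", "require", "loadNamespace")) {
  # keep.source = TRUE is needed for getParseData() to have token positions.
  exprs <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  is_hit <- pd$token == "SYMBOL_FUNCTION_CALL" & pd$text %in% fns
  hits <- data.frame(line = pd$line1[is_hit],
                     call = pd$text[is_hit],
                     stringsAsFactors = FALSE)
  hits <- hits[order(hits$line), , drop = FALSE]
  rownames(hits) <- NULL
  hits
}

# A small made-up script, one element per line.
script <- c(
  "library(dplyr)",
  "x <- stats::filter(1:10, rep(1/3, 3))",
  "if (require(MASS)) message('loaded')"
)
found <- find_attach_calls(script)
```

Each reported line could then be replaced by an import::from() call naming only the functions the script actually uses.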

MilesMcBain commented 7 years ago

@sfirke I can see your solution working nicely in RStudio.

I guess my own vision was of something that interferes a bit more. My experience has been a lot of kicking off long running model fits or simulations. If something like this slips through it can be a real waste of time, at best. I'll have to try the workflow that @smbache is proposing, that may actually be a nicer way to get some assurance about what is in the namespace (nothing but that base!).

MilesMcBain commented 7 years ago

I'm hesitant to push this further before I have had a chance to try @smbache's approach IRL. Closing for now.

@sfirke's addin is a great idea, but we don't currently have the ability to create the types of "First Class" addins or plugins for RStudio that would make the user experience really good. This is something I plan to get in a few people's ears about at the unconf, so look out RStudio crew 😛 .

sfirke commented 7 years ago

Hey hey: this issue bit me just today while working on an unconf project. I didn't put dplyr::filter and spent a while not understanding an error message - thought the cause was something entirely unrelated and went down a rabbit hole - before this StackOverflow answer alerted me that I was probably inadvertently calling stats::filter.

😡

ropensci / unconf17

Namespace Specific Reproducibility Tools #22