mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0
172 stars 51 forks source link

SUGGESTION: Support exporting all types of variables names (not just basic ones) #88

Closed HenrikBengtsson closed 7 years ago

HenrikBengtsson commented 7 years ago

Background

It's not possible to export objects with other than "basic" names (= "according to R's variable naming rules"). For example,

library("batchtools")
reg <- makeRegistry(NA)
objs <- list(a = 1, `b` = 2, `c_{i}` = 3)
str(objs)
batchExport(export = objs, reg = reg)

gives

Error in source("export-non-standard-names.R", echo = TRUE) : 
  Assertion on 'export' failed: Vector must be named according to R's variable naming rules.

Enter a frame number, or 0 to exit   

1: batchExport(export = objs, reg = reg)
2: assertList(export, names = "strict")
3: makeAssertion(x, res, .var.name, add)
4: mstop("Assertion on '%s' failed: %s.", var.name, res)

User's can somewhat workaround this by exporting as batchExport(export = list(my_exports = objs)) and then add something like mapply(names(exports), exports, FUN = assign, envir = .GlobalEnv) but that's tedious, error prone, and not very efficient.

Simple suggestion: Reference / define "R's variable naming rules"

It's probably not fully clear to the user what "R's variable naming rules" is. I quickly tried to find a formal reference, but "all" I found was from ?make.names:

"A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words."

If that's your definition, I think it would be useful to reference that in from ?checkmate::assertList but also from ?batchtools::batchExport.

I think it would also be helpful to give some examples of valid and non-valid names in ?batchtools::batchExport. This would lower the risk for run-time surprises.

Actual suggestion: Why not support all variable names?

One might want to export objects with "non-standard variable names", e.g. locally defined infix or operator functions, e.g. `%my_op%` or c_{i}. Although these are unusual, they are certainly not unlikely.

As far as I understand it, the main reason for the current naming restriction is that the name of the exported object is restricted to what can be represented in a filename (roughly POSIX minus some symbols). This comes from loadRegistryDependencies():

  path = file.path(x$file.dir, "exports")
  fns = list.files(path, pattern = "\\.rds$")
  if (length(fns) > 0L) {
    ee = .GlobalEnv
    Map(function(name, fn) {
      assign(x = name, value = readRDS(fn), envir = ee)
    }, name = basename(stri_sub(fns, to = -5L)), fn = file.path(path, fns))
  }

However, if the name would be encoded in the actual filename, but instead be stored in the saved object, then this wouldn't be a limitation, e.g.

  path = file.path(x$file.dir, "exports")
  pns = list.files(path, pattern = "\\.rds$", all.files = TRUE, full.names = TRUE)
  if (length(pns) > 0L) {
    ee = .GlobalEnv
    for (pn in pns) {
      object <- readRDS(pn)
      assign(x = object$name, value = object$value, envir = ee)
      object <- NULL ## Fastest version of rm(list = "object")
    }
  }

I understand the history / legacy / convenience of having file names reflect the variable name, but is it really needed. Could you use a unique hash code for filename instead (as it looks like you're doing in the logs(?) directory. Alternatively, they could be incremental indices, or do something like base::make.names(names, unique = TRUE) allowing for some information in the filenames.

See also

This looks identical to what's going on in BatchJobs, cf. https://github.com/tudo-r/BatchJobs/issues/93

PS. Much of my feedback is now coming in as I've started to implement the future.batchtools wrapper and running initial tests on batchtools via that framework. Please, treat my issues / feedback / suggestions as what they are and feel free to discuss and say you disagree. You're actions / reply should not be the same regardless of me being a JOSS reviewer or not. I'm pretty sure you're doing just that, but I wanted to make it explicit it does not affect the JOSS review.

mllg commented 7 years ago

You are right, I am too restrictive regarding variable names. I always forget about the infix operators.

The reasoning for not storing the variable inside the RDS is that you have to read all files just to get the names, i.e. each time you call batchExport.

make.names() is not bijective, I would need something like a base32 encoder/decoder.

I guess I'll just store the real names in the registry and use a random string for the file names.

mllg commented 7 years ago

I've changed my mind. I think I will go with base32. See here: https://github.com/mllg/batchtools/pull/89

HenrikBengtsson commented 7 years ago

This is good news.

The reasoning for not storing the variable inside the RDS is that you have to read all files just to get the names, i.e. each time you call batchExport.

I didn't consider that batchtools might need to see what's already exported.

FYI, I've used utils::URLencode() / utils::URLdecode() in future.BatchJobs as a workaround for a long time now and it's been working very well. Haven't had any issues. My point is, I think your approach will work without surprises.

mllg commented 7 years ago

base32 encoding is merged. Should now work with all object names.

HenrikBengtsson commented 7 years ago

Awesome. I've verified this with makeClustersFunctionInteractive(external = TRUE) and makeClustersFunctionTORQUE().