ropensci / rix

Reproducible Data Science environments for R with Nix
https://docs.ropensci.org/rix/
GNU General Public License v3.0

Feature suggestion: Inline annotations and command line #286

Open jrosell opened 2 months ago

jrosell commented 2 months ago

I tried the {rix} package today and thought of two features that could make it even better for R development.

Let me give an example here.

data-visualize.R file

library(here)
library(dplyr)
library(tidyr)
library(ggplot2)
library(palmerpenguins)
library(ggthemes)
library(R.devices)

str(penguins)
p <-
  penguins |> 
  drop_na() |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
suppressGraphics(ggsave(filename = 'penguin-plot.png', plot = p))
if (interactive()) {
  utils::browseURL(here('penguin-plot.png'))
}

rix file

if [ "$#" -eq 0 ]
then
  echo "Please provide the file to run as the first argument."
  echo "For example: bash rix \$(pwd)/data-visualize.R"
  exit 1
fi
FILE_TO_RUN="$1"
CODE_TO_RUN=$(cat "$FILE_TO_RUN")

# Note: passing the code via -e breaks if the script contains double quotes, $ or backticks.
nix-shell \
    --expr "$(Rscript -e 'rix::rix(r_ver = "4.3.3", r_pkgs = c("here", "ggplot2", "dplyr", "tidyr", "palmerpenguins", "ggthemes", "R.devices"), system_pkgs = NULL, git_pkgs = NULL, ide = "code", overwrite = TRUE, print = TRUE)')" \
    --run "Rscript -e "'"'"$CODE_TO_RUN"'"'""

The first feature is a rix command line tool. For example, one can run: bash rix $(pwd)/data-visualize.R to generate the 'penguin-plot.png' plot.

The second feature is inline script metadata for R, like Python already has.

If you look at my code for the rix file, I already set the rix R command in the nix-shell call, but I think it could be annotated in some way in the file to be run.
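
One way to picture the second feature, as a minimal sketch only: the comment block format below is hypothetical, loosely modeled on Python's inline script metadata, and is not something rix supports. The helper name extract_rix_block is also made up for illustration. It prints the lines found between a "# /// rix" marker and the closing "# ///" marker, with the comment prefix removed.

```shell
#!/usr/bin/env bash
# Sketch: pull a hypothetical inline annotation block out of an R script.
# ASSUMPTION: the "# /// rix ... # ///" block format is invented here for
# illustration; it is not an existing rix feature.
#
# Example block at the top of script.R:
#   # /// rix
#   # r_ver = "4.3.3"
#   # r_pkgs = c("here", "ggplot2")
#   # ///
extract_rix_block() {
  awk '
    /^# \/\/\/ rix[ \t]*$/ { inside = 1; next }  # opening marker
    inside && /^# \/\/\/[ \t]*$/ { exit }        # closing marker: stop reading
    inside { sub(/^# ?/, ""); print }            # strip the leading "# "
  ' "$1"
}
```

With a block like the one in the comment, extract_rix_block script.R | paste -sd, - would give a comma-separated argument list that a wrapper could splice into the rix::rix() call, instead of hard-coding it in the rix file.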

Let me know what you think.

philipp-baumann commented 2 months ago

Hey @jrosell, thanks for your ideas. I'm not yet sure: do I understand correctly that you imagine a wrapper function around rix::rix() that generates the above shell script? I think it's sufficient and easier to just write a really short R script that defines the environment:

# env.R
rix::rix(
  r_ver = "4.3.2",
  r_pkgs = "data.table",
  overwrite = TRUE,
  project_path = "./my_proj_subdir"
)

Run that env.R in your R session or via Rscript.

Then use a custom bash script, nix-rscript.sh, with nix-shebang syntax. It could be part of inst/extdata, together with a helper (to be implemented) that copies it to the current project directory. Make it executable with chmod +x.

#!/usr/bin/env nix-shell
#! nix-shell -i bash --pure default.nix
# Run the R script given as the first argument inside the Nix environment
Rscript \
  --no-site-file \
  --no-environ \
  --no-restore \
  "${1}"

And just

./nix-rscript.sh data-visualize.R

philipp-baumann commented 2 months ago

To sum up, maybe something like a littler script helper. https://nix.dev/tutorials/first-steps/reproducible-scripts.html is a nice reference.

jrosell commented 2 months ago

Well, the goal is to have rix annotations at the script level so one can run something like: rix run script.R

In Python one can use inline annotations and run: uv run script.py

philipp-baumann commented 2 months ago

Well, these are at least two different pairs of shoes. I like those inline annotations, but they need a lot on top of the nix-Rscript runner. Tooling in Nix? Nix ensures reproducibility via output hashes based on the inputs in the supplied expression. Currently, rix boilerplating assumes one fixed nixpkgs revision/git hash for specific packages, but in principle it could be extended to multiple. Quite a bit of tooling is needed before we can leverage the renv-lockfile-to-nix work (see https://github.com/b-rodrigues/rix/issues/5). Ideas and PRs are very welcome. Want to join the Nixpkgs R Matrix channel? It could be a good place to brainstorm, too.

jrosell commented 1 month ago

Here is what I have. It works on Ubuntu: https://github.com/jrosell/rix-run

b-rodrigues commented 1 month ago

That's really cool! I must admit that I didn't really understand what you meant, but now that I see it, it's really nice!

How would you like to move forward with this? Would you like to have it included in rix? We are about to submit to CRAN, so now wouldn't be the right moment to add a completely new feature. However, if you want to continue working on it, feel free, and we could merge a PR for the next release.

jrosell commented 1 month ago

I think that the rix-run script belongs in rix, but I believe it should first work well on more systems. So, we can wait.

jrosell commented 1 month ago

To keep you updated, it turns out that rix-run plays well with targets script files too. I really like the ability to have multiple target scripts in the same project.

https://github.com/jrosell/rix-run?tab=readme-ov-file#targets-single-file

jrosell commented 3 weeks ago

I thought about this idea and I think it could be taken further using {processx}, as {callr} does.

I imagine something like this for testing the same function on different R versions using nix shell processes.

bench::mark(
  rix::run(rix::rix(r_ver = "4.3.1"), my_function),
  rix::run(rix::rix(r_ver = "4.4.1"), my_function)
)

What do you think?

philipp-baumann commented 2 weeks ago

Running R functions in different Nix R environments is exactly what with_nix(), which I implemented, does. See e.g. https://github.com/ropensci/rix/blob/287e8bd5d41649247747a499e459ef33cc7c76e0/R/with_nix.R#L284-L292 We also have docs for it.

https://docs.ropensci.org/rix/articles/z-advanced-topic-running-r-or-shell-code-in-nix-from-r.html

We do it via {sys} and have some safe defaults to run code in different nix shells, with proper recursive detection of globals etc. The approach works really well and I don't think it's necessary to have duplicate functionality.

For functionality under the hood, see https://github.com/ropensci/rix/blob/main/R/with_nix_helpers.R

Cheers, Philipp

jrosell commented 2 weeks ago

Thanks, Philipp. I tested it a bit and I get some weird results with this approach. I assume it's because it doesn't make sense to benchmark with less than 10s precision with this implementation.

benchmark_dummy <- \(){
  invisible(NULL)
}
benchmark_memCompress <- \(){  
  txt <- readLines(file.path(R.home(), "COPYING"))
  for(i in 1:100) {    
    memCompress(txt, "g")
  }
  invisible(NULL)
}
results_r <- bench::mark(
  dummy = {    
    benchmark_dummy()
  },
  memCompress = {
    benchmark_memCompress()
  },
  check = FALSE,
  memory = FALSE,
  min_time = 10
)
results_r[,c("expression", "median")]
#> # A tibble: 2 × 2
#>   expression    median
#>   <bch:expr>  <bch:tm>
#> 1 dummy        250.1ns
#> 2 memCompress   43.4ms

# Configuring and initial set up of the two environments
rix::rix(r_ver = "3.6.3", project_path = "/tmp/R/3.6.3", overwrite = TRUE)
rix::with_nix(benchmark_dummy, project_path = "/tmp/R/3.6.3", program = "R")
rix::rix(r_ver = "latest", project_path = "/tmp/R/latest", overwrite = TRUE)
rix::with_nix(benchmark_dummy, project_path = "/tmp/R/latest", program = "R")

# Get the fastest time
results_dummy <- bench::mark(
  old_dummy = {    
    rix::with_nix(benchmark_dummy, project_path = "/tmp/R/3.6.3", program = "R")
  },
  new_dummy = {
    rix::with_nix(benchmark_dummy, project_path = "/tmp/R/latest", program = "R")
  },
  check = FALSE,
  memory = FALSE,
  min_time = 30
)
results_dummy[,c("expression", "median")]
#> # A tibble: 2 × 2
#>   expression   median
#>   <bch:expr> <bch:tm>
#> 1 old_dummy     5.05s
#> 2 new_dummy     7.05s

# Get the benchmark times
results_memCompress <- bench::mark(
  old_memCompress = {    
    rix::with_nix(benchmark_memCompress, project_path = "/tmp/R/3.6.3")
  },
  new_memCompress = {
    rix::with_nix(benchmark_memCompress, project_path = "/tmp/R/latest")
  },
  check = FALSE,
  memory = FALSE,
  min_time = 30
)
results_memCompress[,c("expression", "median")]
#> # A tibble: 2 × 2
#>   expression        median
#>   <bch:expr>      <bch:tm>
#> 1 old_memCompress    5.69s
#> 2 new_memCompress    8.36s

philipp-baumann commented 2 weeks ago

Yes, exactly: it doesn't make sense to benchmark this way, because there is serialization/deserialization overhead (including detecting and assigning globals recursively beforehand), plus the time to invoke nix-shell (which is known to be relatively slow as packaged in NixCpp).
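
To get a rough feel for the nix-shell part of that overhead on its own, one could time repeated invocations of a trivial command. This is only a sketch: avg_seconds is a made-up helper, and it relies on whole-second timestamps from date, so it is only meaningful when each run takes on the order of seconds.

```shell
#!/usr/bin/env bash
# Sketch: average wall-clock seconds per run of a command.
# ASSUMPTION: avg_seconds is a hypothetical helper, not part of rix or Nix.
# Usage: avg_seconds RUNS COMMAND [ARGS...]
avg_seconds() {
  local runs="$1"
  shift
  # Refuse zero or negative run counts to avoid dividing by zero.
  [ "$runs" -gt 0 ] || return 1
  local start end i=0
  start=$(date +%s)
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1   # discard output, only the elapsed time matters
    i=$((i + 1))
  done
  end=$(date +%s)
  # Total elapsed seconds divided by the number of runs, two decimals.
  awk -v t="$((end - start))" -v n="$runs" 'BEGIN { printf "%.2f\n", t / n }'
}
```

Assuming a default.nix generated by rix::rix() sits in the current directory and has already been built, something like avg_seconds 10 nix-shell default.nix --run true would approximate the startup cost that every with_nix() call pays, without any R serialization on top.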

philipp-baumann commented 2 weeks ago

On my aarch64 MacBook M2 I currently get a median time of about 2.5s (my Rocky Linux machine in my home network is currently disconnected from SSH access). I had to switch to microbenchmark::microbenchmark() because bench::mark() errored with a file unlinking problem, and I only tested the dummy in "latest" R because R 3.6.3 predates support for that architecture in nixpkgs. The 2.5 seconds I got would also match a similar benchmarking overhead between the Haskell build tool and nix-shell invocation: https://github.com/commercialhaskell/stack/issues/4406

benchmark_dummy <- \(){
  invisible(NULL)
}

benchmark_memCompress <- \(){
  txt <- readLines(file.path(R.home(), "COPYING"))
  for (i in 1:100) {    
    memCompress(txt, "g")
  }
  invisible(NULL)
}

results_r <- bench::mark(
  dummy = {    
    benchmark_dummy()
  },
  memCompress = {
    benchmark_memCompress()
  },
  check = FALSE,
  memory = FALSE,
  min_time = 10
)

r_latest_path <- file.path("latest")
r_3_6_3_path <- file.path("3.6.3")

results_r[, c("expression", "median")]

# Configuring and initial set up of the two environments
# R 3.6.3 is not available for aarch64-darwin; it will not build because at that
# time nixpkgs did not yet support the Apple Silicon architecture
# rix::rix(r_ver = "3.6.3", project_path = r_3_6_3_path, overwrite = TRUE)
# rix::nix_build(project_path = r_3_6_3_path)

rix::rix(r_ver = "latest", project_path = r_latest_path, overwrite = TRUE)
rix::nix_build(project_path = r_latest_path)

# Get the fastest time
results_dummy <- bench::mark(
  # old_dummy = {   
  #   rix::with_nix(benchmark_dummy, project_path = "/tmp/R/3.6.3", program = "R")
  # },
  new_dummy = {
    rix::with_nix(benchmark_dummy, project_path = r_latest_path, program = "R")
  },
  check = FALSE,
  memory = FALSE,
  filter_gc = FALSE,
  min_time = 10
)

benchmark_new <- microbenchmark::microbenchmark(
  new_dummy = {
    rix::with_nix(benchmark_dummy, project_path = r_latest_path, program = "R")
  },
  times = 20
)

Where I get

> benchmark_new
Unit: seconds
      expr      min     lq    mean   median       uq      max neval
 new_dummy 2.379217 2.4867 2.54477 2.503232 2.584663 2.785343    20

Whatever you do, you will have the overhead of nix-shell, which is significant, when you launch everything from the same session. Alternatively, you can just open two Nix R sessions in different subfolders and run the same R scripts for benchmarking in separate R environments.
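
The second option could look roughly like the sketch below, which runs one benchmark script in several Nix environments with one nix-shell invocation per environment, so the startup cost is paid once per environment rather than once per measured call. Assumptions: each directory already contains a default.nix generated by rix::rix(), bench.R is a hypothetical script, and run_in_envs and the NIX_SHELL override are names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: run the same R script once in each of several Nix environments.
# ASSUMPTION: run_in_envs and NIX_SHELL are hypothetical names; every
# directory passed in is expected to hold a default.nix from rix::rix().
NIX_SHELL="${NIX_SHELL:-nix-shell}"   # overridable, e.g. for a dry run

run_in_envs() {
  local script="$1"
  shift
  local dir
  for dir in "$@"; do
    echo "== $dir"
    # One nix-shell startup per environment; the benchmark itself then
    # runs inside that environment's R. Script paths with spaces are
    # not handled here.
    "$NIX_SHELL" "$dir/default.nix" --run "Rscript $script" || return 1
  done
}
```

For the two environments used earlier in this thread, that would be run_in_envs bench.R /tmp/R/3.6.3 /tmp/R/latest, with bench.R calling bench::mark() on the plain functions and printing or saving its own results.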

jrosell commented 2 weeks ago

I'm not sure I fully understand what you said in the last paragraph. Do you mean running two separate benchmarks in two different scripts? I think I can try it with my rix-run tool. It could make sense.