r-lib / pak

A fresh approach to package installation
https://pak.r-lib.org

Suggestion: dependency order to package dependency table #511

Open drmowinckels opened 1 year ago

drmowinckels commented 1 year ago

I love pak! It's solving a real issue I have in a package for my work. But it's missing one small thing to be my holy grail.

So, I have a package that downloads the source file of a package, and the source files of its dependencies, to prepare a zip file for import into an air-gapped environment where the packages can be installed. pak creates a really lovely basis for this through pak::pkg_download().

The only thing I am missing is some way to know which order the packages should be installed in, so that dependencies are in place before the packages that need them.

During pak::pkg_deps_tree() you construct a tree (https://github.com/r-lib/pak/blob/main/R/package.R#L267), but then return the data without any indication of the order in which the packages depend on each other, which would help with something like this.

This could potentially be solved by adding another column to the package dependency data frame (although it is already quite large), or by returning the rows in a valid installation order.

gaborcsardi commented 1 year ago

With a bit of work you can use the dependency information in the deps column:

❯ dl <- pak::pkg_download("dplyr", dest_dir = "/tmp/dl", platforms="source", dependencies = NA)
ℹ No downloads are needed, 16 pkgs (6.04 MB) are cached

❯ dl$deps[[1]]
# A data frame: 23 × 5
   ref         type     package     op    version
 * <chr>       <chr>    <chr>       <chr> <chr>
 1 R           depends  R           ">="  "3.4"
 2 callr       suggests callr       ""    ""
 3 covr        suggests covr        ""    ""
 4 crayon      suggests crayon      ""    ""
 5 digest      suggests digest      ""    ""
 6 glue        suggests glue        ">="  "1.6.0"
 7 grDevices   suggests grDevices   ""    ""
 8 htmltools   suggests htmltools   ""    ""
 9 htmlwidgets suggests htmlwidgets ""    ""
10 knitr       suggests knitr       ""    ""
# ℹ 13 more rows
# ℹ Use `print(n = ...)` to see more rows

You need to drop the rows where type is "suggests" or "enhances", and the rows where package is R or a base package (see pak:::base_packages()). The rest defines the DAG corresponding to the order(s) of installation.
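
For example, here is a minimal sketch (not a pak API) of deriving one valid sequential installation order from the result above. It assumes dl has one row per package, with the package and deps columns shown in the printout:

pkgs <- dl$package
# Keep only hard dependencies that are among the downloaded packages;
# R and the base packages are not in pkgs, so they drop out here.
hard <- lapply(dl$deps, function(d) {
  d <- d[!d$type %in% c("suggests", "enhances"), ]
  intersect(d$package, pkgs)
})
names(hard) <- pkgs

# Kahn-style topological sort: repeatedly take every package whose
# hard dependencies have all been "installed" already.
ord <- character()
while (length(hard) > 0) {
  ready <- names(hard)[vapply(hard, function(x) all(x %in% ord), logical(1))]
  if (length(ready) == 0) stop("dependency cycle among the packages")
  ord <- c(ord, ready)
  hard <- hard[setdiff(names(hard), ready)]
}
ord  # one possible sequential installation order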

OTOH, if you are installing (a subset of) the packages, you can still use pak to install them. You can create a mini repository on disk by calling tools::write_PACKAGES() in the download directories, then add this repo using a file:// URL, and pak will happily (and quickly) install the packages from disk.
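
A rough sketch of that workflow (the paths are illustrative, and assume pkg_download() created a CRAN-like src/contrib tree under /tmp/dl):

tools::write_PACKAGES("/tmp/dl/src/contrib", type = "source")
# file:// URLs need an absolute path:
repo <- paste0("file://", normalizePath("/tmp/dl"))
options(repos = c(local = repo))
pak::pkg_install("dplyr")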

gaborcsardi commented 1 year ago

Btw. there is usually a large number of possible installation orders, and pak never actually creates such an order because it installs packages in parallel. But if that's helpful I can add a column to the output that defines a fixed installation order that you can run sequentially.

drmowinckels commented 1 year ago

OTOH, if you are installing (a subset of) the packages, you can still use pak to install them. You can create a mini repository on disk by calling tools::write_PACKAGES() in the download directories, then add this repo using a file:// URL, and pak will happily (and quickly) install the packages from disk.

This is a brilliant idea. I'll give that a go and see if I can't figure it out!

drmowinckels commented 1 year ago

Trying to get this working, but hitting some things I am a little uncertain of.

The argument r_versions is plural, making me think I should be able to provide a vector of R versions to download for. But when I do, I get errors:

> pak::pkg_download(
    "psych",
     dest_dir ="test_pak",
     dependencies = NA,
     r_versions = c("4.3", "4.1", "4.4"),
     platforms = c("source", "windows")
)
✔ Updated metadata database: 5.38 MB in 37 files.                         
ℹ R 4.1 i386+x86_64-w64-mingw32 packages are missing from Bioconductor     
ℹ R 4.4 i386+x86_64-w64-mingw32 packages are missing from Bioconductor     
ℹ R 4.3 i386+x86_64-w64-mingw32 packages are missing from Bioconductor     
✖ Updating metadata database ... failed  

But running "4.4" in singular works.

pak::pkg_download(
     "psych",
     dest_dir ="test_pak",
     dependencies = NA,
     r_versions = c("4.4"),
     platforms = c("source", "windows")
)
✔ Updated metadata database: 2.14 MB in 10 files.                         
ℹ R 4.4 i386+x86_64-w64-mingw32 packages are missing from Bioconductor     
✔ Updating metadata database ... done                                     
ℹ Getting 4 pkgs (7.76 MB), 6 (4.53 MB) cached                             
✔ Got mnormt 2.1.1 (x86_64-w64-mingw32) (179.34 kB)                                            
✔ Got lattice 0.21-8 (x86_64-w64-mingw32) (1.36 MB)                                    
✔ Got nlme 3.1-162 (x86_64-w64-mingw32) (2.34 MB)                             
✔ Got psych 2.3.3 (i386+x86_64-w64-mingw32) (3.87 MB)                 
✔ Downloaded 4 packages (7.76 MB) in 6.6s

Having the version option is great, as the air-gapped machines usually lag behind in R versions, or might have projects bound to modules with specific versions for reproducibility.

So, for that last call I at least get a folder with data, win, and src stuff. Looking good. Now, if I try running tools::write_PACKAGES("test_pak"), nothing seems to happen: no new files, and the console returns nothing.

Running tools::write_PACKAGES("test_pak", subdirs = TRUE) does create the PACKAGES files :tada:

But I still can't seem to use that when installing:

> install.packages("psych", repos = file.path("file:/","test_pak"), type = "source")
Error in read.dcf(file = tmpf) : cannot open the connection
In addition: Warning message:
In read.dcf(file = tmpf) :
  cannot open compressed file '/Users/athanasm/workspace/test_pak/src/contrib/PACKAGES', probable reason 'No such file or directory'

And yeah, the PACKAGES files are in the root of the folder I provide, not in the subdirs. So I need to make the PACKAGES files in each of the contrib folders, is that correct? That makes it work, but it was unexpected (to me): I thought they would be in the root of the folder, and the install would (should) choose between source and binary given the OS, like it would from CRAN?

Sorry, some of these questions are not directly pak things; I'm just trying to wrap my head around how this all works, so I can make this package and improve my own and my colleagues' work on the air-gapped server.

gaborcsardi commented 1 year ago

The argument r_versions is plural, making me think I should be able to provide a vector of R versions to download for. But when I do, I get errors:

Yes, this is a bug. It should work, but it is poorly tested currently. I'll open an issue for it.
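
In the meantime, a possible workaround (an untested sketch, relying only on single-version calls working as shown above) is to call pkg_download() once per R version:

for (rv in c("4.3", "4.1", "4.4")) {
  pak::pkg_download(
    "psych",
    dest_dir = "test_pak",
    dependencies = NA,
    r_versions = rv,
    platforms = c("source", "windows")
  )
}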

So I need to make the PACKAGES files in each of the contrib folders, is that correct?

Yes, that's how CRAN organizes their directories.
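
For example (a sketch; the paths follow the standard CRAN layout, so adjust them to whatever pkg_download() actually created):

tools::write_PACKAGES("test_pak/src/contrib", type = "source")
tools::write_PACKAGES("test_pak/bin/windows/contrib/4.4", type = "win.binary")
# Build the repo URL from an absolute path; "file:/" plus a relative
# path, as in the error above, will not resolve.
repo <- paste0("file://", normalizePath("test_pak"))
install.packages("psych", repos = repo, type = "source")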

One thing I was thinking about is that pkg_download() might actually create the PACKAGES files. Or we could have a pkg_mirror() function that mirrors a subset of packages locally. This would be especially great if we fixed the bug about using multiple R versions.

drmowinckels commented 1 year ago

Ok, thanks. My head is starting to be wrapped :)

Theoretically, if pkg_download() makes the PACKAGES files, will it then rerun and remake them when updating/adding more packages to the same folder?

Historically, I have made a new zip file for each package a user wants installed in the air-gapped environment, with the recipe to install, and imported and run them all sequentially. With pak I see the option of letting multiple packages use the same system, reducing the overall size and making it more efficient. But that would mean that the PACKAGES files would need to get updated every time it runs, right? Which I guess hardly matters, since it takes so little time to run.

I like having it wrapped in a single function like pkg_download(), though I could see a cleaner use of it through a separate function like pkg_mirror(). Thinking about it, pkg_mirror() might be the better option in the long run.

gaborcsardi commented 1 year ago

Theoretically, if pkg_download() makes the PACKAGES files, will it then rerun and remake them when updating/adding more packages to the same folder?

Yeah, you are right that that API would not be very intuitive for pkg_download().

Something like pkg_mirror() or repo_mirror() or whatever it is called, could update the PACKAGES* files.

Which I guess hardly matters, since it takes so little time to run.

Currently it decompresses all package files to read the metadata, so it actually takes a while if you have a lot of packages.
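
As a side note, tools::update_PACKAGES() (available since R 3.6) only re-reads new or changed package files, so incrementally refreshing an existing repository index is cheaper than rebuilding it:

tools::update_PACKAGES("test_pak/src/contrib", type = "source")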

OTOH, I understand that having a single zip file instead of a local repository has a lot of value for your use case, so maybe we could have a better solution for that first. E.g. a pkg_bundle() function that zips up everything, plus a way to install a bundle, with a bundle format that only requires an unzip.

But in any case, adding a column for one possible installation order is still fine. Or ordering that data frame according to a legal installation order is also an option.

drmowinckels commented 1 year ago

OTOH, I understand that having a single zip file instead of a local repository has a lot of value for your use case, so maybe we could have a better solution for that first. E.g. a pkg_bundle() function that zips up everything, plus a way to install a bundle, with a bundle format that only requires an unzip.

To be honest, yes, if you are interested in catering to this scenario in pak directly, that would be amazing! The single zip file is just for easy porting into the air-gapped environment, and gives a simple single file to point to when starting the install process once inside.

If you are interested, my current dev version of this package is here. It is broken right now, but I am making progress on porting things to pak. The README alone might help you understand where I have been and where I want to go with this.

Having a pkg_bundle() would be really neat and very convenient. So let's say you have things bundled, and then want to run e.g. repo_mirror(): it would be convenient if that could run on an already bundled folder, or else we would also need a pkg_unbundle() before making a mirror?

gaborcsardi commented 1 year ago

I think pkg_bundle() would have to work w/o a repo, and always re-bundle everything. Downloads are cached, so this is not that bad.

repo_mirror() would be completely separate, and it would indeed update the repo properly. I could imagine a repo_bundle() function that starts with the repo, and bundles (part of?) it up, though.

gaborcsardi commented 1 year ago

@drmowinckels So, currently, pkg_download() will download all builds of the packages, e.g. if a package has a binary and a source build, it will download both (by default).

This makes the installation order tricky, because if the binary and the source builds are of different versions, then their installation order might be different. In fact, pak does not even run the dependency solver for pkg_download().

Do you download source or binary packages? Do you use the platforms argument of pkg_download() to specify which ones you want?

drmowinckels commented 1 year ago

Thanks for the thorough information, I'm learning a lot!

Initially, I only ever got the source files, as I had only our RedHat virtual environment in mind. Later, I have seen more and more issues posted by people using the package for the Windows VMs as well. So I was thinking of exposing the platforms argument to let the user resolve this (or alternatively give them a simpler option to choose "windows" or "linux", and then provide the correct corresponding argument to pak for them).