microsoft / sqlmlutils

Utility functions for easier usage of SQL Server Machine Learning Services

Improve R Package Dependency Resolution and Script Execution Time #99

Closed seantleonard closed 2 years ago

seantleonard commented 2 years ago

Why is this change being made?

As mentioned in #95, script execution time can be slow. For R packages, the utility included LinkingTo packages when resolving dependencies for binary packages. LinkingTo packages (e.g. BH and Rcpp, which ship many header files) can have large file counts, and those contribute to longer execution times: there are many more files that need permissions applied when SQL Server uses the launchpad service to instantiate an external script session contained in an AppContainer.

What does this change do?

When the utility calculates package dependencies, it gets type-specific URL paths (binary and source) to the configured CRAN repos and compiles a list of available packages to install. The available binary and source package lists are combined into one list and are now joined such that the binary package is kept when a package exists as an entry in both the binary and source lists.
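The "keep the binary entry on a name clash" merge described above can be sketched as follows. This is only an illustration in Python, not the utility's R code; the function name merge_available, the dictionary layout, the version strings, and the package name srconly are all assumptions made for the example.

```python
def merge_available(binary, source):
    """Combine two {package_name: info} tables into one.

    When a package appears in both tables, the binary entry is kept.
    """
    merged = {}
    for name, info in binary.items():
        merged[name] = {**info, "type": "binary"}
    for name, info in source.items():
        # setdefault leaves an existing (binary) entry untouched,
        # so source entries only fill in packages with no binary.
        merged.setdefault(name, {**info, "type": "source"})
    return merged


# Made-up sample listings, not real repository contents.
binary = {"iptools": {"version": "1.0.0"}}
source = {
    "iptools": {"version": "1.0.0"},
    "srconly": {"version": "2.0.0"},  # hypothetical source-only package
}

available = merge_available(binary, source)
print(available["iptools"]["type"])  # binary
print(available["srconly"]["type"])  # source
```

Tagging each merged entry with its type is a design choice of this sketch; it lets a later step decide per package whether LinkingTo dependencies are needed.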

On its own, the above change is sufficient only when a user asks for a single package to be installed via sql_install.packages(). The utility now also iterates over each requested package and resolves its dependencies (including or excluding LinkingTo dependencies) based on whether that package is available as a binary or only as source.

The function tools::package_dependencies() takes an argument named which that, by default, includes LinkingTo packages in the dependency calculation (RDocumentation). The utility now determines whether a package is available as source or binary, and populates the which argument accordingly.
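The per-package choice of dependency fields can be sketched like this. It is a simplified Python illustration, not the utility's R implementation: the field lists, the function names dependency_fields and resolve, and the packages binpkg and srcpkg are assumptions. In particular, the exact fields the utility passes for binary packages are assumed here to be Depends and Imports, and each dependency is resolved using its own type.

```python
# Assumed field sets: binaries skip LinkingTo, sources include it.
BINARY_FIELDS = ("Depends", "Imports")
SOURCE_FIELDS = ("Depends", "Imports", "LinkingTo")


def dependency_fields(pkg_type):
    """Pick the dependency fields to follow for a package type."""
    return BINARY_FIELDS if pkg_type == "binary" else SOURCE_FIELDS


def resolve(requested, available):
    """Return the requested packages plus their transitive dependencies."""
    needed = []
    queue = list(requested)
    while queue:
        name = queue.pop(0)
        # Skip packages already handled or absent from the listing.
        if name in needed or name not in available:
            continue
        needed.append(name)
        entry = available[name]
        for field in dependency_fields(entry["type"]):
            queue.extend(entry.get(field, []))
    return needed


# Hypothetical listing: binpkg and srcpkg are made-up package names.
available = {
    "binpkg": {"type": "binary", "Imports": ["Rcpp"], "LinkingTo": ["Rcpp", "BH"]},
    "srcpkg": {"type": "source", "Imports": ["Rcpp"], "LinkingTo": ["Rcpp", "BH"]},
    "Rcpp": {"type": "binary"},
    "BH": {"type": "binary"},
}

print(resolve(["binpkg"], available))  # ['binpkg', 'Rcpp']
print(resolve(["srcpkg"], available))  # ['srcpkg', 'Rcpp', 'BH']
```

In this sketch, BH is pulled in only for the source-only package, which mirrors the behavior the two tests in this PR check for.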

How is this change tested?

Two tests are added to validate that only the appropriate packages are installed.

  1. Binary Package install with LinkingTo dependency ensures no LinkingTo packages are installed if the package has an available binary.
  2. Source Package install with LinkingTo dependency ensures the LinkingTo package dependencies are resolved and are made available for reference when a package is built from source.

Aniruddh25 commented 2 years ago

With this fix, the unneeded packages will no longer be installed, so client scripts that depend on those packages might start to fail without their authors realizing why. So I think we should at least bump the sqlmlutils minor version and document the new behavior when we release this fix.

seantleonard commented 2 years ago

Which test verifies that if there are packages with the same name, it chooses the binary?

The test Binary Package install with LinkingTo dependency tests this end to end. I've added a clarifying comment to the test case to describe how this test case is fulfilled.

'iptools' is available as source and binary. This test validates that the LinkingTo package 'BH' is not installed. If 'BH' is installed, that means that the 'iptools' source package was chosen, because LinkingTo packages are required when building from source.