grackle-project / grackle

The Grackle chemistry and cooling library for astrophysical simulations and models.
Other
25 stars 50 forks source link

Distributing ``Pygrackle`` (and ``Grackle``, itself) #220

Open mabruzzo opened 2 months ago

mabruzzo commented 2 months ago

Once #182 and #208 are merged, I think we will be in a good position to consider packaging (to let people install things without manually building from source). This issue primarily exists to get the discussion going.

This Issue primarily focuses on packaging/distributing Pygrackle. It also touches on the idea of packaging Grackle, itself. The latter idea is mostly discussed as a hypothetical idea -- it is primarily mentioned to highlight that some challenges are common in both scenarios (and that we may want to pick solutions that would help resolve the issue in both scenarios).

The biggest challenge in most scenarios is the fact that Grackle requires the hdf5 library. A lesser, common issue is convient distribution of datafiles (that was acknowledged in #210).

Packaging/Distributing Pygrackle

I think there are 2 primary things we want to do here:

  1. Distribute through conda-forge.
    • The hdf5-dependency should not be an issue in this scenario (one of the reasons to conda exists is to help address this sort of dependency)
    • 2 questions exists:
      1. Is there any advantage to creating separate conda packages for
      2. Is there any advantage to waiting until we start distributing on PyPI
  2. Distribute through pypi
    • the hdf5-dependency is an issue in this scenario. The proper way to distribute Grackle in its current state is to distribute it with a copy of the hdf5 library on every platform -- this is exactly what the likes of h5py and PyTable do
    • The way I see it, we have 3 options:
      1. Suck it up and distribute a copy of hdf5 with pygrackle
      2. We could refactor the Grackle to support a mode where we use the dlopen to dynamically load the hdf5 library at runtime. This is what the hdf5plugins package does with h5py
        • For the uninitiated, this is the mechanism that python actually uses to load in extension modules. It is also the mechanism used by the ctypes module
        • What I envision:
          • we would refactor all hdf5 function calls scattered throughout Grackle to make use of function pointers that we store in some kind of internal struct.
          • Then, during initialization, we would provide grackle with the locations of the hdf5 library. The initialization routine would then use dlopen with this information to open the hdf5 library at runtime and would initialize the function-pointers in that internal struct
          • Pygrackle in particular, would use the version of hdf5 that ships with h5py
        • The main disadvantage: I think that dlopen has some weird platform dependent behavior
      3. We could do something extremely similar to option 2, but we could make the cython-layer responsible for determining the function pointers (or providing wrappers around the function pointers). In other words, we wouldn't need to directly invoke dlopen
    • It's worth mentioning, that options 1 and 2 may be relevant, if we decide we want to distribute precompiled Grackle binaries

Packaging/Distributing Grackle

This is all a much lower priority than shipping Pygrackle, but it's worth briefly mentioning that we could also distribute precompiled Grackle binaries. I've seen this sort of thing done alongside new software releases for other open-source projects on GitHub.

Since we will already be doing a lot of work to ship Grackle as a precompiled binary alongside Pygrackle, I actually think this wouldn't be hard to do. Moreover, the new CMake build-system should be compatible with the CPack tool.

Again, this is a hypothetical low-priority idea. But, there are 2 reasons I bring this up:

  1. The hdf5-dependency problem crops up here. In light of that, it might pick a solution that addresses this case AND the Pygrackle case (instead of requiring 2 separate solutions).
  2. Data-files are also an issue here. Consequently, it might make sense to write the python code that we would ship in Pygrackle for fetching the datafiles in a portable way that could be reused later (maybe we could implement the functionality in a self-contained file with a if __name__ == __main__ section that lets the file be installed alongside Grackle as a self-contained, executable script that serves as a command-line tool for fetching data files?[^1]

[^1]: @matthewturk previously suggested using pooch. Using that module probably wouldn't be ideal for helping in the case of distribution Grackle (without Pygrackle). For portability, it might be better to rely upon functionality from the python standard-library, rather than the pooch package. OR since distributing Grackle itself is mostly a hypothetical, maybe we should use pooch and limit ourselves to a subset of pooch-functionality that we could re-implement later in terms of standard library functionality (if/when it becomes an issue).

mabruzzo commented 2 months ago

Upon reflection, I don't think the dlopen approach would work in a robust forwards compatible manner. The problem is that a number of macros are defined in headers that we need to invoke the HDF5 functions such as H5F_ACC_RDONLY, H5P_DEFAULT, H5S_ALL, H5T_C_S1, H5T_STD_I32BE, H5T_STD_I64BE, H5T_IEEE_F32BE, H5T_IEEE_F64BE, H5T_STD_B8BE, H5T_NATIVE_INT, H5T_NATIVE_LLONG, H5T_NATIVE_FLOAT, H5T_NATIVE_DOUBLE, H5T_NATIVE_LDOUBLE. These are all API-stable, but not ABI stable -- the associated underlying values are technically allowed to change from 1 version to the next. [^1]

Consequently, resolving the hdf5-distribution problem actually has the following solutions:

  1. We don't try to resolve the issue. Instead we simply upload a sdist (source distributions) to PyPI (pip install pygrackle will download the sdist and will try to launch grackle and pygrackle locally)
    • if we wanted to be able to ship Grackle itself without resolving the issue, we would probably need to move all hdf5-related code out of the source file and put it into inline functions defined in the header file (consequently, compiling Grackle itself, wouldn't involve any HDF5 headers or linking)
  2. We simply bite the bullet and package hdf5 as part of pygrackle (and Grackle if we ever choose to distribute it)
  3. We adopt the solution from the original post where we pass in structs with function pointers (but don't use dlopen):
    • Pygrackle would query h5py to determine these function pointers (I believe h5py is written in a way that we could dynamically query the values associated with the various macro quantities)
    • We could ship a public header-file that includes the hdf5 header and provides an inline-function that constructs the struct of function pointers for the downstream application
    • EDIT: upon reflection, this approach introduces some challenges with using Fortran bindings. To do this, we probably need to ship a separate shared library (that directly links to hdf5) to use for this purpose. I'm very hesitant to do this (seems like more maintenance), but this actually isn't as crazy as it might sound -- if we want to easily support the "dynamic api" in Fortran, we would probably need to ship a separate shared library anyways

There are some other crazier solutions too (probably not worth going into)

For now I say we ship a SDist and punt on the rest of the hdf5 issues.

[^1]: In principle, we could maintain a small database where we record the values of each of these macros for each of hdf5's minor version (e.g. 1.6, 1.8, 1.10, 1.12, 1.14, etc.), but I don't really think we want to do that...

mabruzzo commented 2 months ago

Some additional questions:

As I talk this out, I posit we should ship as little as possible in a normal wheel. To experiment with shipping more, we should distribute Grackle and pygrackle separately on conda-forge

matthewturk commented 2 months ago

I'm late to this, and it seems that you have this largely under control. My main comment to make is that I agree with your assertion that dlopen is not a workable solution.

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

mabruzzo commented 2 months ago

(Note: this comment was heavily revised for clarity)

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

I'm happy to do that. That makes some other decisions easier.

As I understand it, (and please correct me if I'm wrong), the cythonize functionality needs to know where to find grackle_defs.pxd and any headers referenced therein. I think we would need to provide a function like numpy.get_include[^1] or mpi4py.get_include.

There is a wrinkle here. There are actually 2(ish) ways to build pygrackle with the new build-system and the locations of the Grackle header depends on which way is used. These approaches include

  1. an approach where we freshly compile Grackle and embed it inside the pygrackle repository. This is the default approach (if we ever upload pre-compiled binary wheels to PyPI, they would also be built this way)
  2. an approach where we link to an existing grackle-build (mostly for testing purposes). There are a couple of flavors to this approach: (i) link against a classic style build or (ii) linking against Grackle in a cmake-build-directory or (iii) linking against a Grackle installation produced by cmake. (From pygrackle's perspective, there is no difference between ii and iii)

The problem is that the Grackle headers will be in different locations for both cases (we can't just hardcode them into Pygrackle).

For concreteness, I'm talking about adding functionality that would allow an Extension Module compiled against Pygrackle to have a setup.py file that looks like:

import pygrackle
#...
Extension('my-extension-name', #...
          include_dirs=pygrackle.get_include_list())

At the moment, it's not clear the best way to do this sort of thing... If I generated a toml or ini file, do you know if this is generally a use-case for importlib.resources?

EDIT: it turns out scikit-build-core has some trouble right now with editable installs and importlib.resources. To implement this functionality, I would instead make a template-file called __config__.py.in and have the build-system inject build-data into this file. numpy and scipy do the exact same thing (they use meson functionality with a direct CMake counterpart).

[^1]: Aside: that webpage references the fact that they ship a pkg-config file for their numpy headers (i.e. numpy.pc). From inspecting the file, it looks like that doesn't provide much info. But, maybe we should also provide a pygrackle.pc file (especially since #182 exposes machinery for a grackle.pc file).

mabruzzo commented 1 month ago

I'm late to this, and it seems that you have this largely under control. My main comment to make is that I agree with your assertion that dlopen is not a workable solution.

I believe that we should ship grackle_defs.pxd, but I am willing to concede that this may not be universal. I have in the past found myself wanting .pxd files and I think I remember actually using the grackle pxd's in a separate project. However, that is likely to be the extreme end and I have no problem not using grackle wheels in that unlikely event.

@matthewturk I've thought about this some more, and now I think I can articulate my concerns a little better.

Some Examples:

Off the top of my head, I can name 3 python popular packages that include .pxd files inside their source directory and are linked against standalone software. I think they correspond to 3 representative cases:

First, there is numpy:

Next, there is mpi4py:

Finally, there is h5py:

Relating this to pygrackle

If the content's of Pygrackle's .pyd file(s) only referenced internal extension types, without mention of any grackle-headers, I would have 0 concern with including the .pyd file(s) in the wheel. In that case, we are effectively in the numpy scenario.

In reality, the contents of Pygrackle's .pyd directly wrap the grackle headers.

However, Pygrackle firmly lies in the h5py case. For that reason, we are in this weird in-between:

I guess I'm just worried about providing an inconsistent experience (when it comes to python-installation stuff I would like things to "just work"). I'm also a little worried that I may be missing other obvious cases (related to python packaging) where things could go wrong.

What do we do?

I would feel a lot more comfortable with all of this if we knew of a similar popular package (to the h5py scenario) that does distribute the pyd files in the wheel? Is anyone aware of one?

I'm also willing to accept that these concerns are overblown. One might argue that anyone who is actually linking an Extension Module should be able to understand all of these issues.[^1] Furthermore, unless someone is linking against an external classic-build, they need to go out of their way to delete/move the headers, while ensuring that Pygrackle continues to work.[^2]

A few other thoughts:

[^1]: with that said, this doesn't really help an arbitrary end-user of that Extension Module. [^2]: unless you really know what you're doing (in which case, you probably can understand these challenges), it seems extremely unlikely that you would want to build an external-grackle-copy, link that to pygrackle, and then delete everything other than pygrackle and the libgrackle.so. [^3]: For any standard package-manager situation, where pygrackle and grackle are distributed as separate packages, (e.g. perhaps the conda packages?) we would probably want ways to avoid raising the error-messages.

matthewturk commented 1 month ago

I think you make a compelling case that this is too complex to ship the pxds. I think that I am uncertain about the actual difficulties of distributing grackle via pypi, but I think it is worthwhile to say that linking requires extra effort on the part of the person doing the linking if they want to use the pypi package. I think going ahead with the simplest solution with the lowest probability of errors is the appropriate course.

mabruzzo commented 1 month ago

I wanted to record a few thoughts that I've had about this topic over the past few weeks (hopefully this is semi-coherent)

Area I: Distributing binary wheels from PYPI

Distributing binary wheels from PYPI may be a lot more doable than I originally thought! My primary concern has always been with cross-compiling hdf5 (I also have a much better understanding about the cross-compilation process than I used to)

I was concerned because the scripts used by h5py and pytables for this purpose are fairly complex (things seemed tricky in the julia packaging ecosystem as well)

With that said, it looks like a lot of effort has gone into making hdf5 version 1.14.y.z easier to cross-compile (e.g. see HDFGroup/hdf5#1203 and other issues linked therein)

Area II. What to include in a wheel

In light of the fact that we don't initially plan to ship the cython pxds and the general issues described in the next section related to using a precompiled grackle library for most simulations, I'm inclined to just ship the bare minimum in the wheels (at least to start). I'm definitely happy to revisit this (especially if somebody has a particular application in mind), but I think it makes sense to initially be conservative on this choice.

Area III. Obstacles to distributing precompiled libgrackle.[so|dylib|a] for general-use (e.g. for compiling a simulation code)

These obstacles are more obvious in hindsight, and I have already talked about some of this elsewhere in this thread. But I think it would be helpful to restate some of this for concreteness (especially now that I appreciate the full-scope of the issue).

General-use refers to using the pre-compiled library for compiling a simulation code.

The primary obstacle

The primary obstacle relates to the fact that a precompiled version of grackle needs to be linked against an ABI compatible version of it's runtime dependency hdf5 -- If you compile against hdf5 1.x.y.z, problems arise if you link against a library where some version-digit (other than z) has changed.[^1] (This is also true for the fortran runtime library, used by the internal routines, but that is less of an issue[^2])

Let's consider the simplest general-purpose grackle distribution mechanism that would work for the largest number of systems involves. Simply distribute a tarball containing:

In fact, this distribution mechanism would work for developing any pure-C/C++ downstream application that doesn't directly link to hdf5 or its dependencies (or link to dependencies that directly link to hdf5/hdf5's dependencies). It can also be used with a Fortran code that is compiled with a recent-ish gfortran version. In other words, it works with the examples shipped with Grackle or simulation codes that don't use an hdf5 output format.

Problems arise if you try to compile a downstream application that links against the precompiled grackle binary and links against an incompatible hdf5 version. In more detail, these problems arise because Linux and most (all?) mainstream unix-like operating systems[^3] construct a global table of symbol-names and it's problematic to be linked two 2 libraries that provide conflicting definitions of a symbol with a given name.

Overcoming these obstacles

This issue is hardly unique to Grackle. There are 2 options without refactoring (but neither are particularly attractive):

  1. Include everything necessary in the tarball needed to have downstream applications compile against the included version of libhdf5. There are 2 issues:

    • I don't think we should be in the business of providing a full HDF5 distribution -- a lot could go wrong.
    • A lot of custom behavior may need to be injected into makefile build-systems of the downstream applications (it would be straight-forward if downstream applications use cmake, but there are convenient alternatives in those cases).
  2. Distribute Grackle via package managers like apt/dpkg, brew, conda[^4], dnf, etc. (i.e. we don't provide a tarball). In this scenario, we precompile grackle against the version of hdf5 provided by the package manager we would require downstream simulations would try to compile against that same hdf5 version. Disadvantages include:

    • it would be too much work for us to maintain a bunch of individual packages (realistically, we would probably just pick 1 or 2).
    • issues might arise with respect to parallel-hdf5[^5]

If we ultimately decide this is something worth doing, we could refactor the source code to better support this scenario. Currently Grackle only has 2 functions that directly reference the HDF5 API; initialize_UVbackground_data, and initialize_cloudy_data.[^6] Basically we could move the definitions to one or two different locations:

  1. We could move these 2 function definitions into a separate library -- let's call it libgrackle-h5tools.so. We would always ship libgrackle-h5tools.so alongside libgrackle.[so|a] and we would have libgrackle.so access these 2 functions by using dlopen, which lets us avoid any symbol collisions.[^7]

    • this is a solution that would be fairly portable to any posix-compliant system (on windows there is alternative, equivalent functionality).
    • Unlike the other mention of dlopen described earlier in this thread, this usage would be more robust since we always know the relative path of the shared library it is used to open
    • Advantage: The build-systems of simulation codes wouldn't need any modifications
    • EDIT: Actually, I think this gets a little messy if you compile grackle as a static library (related to specifying the path to the dlopened library)
    • Disadvantage: The way to specify the relative paths between the 2 directories may be somewhat platform-dependent, but I don't think addressing this would be terribly difficult.
    • Disadvantage: The naive/straightforward implementation would introduce an extra file-system for each process in a simulation that loads grackle (this may be meaningful on certain fragile parallel filesystems)...
  2. We could move these 2 function definitions into the public headers and make them inline-functions (so that they are defined within the binary of the simulation-code rather than in grackle). To be able to support fortran bindings, we would need to ship a second version of the grackle-library alongside the main one that fortran codes would explicitly link to (since the fortran codes can't explicitly define inline functions).

    • Advantage: This is guaranteed to be more portable across systems
    • Disadvantage: This requires modifications to the build-systems of most simulation codes that aren't built with cmake (if pkg-config is used, modifications may not be necessary). A fortran simulation code will need to change the -L linker flag. A c/c++ code that doesn't use hdf5 outside of grackle will need to pass a new -I flags to specify the hdf5 header locations and pass -lhdf5. Other c/c++ codes may also require similar changes
    • Disadvantage: The naive/straightforward implementation would force us to keep certain implementation details exposed (and would undermine efforts to hide these details and achieve a stable ABI).

[^1]: This is true whether you pre-compile grackle as a shared or static library. I have been [^2]: At one time, I was more concerned about linking libgfortran.so. But, I have since learned that libgfortran.so employs symbol versioning, like the standard c runtime library on linux (glibc), in order to mitigate ABI compatibility issues. If a downstream application contains Fortran-code, that Fortran code would need to be compiled with gfortran 8 or newer (released 6 years ago), which is reasonable. It's worth mentioning that openmp runtime libraries pose a larger problem than hdf5, but that is a different topic that we can ignore right now. [^3]: As I understand it, this isn't actually an issue on Windows [^4]: In this case, there would be separate grackle and pygrackle conda packages [^5]: I don't know enough about the internals of hdf5, but since grackle doesn't depend on mpi, it's plausible this isn't a major issue. [^6]: This pending debug-tool introduces another one. [^7]: The CPython interpretter uses dlopen to open extension modules. This is why we can currently use versions of h5py and pygrackle together that were compiled against different (incompatible hdf5 versions).