qcscine / molassembler

Chemoinformatics toolkit with support for inorganic molecules
https://scine.ethz.ch/download/molassembler
BSD 3-Clause "New" or "Revised" License
33 stars 7 forks source link

Do not copy shared libs into Python modules #7

Closed awvwgk closed 1 year ago

awvwgk commented 2 years ago

I noticed that building the Python bindings of molassembler (and most other Scine modules), will try to copy needed shared libraries into the Python package. Copying the utilities module (~105MB) into every Python package is quite wasteful and defeats the point of dynamic linking.

Is there an easier way than patching the CMake build files to not copy shared libraries from outside the project into the Python package?

weymutht commented 2 years ago

I have troubles reproducing your huge package size. On PyPI, the entire scine-utilities package is a mere 6 MB, and even after extraction, it's only about 24 MB. How did you compile the Utilities module?

awvwgk commented 2 years ago

The conda package for Linux tend to be quite large (see https://anaconda.org/conda-forge/scine-utilsos/files), on OSX I don't have these issues. The build script is given here: https://github.com/conda-forge/scine-utilsos-feedstock/blob/main/recipe/build.sh.

CXXFLAGS are set to

+CXXFLAGS=-fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem $PREFIX/include -fdebug-prefix-map=$SRC_DIR=/usr/local/src/conda/scine-utilsos-6.0.0 -fdebug-prefix-map=$PREFIX=/usr/local/src/conda-prefix

i.e. building codepaths for two microarchitectures.

weymutht commented 2 years ago

I see, this is a conda-specific thing, then. Unfortunately, I have only very limited experience with conda. Patching the CMake files should definitively lead to the desired result, but I'm not sure whether there is an easier way.

awvwgk commented 2 years ago

I don't understand what you mean with conda-specific? I'm building on a Linux machine (CentOS 6) with a standard GCC version (10.2.0), the main difference is that the resulting binary will be made relocatable (by RPATH replacement among other things).

Copying shared libraries is messing up dependency resolution done by conda-build to figure out which shared objects are needed, since the utilities library would appear again and requires full resolution of all symbols by providing the run requirements (direct rather than transient):

   INFO (scine-molassembler,lib/python3.8/site-packages/scine_molassembler/scine_molassembler.cpython-38-x86_64-linux-gnu.so): Needed DSO lib/libutilsos.so found in conda-forge::scine-utilsos-5.0.0-h99171d1_0
   ...
   INFO (scine-molassembler,lib/libmolassembler.so): Needed DSO lib/libutilsos.so found in conda-forge::scine-utilsos-5.0.0-h99171d1_0
   ...
   INFO (scine-molassembler,lib/python3.8/site-packages/scine_molassembler/libutilsos.so): Needed DSO lib/libboost_filesystem.so.1.74.0 found in conda-forge::boost-cpp-1.74.0-h75c5d50_8
  ERROR (scine-molassembler,lib/python3.8/site-packages/scine_molassembler/libutilsos.so): Needed DSO lib/libyaml-cpp.so.0.7 found in ['yaml-cpp']
  ERROR (scine-molassembler,lib/python3.8/site-packages/scine_molassembler/libutilsos.so): .. but ['yaml-cpp'] not in reqs/run, (i.e. it is overlinking) (likely) or a missing dependency (less likely)
   ...

Note that duplicated library is not actually used.

jan-grimo commented 2 years ago

As a little caveat, I have been removed from active development of most of these modules for a while and am sure to have forgotten some of the motivations and reasoning from each design decision. As best as I can remember:

Copying the utilities module (~105MB) into every Python package is quite wasteful and defeats the point of dynamic linking.

Well, for debug builds, yes, the libraries are quite large. However, we do want each python package do be independent. I.e. import scine_molassembler should work without requiring a preceding import scine_utilities. As a consequence, it is necessary that the utility functionality that e.g. molassembler needs is present.

I'm also not sure what you mean by defeating the point of dynamic linking. Yes, the utilities SO will be present in multiple python packages, but only one of those should ever be loaded into memory, which I would say fulfills the point of dynamic linking.

If you would prefer that we create a static utilities library and link that into the python packages to minimize the space required for our packages, I think that is something we cannot at present do since the utilities are not annotated with symbol visibilities. It is difficult to prioritize such work for little gain. And even in that case, one SO for our plugin architecture singleton would need to remain.

awvwgk commented 2 years ago

I'm also not sure what you mean by defeating the point of dynamic linking. Yes, the utilities SO will be present in multiple python packages, but only one of those should ever be loaded into memory, which I would say fulfills the point of dynamic linking.

Installing inside a package manager guarantees that the correct version is available. Conda-build will additionally make sure the correct RPATH is set for each shared module and only a single utilities module is needed for this reason.

If the molassembler package copies over a shared library, an update for the utilities module (minor or patch version bump) will not automatically propagate to the dependent package and requires a rebuild. From the rebuild effort this is identical to static linking.

jan-grimo commented 2 years ago

Fair point regarding SO substitution and updating. Conda is, however, not our only platform of concern. For our modules to function individually on PyPI, I think we will need to continue copying the required shared libraries into the packages, unless you have a suggestion on that front?

Is there something you would want us to change to make your particular intended use case easier?

awvwgk commented 2 years ago

I'm fine with patching things, especially if it only a single block of code, but I wouldn't mind a global option like -DSCINE_PYTHON_SO_COPY where I can turn this behavior off.

Note that other repositories, like Fedora or Debian, have similar or stricter policies regarding duplication of shared libraries.

awvwgk commented 1 year ago

How was this issue completed? Or was it just closed due to inactivity?

weymutht commented 1 year ago

I took the liberty of closing it after over one year of inactivity.

awvwgk commented 1 year ago

At least for the conda-forge packaging this is still an issue, which I am resolving by patching the distributed source. I understand that this is not a priority for this project.