vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] vaex modules do not work in bazel builds #2365

Open anthonycorletti opened 1 year ago

anthonycorletti commented 1 year ago

Description

Vaex does not work when installed with Bazel.

Software information

Additional information

I've attached a zip that you can download and unzip to re-create the issue.

vaex_bazel_debug.zip

This zip contains the following:

unzip vaex_bazel_debug.zip
cd vaex_bazel_debug
tree
.
├── BUILD.bazel         <= bazel build file      
├── WORKSPACE.bazel     <= bazel workspace file
├── main.py             <= main python file
└── requirements.txt    <= pip requirements file

After you've installed Bazel 5.4.0, run the following to see the module error:

$ bazel run //:main
INFO: Analyzed target //:main (83 packages loaded, 8613 targets configured).
INFO: Found 1 target...
Target //:main up-to-date:
  bazel-bin/main
INFO: Elapsed time: 2.749s, Critical Path: 1.43s
INFO: 4 processes: 4 internal.
INFO: Build completed successfully, 4 total actions
INFO: Build completed successfully, 4 total actions
/private/var/tmp/_bazel_anthcor/55ab504bf2b7b9104171188012e41fb8/execroot/__main__/bazel-out/darwin-fastbuild/bin/main.runfiles/my_deps_vaex_core/site-packages/vaex/__init__.py
['/private/var/tmp/_bazel_anthcor/55ab504bf2b7b9104171188012e41fb8/external/python_x86_64-apple-darwin/lib/python3.9/site-packages']
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_anthcor/55ab504bf2b7b9104171188012e41fb8/execroot/__main__/bazel-out/darwin-fastbuild/bin/main.runfiles/__main__/main.py", line 19, in <module>
    df.export_hdf5('df.hdf5')
  File "/private/var/tmp/_bazel_anthcor/55ab504bf2b7b9104171188012e41fb8/execroot/__main__/bazel-out/darwin-fastbuild/bin/main.runfiles/my_deps_vaex_core/site-packages/vaex/dataframe.py", line 6944, in export_hdf5
    from vaex.hdf5.writer import Writer
ModuleNotFoundError: No module named 'vaex.hdf5'

This happens because vaex ships __init__.py files inside its namespace packages, which Bazel doesn't support: Bazel unpacks each pip package into its own directory instead of merging them into a single site-packages tree.

According to https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#creating-a-namespace-package:

It is extremely important that every distribution that uses the namespace package omits the __init__.py or uses a pkgutil-style __init__.py. If any distribution does not, it will cause the namespace logic to fail and the other sub-packages will not be importable.
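
To make the failure mode concrete, here is a minimal sketch that reproduces it without Bazel or vaex. It assumes a hypothetical package name `demo_ns` and two hypothetical install roots, `core_dist` and `extra_dist`, standing in for the separate per-package directories visible in the traceback above. With a regular (empty) `__init__.py` in the first root, the sub-package in the second root cannot be imported. With a pkgutil-style `__init__.py`, it can.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# pkgutil-style __init__.py, as described in the Python packaging guide.
PKGUTIL_INIT = "__path__ = __import__('pkgutil').extend_path(__path__, __name__)\n"


def make_layout(root, core_init):
    """Create two separate install roots that both contribute to `demo_ns`.

    `demo_ns`, `core_dist` and `extra_dist` are hypothetical names used only
    for this illustration.
    """
    core = root / "core_dist" / "demo_ns"
    extra = root / "extra_dist" / "demo_ns" / "sub"
    core.mkdir(parents=True)
    extra.mkdir(parents=True)
    (core / "__init__.py").write_text(core_init)
    (extra / "__init__.py").write_text("VALUE = 42\n")
    return [str(root / "core_dist"), str(root / "extra_dist")]


def can_import_sub(core_init):
    """Return True if `demo_ns.sub` is importable across the two roots."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_layout(Path(tmp), core_init)
        sys.path[:0] = paths
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_ns.sub")
            return True
        except ModuleNotFoundError:
            return False
        finally:
            # Undo the sys.path and sys.modules changes so runs are independent.
            for p in paths:
                sys.path.remove(p)
            for name in [m for m in sys.modules
                         if m == "demo_ns" or m.startswith("demo_ns.")]:
                del sys.modules[name]


print("regular __init__.py:", can_import_sub(""))
print("pkgutil-style __init__.py:", can_import_sub(PKGUTIL_INIT))
```

With pip, every distribution lands in the same site-packages directory, so the regular `__init__.py` never gets a chance to hide its siblings. That is why the problem only appears in layouts like Bazel's.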

maartenbreddels commented 1 year ago

Ouch. Yes, that is a 'risk' we took when doing this. It has always worked with pip. I knew we didn't completely follow the standards, but as long as it worked, we kept doing it. It will be a chore to fix this in a backwards-compatible way. We would have to rename the vaex.hdf5 package to, say, vaex_hdf5, and add a vaex.hdf5 package with the same module names that do something like:

# file vaex/hdf5/foo.py
from vaex_hdf5.foo import *

# file vaex_hdf5/foo.py
...
<real content>
...
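
As a rough illustration of the shim idea above, the following sketch builds two throwaway packages in a temporary directory and checks what the old import path exposes. The names `demo_real_hdf5` and `demo_compat.hdf5` are hypothetical stand-ins for `vaex_hdf5` and `vaex.hdf5`. It is not vaex's actual layout. One caveat it demonstrates is that `from ... import *` re-exports only public names unless the real module defines `__all__`, so code that reaches for underscore-prefixed names through the old path would still break.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo(root):
    """Write a 'real' package and a compatibility shim that re-exports it."""
    real = root / "demo_real_hdf5"
    shim = root / "demo_compat" / "hdf5"
    real.mkdir(parents=True)
    shim.mkdir(parents=True)
    (real / "__init__.py").write_text("")
    (real / "foo.py").write_text(
        "def write():\n    return 'written'\n\n_private = 1\n"
    )
    (root / "demo_compat" / "__init__.py").write_text("")
    (shim / "__init__.py").write_text("")
    # The shim module mirrors the pattern proposed in the comment above.
    (shim / "foo.py").write_text("from demo_real_hdf5.foo import *\n")


def check_shim():
    """Return (same_function, private_reexported) for the demo packages."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            old = importlib.import_module("demo_compat.hdf5.foo")
            new = importlib.import_module("demo_real_hdf5.foo")
            return old.write is new.write, hasattr(old, "_private")
        finally:
            sys.path.remove(tmp)
            for name in [m for m in sys.modules
                         if m.split(".")[0] in ("demo_compat", "demo_real_hdf5")]:
                del sys.modules[name]


same_function, private_reexported = check_shim()
print("old path resolves to the same function:", same_function)
print("underscore names re-exported:", private_reexported)
```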
voxeljorge commented 1 year ago

We're also running into this and it is a bit of a blocker at the moment. One potential workaround would be to offer some kind of tooling to construct a custom wheel that holds all the files for a selected set of vaex packages.
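
A very rough sketch of what the core of such tooling might do is shown below. It assumes the selected packages have already been unpacked into separate directories and simply overlays them into one tree, the way pip's shared site-packages does. The function name `merge_site_packages` and the demo directory names are hypothetical. It does not build a real wheel: metadata, the RECORD file, and platform tags are all left out.

```python
import shutil
import tempfile
from pathlib import Path


def merge_site_packages(sources, dest):
    """Overlay several unpacked package trees into one directory.

    Later sources overwrite earlier ones when the same file exists in both.
    Requires Python 3.8+ for `dirs_exist_ok`.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for src in sources:
        shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


def demo_merge():
    """Merge two toy trees and return the merged file list (POSIX-style)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        core = root / "core_dist"
        extra = root / "extra_dist"
        (core / "demo_ns").mkdir(parents=True)
        (extra / "demo_ns" / "sub").mkdir(parents=True)
        (core / "demo_ns" / "__init__.py").write_text("")
        (extra / "demo_ns" / "sub" / "__init__.py").write_text("")
        merged = merge_site_packages([core, extra], root / "merged")
        return sorted(
            p.relative_to(merged).as_posix()
            for p in merged.rglob("*") if p.is_file()
        )


print(demo_merge())
```

Once everything sits under a single root, the regular `__init__.py` at the top of the package no longer hides the sub-packages, because they all live in the same directory.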