vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.23k stars, 590 forks

Support CPython 3.11, 3.12, and aarch64 processors #2331

Open ddelange opened 1 year ago

ddelange commented 1 year ago

Hoi 👋

linux-aarch64 makes up almost 10% of all platforms (ref https://github.com/giampaolo/psutil/pull/2103)

aarch64 has already surpassed windows in terms of downloads for this package. Oracle, Amazon, Google, and Microsoft are all offering aarch64 cloud instances at an undeniable price point compared to amd/intel, so the demand will undoubtedly only grow

the wheels from this PR can be installed with:

# space-separated list for --find-links
export PIP_FIND_LINKS=https://github.com/ddelange/vaex/releases/expanded_assets/core-v4.17.1.post4
pip install --force-reinstall vaex

fixes #2366, fixes #2368, fixes #2397

maartenbreddels commented 1 year ago

Hoi 👋

exciting, will take a look early next week!

  • manylinux takes around 2.5hrs per wheel and alpine arm64 up to 4 hrs

that worries me a bit.. :)

groeten,

Maarten

ddelange commented 1 year ago

here are all timings: https://github.com/ddelange/vaex/actions/runs/3965720337/usage

depending on how often a month you release vaex, this could eat into the 2k free minutes of GH...

as the parallelization is maximised and they're pushed to PyPI as soon as they're built, most of the wheels will be available soon upon release regardless

here are all the wheels: distributions.zip

ddelange commented 1 year ago

interestingly, that was 8260 minutes ^

apparently that's OK? then I don't understand their explanation 🤔 https://docs.github.com/en/billing/managing-billing-for-github-actions/about-billing-for-github-actions#included-storage-and-minutes

ddelange commented 1 year ago

ah there is a fair amount of duplication in that usage table for whatever reason 🤯

ddelange commented 1 year ago

a diff of current PyPI vs the zip above:

 vaex_core-4.16.1-cp310-cp310-macosx_10_9_x86_64.whl
 vaex_core-4.16.1-cp310-cp310-macosx_11_0_arm64.whl
-vaex_core-4.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+vaex_core-4.16.1-cp310-cp310-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp310-cp310-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp310-cp310-musllinux_1_1_aarch64.whl
 vaex_core-4.16.1-cp310-cp310-musllinux_1_1_x86_64.whl
 vaex_core-4.16.1-cp310-cp310-win_amd64.whl
+vaex_core-4.16.1-cp311-cp311-macosx_10_9_x86_64.whl
+vaex_core-4.16.1-cp311-cp311-macosx_11_0_arm64.whl
+vaex_core-4.16.1-cp311-cp311-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp311-cp311-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp311-cp311-musllinux_1_1_aarch64.whl
+vaex_core-4.16.1-cp311-cp311-musllinux_1_1_x86_64.whl
+vaex_core-4.16.1-cp311-cp311-win_amd64.whl
 vaex_core-4.16.1-cp36-cp36m-macosx_10_9_x86_64.whl
-vaex_core-4.16.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+vaex_core-4.16.1-cp36-cp36m-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp36-cp36m-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp36-cp36m-musllinux_1_1_aarch64.whl
 vaex_core-4.16.1-cp36-cp36m-musllinux_1_1_x86_64.whl
 vaex_core-4.16.1-cp36-cp36m-win_amd64.whl
 vaex_core-4.16.1-cp37-cp37m-macosx_10_9_x86_64.whl
-vaex_core-4.16.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+vaex_core-4.16.1-cp37-cp37m-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp37-cp37m-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp37-cp37m-musllinux_1_1_aarch64.whl
 vaex_core-4.16.1-cp37-cp37m-musllinux_1_1_x86_64.whl
 vaex_core-4.16.1-cp37-cp37m-win_amd64.whl
 vaex_core-4.16.1-cp38-cp38-macosx_10_9_x86_64.whl
 vaex_core-4.16.1-cp38-cp38-macosx_11_0_arm64.whl
-vaex_core-4.16.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+vaex_core-4.16.1-cp38-cp38-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp38-cp38-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp38-cp38-musllinux_1_1_aarch64.whl
 vaex_core-4.16.1-cp38-cp38-musllinux_1_1_x86_64.whl
 vaex_core-4.16.1-cp38-cp38-win_amd64.whl
 vaex_core-4.16.1-cp39-cp39-macosx_10_9_x86_64.whl
 vaex_core-4.16.1-cp39-cp39-macosx_11_0_arm64.whl
-vaex_core-4.16.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
+vaex_core-4.16.1-cp39-cp39-manylinux_2_28_aarch64.whl
+vaex_core-4.16.1-cp39-cp39-manylinux_2_28_x86_64.whl
+vaex_core-4.16.1-cp39-cp39-musllinux_1_1_aarch64.whl
 vaex_core-4.16.1-cp39-cp39-musllinux_1_1_x86_64.whl
 vaex_core-4.16.1-cp39-cp39-win_amd64.whl
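
A listing like the one above can be reproduced with a few lines of Python. This is a minimal sketch, not the tooling actually used in this thread: `parse_wheel` and `diff_wheels` are hypothetical helper names, and the parser assumes the five-field wheel filename layout (name, version, Python tag, ABI tag, platform tag) with no optional build tag, which holds for the filenames shown here.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python: str
    abi: str
    platform: str


def parse_wheel(filename):
    """Split a wheel filename into its five dash-separated fields.

    Assumes there is no optional build tag in the filename.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    stem = filename[: -len(".whl")]
    name, version, python, abi, platform = stem.split("-")
    return WheelTags(name, version, python, abi, platform)


def diff_wheels(old, new):
    """Return (removed, added) wheel filenames as sorted lists."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


# A small excerpt of the cp310 entries from the listing above.
pypi = [
    "vaex_core-4.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "vaex_core-4.16.1-cp310-cp310-win_amd64.whl",
]
built = [
    "vaex_core-4.16.1-cp310-cp310-manylinux_2_28_aarch64.whl",
    "vaex_core-4.16.1-cp310-cp310-manylinux_2_28_x86_64.whl",
    "vaex_core-4.16.1-cp310-cp310-win_amd64.whl",
]

removed, added = diff_wheels(pypi, built)
for filename in removed:
    print("-" + filename)
for filename in added:
    print("+" + filename)
```

Parsing the platform tag separately makes it easy to filter the result, for example to list only the new `aarch64` builds.
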

ddelange commented 1 year ago

I'm guessing this is blocked by https://github.com/vaexio/vaex/pull/2339

maartenbreddels commented 1 year ago

Just letting you know i'm very busy and had a vacation. Yes, I'll try to get https://github.com/vaexio/vaex/pull/2339 green first!

ddelange commented 1 year ago

fwiw there are now third party free minutes on native arm64 machines, to get rid of the slow qemu builds

maartenbreddels commented 12 months ago

Could you try rebasing this?

ddelange commented 12 months ago

@maartenbreddels already merged in master 👍

ddelange commented 12 months ago

    ERROR: Could not find a version that satisfies the requirement vaex-core<4.17,>=4.17.0 (from vaex)
    ERROR: No matching distribution found for vaex-core<4.17,>=4.17.0

maartenbreddels commented 12 months ago

Yeah, a bug/artifact of our release script. Should be good now.

ddelange commented 11 months ago

hoi @maartenbreddels 👋

I pulled master and fixed merge conflicts, but it looks like CI is still not very happy. Seeing errors like hdf file missing on disk, and TypeError: train() got an unexpected keyword argument 'early_stopping_rounds'.

Do you think it might be related to this PR?

franz101 commented 10 months ago

Just wondering here about the Python packaging: Python 3.6 and 3.7 are now deprecated. On the other hand, can we bump to 3.10 and 3.11?

to-bee commented 10 months ago

Do we have any updates on this MR?

ddelange commented 10 months ago

Hi @maartenbreddels 👋

Was your s3 account deleted by any chance?

vaex.open('s3://vaex/taxi/yellow_taxi_2009_2015_f32.hdf5?anon=true')

raises

FileNotFoundError: [Errno 2] Path does not exist 'vaex/taxi/yellow_taxi_2009_2015_f32.hdf5'. Detail: [errno 2] No such file or directory

ddelange commented 9 months ago

As of October 2nd, Python 3.12 is in general availability. Might as well include it here? cibuildwheel should start building the wheels automatically now that cp312 is GA (it parses vaex's python_requires), so no additional action is needed probably. Some dependencies might lack 3.12 wheels as of now, so users would build them from source.

setu4993 commented 8 months ago

Hey folks, what's the ETA on this one? I see it's been on and off for ~9 months now. Would be great to have Python 3.11 support.

ddelange commented 8 months ago

@maartenbreddels we might have to drop support for cp36 and cp37: cp37-musllinux_aarch64.log.txt

edit: failures are only for cp37-musllinux_aarch64 and cp38-musllinux_aarch64.

meanwhile, I've added two commits above to upload wheels as github release assets on my fork.

so now, the wheels from this PR can be installed with:

pip install vaex --force-reinstall --find-links https://github.com/ddelange/vaex/releases/expanded_assets/core-v4.17.1.post4

please report issues back here :)

ddelange commented 8 months ago

3.12 wheel build is not happy yet, and the traceback isn't really helpful here: cp312-manylinux_x86_64.txt

Henkhogan commented 8 months ago

3.12 wheel build is not happy yet, and the traceback isn't really helpful here: cp312-manylinux_x86_64.txt

I guess it's the same problem as here: https://stackoverflow.com/questions/77274572/multiqc-modulenotfounderror-no-module-named-imp

ddelange commented 8 months ago

Hi @Henkhogan 👋

Great catch! vaex still uses the imp module to load the version. imp has long been deprecated and was removed in py3.12. Let me fix that :)
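
For reference, a common replacement for loading a source file by path without `imp` goes through `importlib.util`. The sketch below is only an illustration and not necessarily what the commit in this PR does: the helper name `load_module_from_path`, the `_version.py` file name, and the version string are all hypothetical.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, path):
    """Load a Python source file as a module without the removed `imp` module."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway version file. The file name and contents are
# hypothetical stand-ins, not vaex's actual layout.
with tempfile.TemporaryDirectory() as tmp:
    version_path = os.path.join(tmp, "_version.py")
    with open(version_path, "w") as f:
        f.write('__version__ = "0.0.1"\n')
    version_module = load_module_from_path("demo_version", version_path)

loaded_version = version_module.__version__
print(loaded_version)
```

The same three calls (`spec_from_file_location`, `module_from_spec`, `exec_module`) work on every Python version from 3.5 onwards, so a `setup.py` using them does not need a version check.
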

ddelange commented 8 months ago

See commit above. Wheels are building now, let's see

ddelange commented 8 months ago

fwiw @maartenbreddels the official (PyPA) way of using git tags is by switching from setup.py to pyproject.toml, and adding setuptools_scm and (in the case of this repository) using the tag_regex param to get the git tag of the corresponding subpackage ref https://setuptools-scm.readthedocs.io/en/latest/config/#configuration-parameters.

Here's a reference PR, including dynamically populating the __version__ variable in __init__.py
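
To make the per-subpackage tag idea concrete, here is a small sketch of matching a tag such as `core-v4.17.1.post4` (the tag format seen in the release URL in this thread) and extracting the version for one subpackage. The pattern and the `version_for` helper are assumptions made for illustration; this is not a drop-in value for the setuptools_scm `tag_regex` setting, whose exact requirements are described in the linked docs.

```python
import re

# Hypothetical pattern for tags shaped like "<subpackage>-v<version>",
# e.g. "core-v4.17.1.post4". Only plain releases and .postN are handled.
TAG_PATTERN = re.compile(
    r"^(?P<package>[a-z][a-z0-9_-]*?)-v"
    r"(?P<version>\d+(?:\.\d+)*(?:\.post\d+)?)$"
)


def version_for(tag, package):
    """Return the version encoded in `tag` if it belongs to `package`, else None."""
    match = TAG_PATTERN.match(tag)
    if match is None or match.group("package") != package:
        return None
    return match.group("version")


print(version_for("core-v4.17.1.post4", "core"))
print(version_for("core-v4.17.1.post4", "ml"))
```

Filtering on the package prefix is what lets several subpackages in one repository each pick up only their own tags.
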

ddelange commented 8 months ago

cp312 wheels coming online 🎉

updated the pip install link in my earlier comment

EwoutH commented 8 months ago

Thanks a lot for this effort! What's needed to merge this PR, get a new release tagged and Python 3.12 wheels uploaded to PyPI?

ddelange commented 8 months ago

@EwoutH CI is still failing due to https://github.com/vaexio/vaex/pull/2331#issuecomment-1702344845

EwoutH commented 8 months ago

Can we (temporarily) host these files somewhere else? Maybe even here on GitHub, or as a gist?

ddelange commented 8 months ago

@maartenbreddels do you have the CI files somewhere?

longmathemagician commented 7 months ago

What's the path to getting this merged? There's a lot downstream being blocked here.

to-bee commented 6 months ago

Any news here?

ddelange commented 6 months ago

looks like the author is busy:) in the meantime, you can use the pip install command in the PR description, or add the --find-links ... part from it to your own pip install command, or to a separate line in a requirements file that you pass to pip install -r.

franz101 commented 5 months ago

we miss you @maartenbreddels

maartenbreddels commented 4 months ago

we miss you @maartenbreddels

Thank you. I have indeed been very busy (mostly on https://github.com/widgetti/solara/ ) but I do like to keep maintaining vaex at a minimum.

I'm gonna do my best to get this PR in

Was your s3 account deleted by any chance?

I did move the files internally in the bucket, because the aws-s3 bill was getting large (maybe some external CI's running on this file as well). I'll try to fix this so that CI at least runs green.

I do like to keep Python 3.6 and 3.7 in if possible, depending on the amount of work that is required.

maartenbreddels commented 4 months ago

We got some code rot in vaex-ml, @JovanVeljanoski would be great if you can take a look as vaex-ml expert :)

ddelange commented 4 months ago

hoi @maartenbreddels :wave:

I do like to keep Python 3.6 and 3.7 in if possible, depending on the amount of work that is required.

they're not deprecated, you can see the cp36 and cp37 wheels at https://github.com/ddelange/vaex/releases/expanded_assets/core-v4.17.1.post4

JovanVeljanoski commented 4 months ago

Pushed some changes that should fix the failing tests in vaex-ml

maartenbreddels commented 4 months ago

Thank you @JovanVeljanoski ! This is starting to look good, I need to fix those files that are missing now, I'm happy to fix that. The Python 3.6 and 3.7 failures with micromamba I could use some help with.

JovanVeljanoski commented 4 months ago

Looks like lightgbm>4. is not available via conda-forge for python < 3.8. I will attempt to install it via pip to see if that helps.

ddelange commented 4 months ago

base_url = 's3://vaex'

    @pytest.mark.slow
    @pytest.mark.parametrize("base_url", ["gs://vaex-data", "s3://vaex"])
    def test_cloud_glob(base_url):
>       assert set(vaex.file.glob(f'{base_url}/testing/*.hdf5', fs_options=fs_options)) >= ({f'{base_url}/testing/xys-masked.hdf5', f'{base_url}/testing/xys.hdf5'})
E       AssertionError: assert set() >= {'s3://vaex/testing/xys-masked.hdf5', 's3://vaex/testing/xys.hdf5'}
E        +  where set() = set([])
E        +    where [] = <function glob at 0x7f8447317f28>('s3://vaex/testing/*.hdf5', fs_options={'anonymous': 'true'})
E        +      where <function glob at 0x7f8447317f28> = <module 'vaex.file' from '/home/runner/work/vaex/vaex/packages/vaex-core/vaex/file/__init__.py'>.glob
E        +        where <module 'vaex.file' from '/home/runner/work/vaex/vaex/packages/vaex-core/vaex/file/__init__.py'> = vaex.file

tests/cloud_dataset_test.py:45: AssertionError

maartenbreddels commented 4 months ago

The hash issues are due to https://github.com/dask/dask/pull/10876. I think it's easier to pin dask to <2024.2.0, keep doing that for a while, and see what changes in the future (will they keep changing, or will they revert to the same result as before this release).
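
As a rough illustration of what the `<2024.2.0` pin admits, the sketch below compares calendar-style version strings using only the standard library. The helper names are hypothetical, and pre-release or local version suffixes are ignored; real tooling would normally rely on pip's own resolver or the `packaging` library instead.

```python
import re


def release_tuple(version):
    """Return the leading numeric release segments, e.g. '2024.1.1' -> (2024, 1, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_below(version, upper_bound):
    """True if `version` sorts strictly below `upper_bound` (release segments only)."""
    a, b = release_tuple(version), release_tuple(upper_bound)
    # Pad with zeros so that '2024.2' and '2024.2.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


for candidate in ["2023.12.1", "2024.1.1", "2024.2.0"]:
    print(candidate, is_below(candidate, "2024.2.0"))
```
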

maartenbreddels commented 4 months ago

Getting greener, but seeing micromamba failing often, and hanging of tests on OSX.

ddelange commented 4 months ago

hmm, looks like micromamba is still flaky. maybe relevant? https://stackoverflow.com/a/77333269/5511061

ddelange commented 4 months ago

macos seems to be consistently hanging on https://github.com/vaexio/vaex/blob/master/tests/ml/cluster_test.py

any ideas there @JovanVeljanoski?

EwoutH commented 4 months ago

Can we make this more manageable by splitting it into multiple smaller PRs? Like:

I feel the size and complication of this PR now holds this effort back.

EwoutH commented 4 months ago

With #2417 and #2414 I started with two small steps.

franz101 commented 3 months ago

I compiled a list of all stable releases during the time the last build was working: https://github.com/vaexio/vaex/pull/2417#issuecomment-1985489149

I'm not sure which package is causing the hanging tests. I noticed we pinned pytest-async to 0.15, while the latest is 0.23.5. Furthermore, catboost may need to be pinned.

to-bee commented 2 months ago

Hi there. Any plans to release this soonish? Really appreciated!

ddelange commented 2 months ago

@to-bee it would be a great help if you can install the wheels (see PR description) and report back your environment info + whether the wheels work in your environment!

to-bee commented 2 months ago

@ddelange yes sure. The wheels are working fine for me. Could install without any problems. python 3.12.3, Apple M1, ARM64_T6000 arm64, macOS 14.1.1