vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Vaex uses more threads than it should / documentation is incomplete #2231

Closed vladmihaisima closed 1 year ago

vladmihaisima commented 1 year ago

Description

The page https://vaex.readthedocs.io/en/latest/conf.html#thread-count states that setting vaex.settings.main.thread_count sets the thread count, but in practice this only happens when the environment variable VAEX_NUM_THREADS is set.

This was observed during a run on a server where the actual number of processors (96) was much larger than the number of threads the batch system allowed for this particular vaex program. The process was killed, even though the program set vaex.settings.main.thread_count to a small number of threads.

Checking the code seems to confirm this: thread_count_default is set at https://github.com/vaexio/vaex/blob/98b9f7924d57bf5cc75212182bf0d283d7ebc059/packages/vaex-core/vaex/multithreading.py#L21 but, as far as I can tell, that code does not consult the settings at all.
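To illustrate the suspected mechanism, here is a minimal self-contained sketch. It is not vaex's actual code: the Settings class, the DEMO_NUM_THREADS variable and both pool_size functions are hypothetical names chosen for this example. It shows how a default computed once at import time cannot see a setting that is assigned afterwards.

```python
import os


class Settings:
    """Hypothetical stand-in for a settings object (not vaex's class)."""

    def __init__(self):
        self.thread_count = None


settings = Settings()

# Evaluated exactly once, the way a module-level default is at import time.
# DEMO_NUM_THREADS is an illustrative variable name, not a vaex setting.
THREAD_COUNT_DEFAULT = int(os.environ.get("DEMO_NUM_THREADS", os.cpu_count() or 1))


def pool_size_import_time():
    # Ignores the settings object entirely: only the import-time value counts.
    return THREAD_COUNT_DEFAULT


def pool_size_lazy():
    # Consults the settings object on every call, then falls back to the default.
    if settings.thread_count is not None:
        return settings.thread_count
    return THREAD_COUNT_DEFAULT


settings.thread_count = 2
print(pool_size_import_time())  # still the import-time default
print(pool_size_lazy())         # honours the later assignment
```

If the real code behaves like pool_size_import_time, that would explain why only the environment variable has an effect.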

Software information

Additional information

Code to reproduce:

import vaex
vaex.settings.main.thread_count = 2
vaex.settings.main.thread_count_io = 2

import vaex.dataframe
df = vaex.example()
print(f"vaex.dataframe.main_executor.thread_pool.nthreads={vaex.dataframe.main_executor.thread_pool.nthreads}")

results in the following (incorrect) output:

vaex.dataframe.main_executor.thread_pool.nthreads=12

While if you run

VAEX_NUM_THREADS=2 python3 test.py

you get the correct result:

vaex.dataframe.main_executor.thread_pool.nthreads=2
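Until a fix is released, a possible workaround is to set the environment variable from inside the script, before vaex is imported. This is a minimal sketch that assumes the variable is read when vaex is first imported, as the behaviour above suggests. The helper name limit_vaex_threads is hypothetical, and the vaex import is left commented out so the snippet runs without vaex installed.

```python
import os


def limit_vaex_threads(n: int) -> None:
    """Hypothetical helper: export VAEX_NUM_THREADS for the current process."""
    if n < 1:
        raise ValueError("thread count must be a positive integer")
    os.environ["VAEX_NUM_THREADS"] = str(n)


# Must run before the first `import vaex` anywhere in the process
# (assumption: the variable is only read at import time).
limit_vaex_threads(2)

# import vaex
# df = vaex.example()
```

Setting the variable in the batch job script itself, as in the command line above, has the same effect and avoids depending on import order.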

maartenbreddels commented 1 year ago

Thanks, this is fixed in https://github.com/vaexio/vaex/pull/2268 and should be out in the next release (4.15).