vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

apply function not running properly on large datasets #644

Open anubhav-nd opened 4 years ago

anubhav-nd commented 4 years ago

Hi,

I have a dataset with ~1 million rows and I want to run an apply function on one column (not the best approach, but let's say we want to do it anyway). I am doing it as follows:

import vaex as vx
import hashlib

def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()

data = vx.open('somefile.arrow') # this file is ~ 500MB on disk
data['hash'] = data.apply(get_hash, arguments=[data.column1])
data = data.materialize('hash')
data.export('somefile.arrow')

When I try to access data.hash in the new file, I get the following error:

Error evaluating: ValueError('array is of length 262144, while the length of the DataFrame is 1179500')

The same steps work fine on datasets with far fewer rows, say ~100k.

The only reference to this problem that I could find is https://github.com/vaexio/vaex/issues/355, but there the problem is said to affect datasets larger than the machine's memory, which I don't think applies in my case.

Am I doing something wrong here?
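One way to narrow this down is to check the applied function on its own, outside vaex. The sketch below reuses get_hash from the report and runs it over a few made-up strings (the sample values are assumptions, not data from the file). It only shows that the function returns exactly one digest per input, so a length mismatch is unlikely to come from get_hash itself.

```python
import hashlib

def get_hash(x):
    # Same function as in the report: SHA-256 hex digest of a UTF-8 string.
    return hashlib.sha256(x.encode('utf-8')).hexdigest()

# Made-up sample values standing in for data.column1 (an assumption).
sample = ['alpha', 'beta', 'gamma', '']

hashes = [get_hash(value) for value in sample]

# One output per input, and every digest is a 64-character hex string.
assert len(hashes) == len(sample)
assert all(len(h) == 64 for h in hashes)
```

If this passes, the function is length-preserving, and the export or re-open step becomes the more likely place to look.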

JovanVeljanoski commented 4 years ago

Hi @anubhav-nd, I do not see anything wrong in your code example; in fact, it should work. I have not been able to reproduce the issue locally: the code example below works just fine (10M rows). You can try it yourself:

import vaex
import vaex.ml
import hashlib
import numpy as np

def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()

# Create a mock dataset
df_titanic = vaex.ml.datasets.load_titanic()

names = np.random.choice(df_titanic['name'].values, size=10_000_000)

df_mock = vaex.from_arrays(names=names)

# Hashing
df_mock['hash'] = df_mock.apply(f=get_hash, arguments=['names'])
df_mock = df_mock.materialize('hash')

# Export
df_mock.export_arrow('/Users/jovan/Desktop/mock.arrow')

If the above example works for you (the file created will be just under 1GB), that leads me to believe it is something to do with the arrow file itself. Looking at the error message, do you perhaps do some filtering, selections or slicing? The error message is very similar to doing this:

import vaex
import vaex.ml
import numpy as np  # needed for np.arange below

df_titanic = vaex.ml.datasets.load_titanic()

df_filter = df_titanic[df_titanic.age > 45]

print(len(df_filter))   # The result here is 155

my_index = np.arange(155)
df_filter['my_index'] = my_index

which raises:

ValueError: Array is of length 155, while the length of the DataFrame is 155 due to the filtering, the (unfiltered) length is 1309.

Hope this helps.
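The filtered-versus-unfiltered length mismatch described above can be illustrated without vaex. The following is a minimal sketch in plain Python, with made-up ages and a hypothetical helper scatter_to_full_length. It is not how vaex handles filters internally; it only shows why an array sized for the filtered view does not fit the stored column, and one way to pad it back to full length.

```python
# Made-up ages standing in for a column; None marks a missing value.
ages = [22, 51, None, 47, 30, 63]

# The filter keeps rows with age > 45, similar in spirit to df_titanic[df_titanic.age > 45].
mask = [age is not None and age > 45 for age in ages]
filtered_length = sum(mask)      # rows visible through the filter
unfiltered_length = len(ages)    # rows actually stored

# An index built for the filtered view is shorter than the stored column.
my_index = list(range(filtered_length))

def scatter_to_full_length(values, mask, fill=None):
    """Hypothetical helper: place filtered-length values at the positions the mask keeps."""
    it = iter(values)
    return [next(it) if keep else fill for keep in mask]

full_index = scatter_to_full_length(my_index, mask)
print(filtered_length, unfiltered_length, full_index)
```

The padded list has the unfiltered length, so it lines up row by row with the stored data while the filtered-out positions hold the fill value.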

anubhav-nd commented 4 years ago

hi @JovanVeljanoski,

Thanks for such a quick turnaround. Unfortunately, the example that you've shared is giving me the same problem. mock.arrow gets created without any issue, size on disk ~945MB. But when I open it as follows, I get the exact same error.

data = vx.open('mock.arrow')
data.hash

output is

Expression = hash
Length: 10,000,000 dtype: str (column)
--------------------------------------
Error evaluating: ValueError('array is of length 262144, while the length of the DataFrame is 10000000')

Is something wrong with my vaex installation or settings?

anubhav-nd commented 4 years ago

Update: Tried the same thing on a linux machine. Getting the exact same error.

Systems Tried:

Mac OSX, Python 3.7.3
Ubuntu 16.04, Python 3.7.6

JovanVeljanoski commented 4 years ago

Hi,

Seems like an installation or dependency error perhaps. Can you tell me which version of pyarrow you have?

Also, can you try repeating the above exercise but exporting the file to HDF5 with vaex? Does that work?

Cheers!

anubhav-nd commented 4 years ago

The pyarrow version was 0.15.1. I then upgraded it to 0.16.0. This also gives the same error.

I tried exporting mock.arrow to hdf5 as follows:

data = vx.open('mock.arrow')
data.export_hdf5('mock.hdf5')

This gives the following error:

ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 97, in evaluate
    result = self[expression]
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 144, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'str_byte_length(names)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 215, in execute
    cancel=cancel, unpack=True):
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 58, in map
    for i, value in enumerate(iterator):
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 47, in wrapped
    callable(self.local.index, *args, **kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in process
    block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in <dictcomp>
    block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 103, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 128, in __getitem__
    self.values[variable] = self.df.columns[variable][offset+self.i1:offset+self.i2]
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 411, in __getitem__
    return self.trim(start, stop)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 418, in trim
    byte_end = self.indices[i2:i2+1][0] - self.offset
IndexError: index 0 is out of bounds for axis 0 with size 0
Traceback (most recent call last):
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 97, in evaluate
    result = self[expression]
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 144, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'str_byte_length(names)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 5616, in export_hdf5
    vaex.export.export_hdf5(self, path, column_names, byteorder, shuffle, selection, progress=progress, virtual=virtual, sort=sort, ascending=ascending)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/export.py", line 311, in export_hdf5
    vaex.hdf5.export.export_hdf5(**kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/hdf5/export.py", line 164, in export_hdf5
    byte_length = dataset[column_name].str.byte_length().sum(selection=selection)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/expression.py", line 491, in sum
    return self.ds.sum(**kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 852, in sum
    return self._compute_agg('sum', expression, binby, limits, shape, selection, delay, edges, progress)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 686, in _compute_agg
    return self._delay(delay, var)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 1428, in _delay
    self.execute()
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 293, in execute
    self.executor.execute()
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 215, in execute
    cancel=cancel, unpack=True):
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 58, in map
    for i, value in enumerate(iterator):
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 47, in wrapped
    callable(self.local.index, *args, **kwargs)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in process
    block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in <dictcomp>
    block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 103, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 128, in __getitem__
    self.values[variable] = self.df.columns[variable][offset+self.i1:offset+self.i2]
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 411, in __getitem__
    return self.trim(start, stop)
  File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 418, in trim
    byte_end = self.indices[i2:i2+1][0] - self.offset
IndexError: index 0 is out of bounds for axis 0 with size 0

Does this give any hint as to where the problem might be?

anubhav-nd commented 4 years ago

On second thought, would it be possible for you to share the output of pip freeze on your system? I can then compare it against mine and check whether we have a dependency mismatch.
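As a lighter alternative to a full pip freeze, the versions of only the relevant packages can be collected with the standard library. This is a sketch assuming Python 3.8 or newer (for importlib.metadata); collect_versions is a hypothetical helper, and the package list is a guess at what matters here.

```python
from importlib import metadata

def collect_versions(package_names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Packages that seem relevant to this issue (an assumption; adjust as needed).
packages = ['vaex', 'vaex-core', 'vaex-arrow', 'vaex-hdf5', 'pyarrow', 'numpy']

for name, version in collect_versions(packages).items():
    print(f'{name}: {version or "not installed"}')
```

Running the same snippet on two machines gives two short lists that are easy to compare side by side.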

anubhav-nd commented 4 years ago

hi,

Just hoping for an update on this. I am still stuck here. Any help is appreciated.

maartenbreddels commented 4 years ago

Hi. I tried to reproduce it, and I can. Using Jovan's code, if I open 'mock.arrow', it works, but the data inside doesn't seem valid. E.g. len(df) != len(df.columns['names']).
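The check mentioned above, comparing len(df) with the length of each stored column, can be wrapped in a small helper. This is a sketch: find_length_mismatches is a hypothetical function that assumes only that the object supports len() and exposes a columns mapping of name to array, as the expression above suggests. FakeFrame is a stand-in so the example runs without vaex; the numbers are taken from the error message earlier in the thread.

```python
def find_length_mismatches(df):
    """Return {column name: stored length} for columns whose length differs from len(df)."""
    expected = len(df)
    return {
        name: len(column)
        for name, column in df.columns.items()
        if len(column) != expected
    }

class FakeFrame:
    """Stand-in for a DataFrame: a reported row count plus a dict of stored columns."""
    def __init__(self, length, columns):
        self._length = length
        self.columns = columns

    def __len__(self):
        return self._length

# A frame that reports 10,000,000 rows but stores only 262,144 values for 'names'.
broken = FakeFrame(10_000_000, {
    'names': range(262_144),
    'hash': range(10_000_000),
})
print(find_length_mismatches(broken))
```

An empty result means every stored column agrees with the reported row count; any entry points at a column that was written or read back incompletely.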

maartenbreddels commented 4 years ago

I've tested, and this branch/PR #517 has no problems. Are you comfortable doing a dev install and working with that branch for now? https://docs.vaex.io/en/latest/installing.html#for-developers might help.

anubhav-nd commented 4 years ago

@maartenbreddels Thanks a lot for pointing me to this branch. Yes, it will work for now.

anubhav-nd commented 4 years ago

Sorry for reporting back only after a few days (I had to work on an urgent project).

branch/PR https://github.com/vaexio/vaex/pull/517 worked. Is there some visibility on when this fix will be available in the main branch?

maartenbreddels commented 4 years ago

Thanks for the feedback, at least 1 month away I think.

anubhav-nd commented 4 years ago

Thanks. Looking forward to it :)

xinnyuann commented 4 years ago

Thanks for the feedback! I had the same error, but it happened after opening a large arrow file with vaex.open() (24 GB, shape: (264934825, 13)), when I tried to convert a string date column to 'datetime64[ns]'. My pyarrow version is 0.16.0. Is there any update on this issue?

JovanVeljanoski commented 4 years ago

Can you try updating pyarrow? I think the latest version is 1.0.1.

xinnyuann commented 4 years ago

Thanks Jovan, I tried the same thing on pyarrow 1.0.1 and am still getting the same error. Here are the versions of my packages:

pyarrow 1.0.1
vaex 3.0.0
vaex-arrow 0.5.1
vaex-astro 0.7.0
vaex-core 2.0.3
vaex-hdf5 0.6.0
vaex-jupyter 0.5.2
vaex-ml 0.9.0
vaex-server 0.3.1
vaex-viz 0.4.0

Are there any other ways I can try and see if we can solve this?

crliu95 commented 3 years ago

I've tested, and this branch/PR #517 has no problems. Are you comfortable doing a dev install, and work with that branch for now? https://docs.vaex.io/en/latest/installing.html#for-developers might help.

Hi @maartenbreddels and @JovanVeljanoski ,

I have been struggling with a similar bug for days. I am extracting some key information from text with a regex. Here is an example:

import re
import vaex
path_list = ['e10_zw_000.hdf5', 'e10_zw_001.hdf5', 'e10_zw_002.hdf5', 'e10_zw_003.hdf5']
test_df = vaex.open_many(path_list)

re_pat = re.compile(r'Judge(.*?)\n')
def test_func(text_str):
    return re.findall(re_pat, text_str)

test_df['result'] = test_df.document.apply(test_func)  # the document column is quite long, e.g. around 14k characters
test_df.head(24)  # test_df.head(23) displays fine; head(24) fails
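One detail worth checking in the function above is that re.findall returns a list, so each row yields a variable-length (possibly empty) result rather than a single value. A possible workaround, offered here as an untested guess and not something confirmed in this thread, is to return one string per row. The sketch below uses a hypothetical extract_judges function and a made-up separator.

```python
import re

re_pat = re.compile(r'Judge(.*?)\n')

def extract_judges(text_str, sep='|'):
    """Return all matches joined into a single string ('' when nothing matches)."""
    matches = re_pat.findall(text_str)
    return sep.join(match.strip() for match in matches)

# Made-up documents for illustration.
print(extract_judges('Judge Smith\nJudge Lee\n'))
print(repr(extract_judges('no match in this text')))
```

Because every row then produces exactly one string, the result is a flat column regardless of how many matches each document contains.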

The test_df.head(24) call produces the following errors:

ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex:error evaluating: result at rows 220667-220672
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3567, in table_part
    values[name] = df.evaluate(name)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex:error evaluating: result at rows 220667-220672
Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3567, in table_part
    values[name] = df.evaluate(name)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
    return self._delay(delay, task)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
    self.execute()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
    just_run(self.execute_async())
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
    result = coro.throw(exc)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
    await self.executor.execute_async()
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
    cancel=lambda: self._cancel(run), unpack=True, run=run):
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
    value = await value
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
    yield self  # This tells Task to wait for completion.
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
    future.result()
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
    raise self._exception
  File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
    task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
    self.values.append(self._map(thread_index, i1, i2, *blocks))
  File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
    arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)

This bug is odd because the errors show up only when I apply the function to a large enough sample: for example, calling test_df.head(23) works, but test_df.head(24) triggers these errors (though it still returns part of the result, for the top 23 rows, I think).
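For anyone trying to read the ValueError above: the snippet below is a minimal NumPy-only sketch of one way such a broadcast error can arise, assuming the destination array ends up with a zero-length second axis. The helper try_assign and the shapes are hypothetical, chosen only to mirror the `arrays[expr][i1:i2] = blocks[i]` line in the traceback; it is an illustration, not a diagnosis of what vaex does internally.

```python
import numpy as np

def try_assign(dest_shape, i1, i2, block):
    """Hypothetical helper: write `block` into rows i1:i2 of a fresh array.

    Returns None on success, or the ValueError message on failure.
    """
    dest = np.zeros(dest_shape)
    try:
        dest[i1:i2] = block  # same pattern as arrays[expr][i1:i2] = blocks[i]
    except ValueError as exc:
        return str(exc)
    return None

block = np.arange(5)  # a 1-D chunk of 5 values

# 1-D destination: the slice has shape (5,), so the assignment succeeds.
print(try_assign((10,), 0, 5, block))

# 2-D destination with an empty second axis: the slice has shape (5, 0),
# and a length-5 1-D block cannot be broadcast into it.
print(try_assign((10, 0), 0, 5, block))
```

If this assumption holds, the interesting question is why the output array would be allocated with a shape like (n, 0) once the sample grows past a certain size.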

I am using Python 3.6.12 in Miniconda on a non-admin user account in Windows 10. I have tried several newer versions of vaex, either via pip install vaex==4.0.0a13 or by manual installation (with admin privileges) from the GitHub source, following the docs. I still got the same bug every time.

The only working solution for me so far is to (1) switch to the admin account; (2) install the newest MS VS build tools, as suggested when trying to manually install vaex from GitHub; (3) manually install vaex (with admin privileges) from the GitHub source, following the docs.

But when I switched back to the non-admin user account and repeated these steps, the errors popped up again, which was really frustrating. Since it is quite inconvenient for me to use the admin account, I am still looking for a proper, convenient fix for this bug.

Thanks in advance to anyone who can help.