Open anubhav-nd opened 4 years ago
Hi @anubhav-nd, I do not see anything wrong in your code example; in fact, it should work. I have not been able to reproduce the issue locally: this code example works just fine (10M rows), and you can try it out yourself:
import vaex
import vaex.ml
import hashlib
import numpy as np
def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()
# Create a mock dataset
df_titanic = vaex.ml.datasets.load_titanic()
names = np.random.choice(df_titanic['name'].values, size=10_000_000)
df_mock = vaex.from_arrays(names=names)
# Hashing
df_mock['hash'] = df_mock.apply(f=get_hash, arguments=['names'])
df_mock = df_mock.materialize('hash')
# Export
df_mock.export_arrow('/Users/jovan/Desktop/mock.arrow')
If the above example works for you (the file created will be just under 1GB), that leads me to believe it is something to do with the arrow file itself. Looking at the error message, do you perhaps do some filtering, selections or slicing? The error message is very similar to doing this:
import numpy as np
import vaex
import vaex.ml
df_titanic = vaex.ml.datasets.load_titanic()
df_filter = df_titanic[df_titanic.age > 45]
print(len(df_filter)) # The result here is 155
my_index = np.arange(155)
df_filter['my_index'] = my_index
which raises:
ValueError: Array is of length 155, while the length of the DataFrame is 155 due to the filtering, the (unfiltered) length is 1309.
Hope this helps.
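To make the length mismatch in the filtered example concrete, here is a minimal pure-Python sketch. It assumes the filter is available as a list of booleans with one entry per unfiltered row. The helper name `expand_to_unfiltered` and the sample ages are hypothetical and not part of vaex. The idea is to spread the 155-style "one value per filtered row" array over the unfiltered length, so that the new column is as long as the underlying data.

```python
def expand_to_unfiltered(values, mask, fill=None):
    """Place `values` at the positions where `mask` is True; use `fill` elsewhere.

    Hypothetical helper for illustration only, not a vaex function.
    """
    if len(values) != sum(mask):
        raise ValueError("expected one value per row that passes the filter")
    remaining = iter(values)
    return [next(remaining) if keep else fill for keep in mask]


# Toy stand-in for the titanic 'age' column (made-up values).
ages = [22, 50, 61, 30, 47]
mask = [age > 45 for age in ages]      # plays the role of df_titanic.age > 45
my_index = list(range(sum(mask)))      # one value per filtered row

full_column = expand_to_unfiltered(my_index, mask)
print(len(full_column) == len(ages))   # the column now matches the unfiltered length
```

A column built this way has the unfiltered length, which is the length the error message above says is expected; whether this is the right fix depends on what the new column is meant to represent.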
hi @JovanVeljanoski,
Thanks for such a quick turnaround. Unfortunately, the example that you've shared gives me the same problem. mock.arrow gets created without any issue (size on disk ~945MB), but when I open it as follows, I get the exact same error.
data = vx.open('mock.arrow')
data.hash
output is
Expression = hash
Length: 10,000,000 dtype: str (column)
--------------------------------------
Error evaluating: ValueError('array is of length 262144, while the length of the DataFrame is 10000000')
Something wrong with my vaex installation or settings?
Update: Tried the same thing on a linux machine. Getting the exact same error.
Systems tried:
Mac OSX, Python 3.7.3
Ubuntu 16.04, Python 3.7.6
Hi,
Seems like an installation or dependency error perhaps. Can you tell me which version of pyarrow you have?
Also, can you try repeating the above exercise but exporting the file to HDF5 with vaex? Does that work?
Cheers!
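One way to collect the package versions asked about above is sketched below. It uses only the standard library (`importlib.metadata`, which assumes Python 3.8 or newer). The helper name `installed_versions` and the list of package names are illustrative choices, not something prescribed in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package_name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are assumptions; adjust them to whatever is relevant.
report = installed_versions(["pyarrow", "vaex", "vaex-core", "vaex-arrow", "vaex-hdf5"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

The printed lines can be pasted into a reply as-is, which makes it easier to compare two environments side by side.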
The pyarrow version was 0.15.1. I then upgraded it to 0.16.0. This also gives the same error.
I tried exporting mock.arrow to hdf5 as follows:
data = vx.open('mock.arrow')
data.export_hdf5('mock.hdf5')
This gives the following error:
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 97, in evaluate
result = self[expression]
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 144, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'str_byte_length(names)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 215, in execute
cancel=cancel, unpack=True):
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 58, in map
for i, value in enumerate(iterator):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 47, in wrapped
callable(self.local.index, *args, **kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in process
block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in <dictcomp>
block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 103, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 128, in __getitem__
self.values[variable] = self.df.columns[variable][offset+self.i1:offset+self.i2]
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 411, in __getitem__
return self.trim(start, stop)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 418, in trim
byte_end = self.indices[i2:i2+1][0] - self.offset
IndexError: index 0 is out of bounds for axis 0 with size 0
Traceback (most recent call last):
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 97, in evaluate
result = self[expression]
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 144, in __getitem__
raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'str_byte_length(names)'"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 5616, in export_hdf5
vaex.export.export_hdf5(self, path, column_names, byteorder, shuffle, selection, progress=progress, virtual=virtual, sort=sort, ascending=ascending)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/export.py", line 311, in export_hdf5
vaex.hdf5.export.export_hdf5(**kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/hdf5/export.py", line 164, in export_hdf5
byte_length = dataset[column_name].str.byte_length().sum(selection=selection)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/expression.py", line 491, in sum
return self.ds.sum(**kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 852, in sum
return self._compute_agg('sum', expression, binby, limits, shape, selection, delay, edges, progress)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 686, in _compute_agg
return self._delay(delay, var)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 1428, in _delay
self.execute()
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/dataframe.py", line 293, in execute
self.executor.execute()
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 215, in execute
cancel=cancel, unpack=True):
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 58, in map
for i, value in enumerate(iterator):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
yield fs.pop().result()
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
return self.__get_result()
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/multithreading.py", line 47, in wrapped
callable(self.local.index, *args, **kwargs)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in process
block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/execution.py", line 192, in <dictcomp>
block_dict = {expression: block_scope.evaluate(expression) for expression in expressions}
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 103, in evaluate
result = eval(expression, expression_namespace, self)
File "<string>", line 1, in <module>
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/scopes.py", line 128, in __getitem__
self.values[variable] = self.df.columns[variable][offset+self.i1:offset+self.i2]
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 411, in __getitem__
return self.trim(start, stop)
File "/Users/anubhav-macbookpro/Work/KPI-Regression/python3.7_tmp/lib/python3.7/site-packages/vaex/column.py", line 418, in trim
byte_end = self.indices[i2:i2+1][0] - self.offset
IndexError: index 0 is out of bounds for axis 0 with size 0
Does this give some hint to where the problem might be?
On second thought, would it be possible for you to share the output of pip freeze on your system? I can then compare it against mine and figure out if we have some dependency mismatch.
hi,
just hoping for some update on this. I am still stuck here. Any help is appreciated
Hi. I tried to reproduce it, and I can. Using Jovan's code, if I open 'mock.arrow', it works, but the data inside doesn't seem valid. E.g. len(df) != len(df.columns['names']).
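The consistency check mentioned above can be applied to every column at once. The sketch below is a generic version that works on any mapping from column name to a sized container. The function name `find_length_mismatches` and the toy data are hypothetical.

```python
def find_length_mismatches(columns, expected_length):
    """Return {column_name: actual_length} for columns whose length differs."""
    return {
        name: len(data)
        for name, data in columns.items()
        if len(data) != expected_length
    }


# Toy data standing in for an opened file: 'hash' is one row short.
columns = {"names": ["a", "b", "c"], "hash": ["x", "y"]}
mismatches = find_length_mismatches(columns, expected_length=3)
print(mismatches)
```

With vaex, one might call it as `find_length_mismatches(df.columns, len(df))`, assuming `df.columns` behaves like a mapping of column name to array, as the expression `df.columns['names']` above suggests.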
I've tested, and this branch/PR #517 has no problems. Are you comfortable doing a dev install, and work with that branch for now? https://docs.vaex.io/en/latest/installing.html#for-developers might help.
@maartenbreddels Thanks a lot for pointing me to this branch. Yes, it will work for now.
Sorry, reporting this after a few days (had to work on an urgent project).
branch/PR https://github.com/vaexio/vaex/pull/517 worked. Is there some visibility on when this fix will be available in the main branch?
Thanks for the feedback, at least 1 month away I think.
Thanks. Looking forward to it :)
Thanks for the feedback! I had the same error, but it happened after calling vaex.open() on a large arrow file (24 GB, shape: (264934825, 13)), when I tried to convert a string date column to 'datetime64[ns]'. My pyarrow version is 0.16.0. Is there any update on this issue?
Can you try updating pyarrow? I think the latest version is 1.0.1.
Thanks Jovan, I tried same thing on pyarrow 1.0.1, still getting the same error. Here's the version of my packages
pyarrow 1.0.1
vaex 3.0.0
vaex-arrow 0.5.1
vaex-astro 0.7.0
vaex-core 2.0.3
vaex-hdf5 0.6.0
vaex-jupyter 0.5.2
vaex-ml 0.9.0
vaex-server 0.3.1
vaex-viz 0.4.0
Are there any other ways I can try and see if we can solve this?
I've tested, and this branch/PR #517 has no problems. Are you comfortable doing a dev install, and work with that branch for now? https://docs.vaex.io/en/latest/installing.html#for-developers might help.
Hi @maartenbreddels and @JovanVeljanoski ,
I have been struggling with a similar bug for days. I am extracting some key information from text with a regex. Here is an example:
import re
import vaex
path_list = ['e10_zw_000.hdf5', 'e10_zw_001.hdf5', 'e10_zw_002.hdf5', 'e10_zw_003.hdf5']
test_df = vaex.open_many(path_list)
re_pat = re.compile(r'Judge(.*?)\n')
def test_func(text_str):
    return re.findall(re_pat, text_str)
test_df['result'] = test_df.document.apply(test_func)  # the document column holds long strings, e.g. around 14k characters
test_df.head(24)  # displaying the data with test_df.head(23) works fine
And I got the following errors:
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
values = dict(zip(column_names, df.evaluate(column_names)))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex:error evaluating: result at rows 220667-220672
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
values = dict(zip(column_names, df.evaluate(column_names)))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3567, in table_part
values[name] = df.evaluate(name)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
values = dict(zip(column_names, df.evaluate(column_names)))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
ERROR:MainThread:vaex:error evaluating: result at rows 220667-220672
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3562, in table_part
values = dict(zip(column_names, df.evaluate(column_names)))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 3567, in table_part
values[name] = df.evaluate(name)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 2631, in evaluate
return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5535, in _evaluate_implementation
df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 323, in map_reduce
return self._delay(delay, task)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 1497, in _delay
self.execute()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 306, in execute
just_run(self.execute_async())
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
return loop.run_until_complete(coro)
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
return f.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 182, in _step
result = coro.throw(exc)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 311, in execute_async
await self.executor.execute_async()
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 182, in execute_async
cancel=lambda: self._cancel(run), unpack=True, run=run):
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 88, in map_async
value = await value
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 327, in __iter__
yield self # This tells Task to wait for completion.
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\tasks.py", line 250, in _wakeup
future.result()
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\asyncio\futures.py", line 243, in result
raise self._exception
File "C:\Users\temp_user\miniconda3\envs\vaex_env\lib\concurrent\futures\thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 84, in <lambda>
iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\multithreading.py", line 76, in wrapped
return callable(self.local.index, *args, **kwargs, **kwargs_extra)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\execution.py", line 244, in process_part
task._parts[thread_index].process(thread_index, i1, i2, filter_mask, *blocks)
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\cpu.py", line 126, in process
self.values.append(self._map(thread_index, i1, i2, *blocks))
File "c:\users\temp_user\vaex\packages\vaex-core\vaex\dataframe.py", line 5533, in assign
arrays[expr][i1:i2] = blocks[i]
ValueError: could not broadcast input array from shape (5) into shape (5,0)
This bug is kind of weird, since the errors show up only when I apply the function to large enough samples. For example, calling test_df.head(23) is OK, but test_df.head(24) ends up with these errors (though it still gives part of the results, for the top 23 rows, I think).
I am using Python 3.6.12 in Miniconda on a non-admin user account on Windows 10. I have tried several updated versions of vaex, either via pip install vaex==4.0.0a13 or via a manual installation (with admin privileges) from the GitHub source, following the docs. But I still got the same bug every time.
The only working solution for me so far is to (1) switch to the admin account; (2) install the newest MS VS build tools, as suggested when trying to manually install vaex from GitHub; (3) manually install (with admin privileges) from the GitHub source, following the docs.
But when I switched back to the non-admin user account and replicated these things, the ERRORs popped up again, which was really frustrating. Since it is quite inconvenient for me to use the admin account, I am still looking forward to the final & handy solution to this bug.
Thanks in advance to anyone who can help.
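For readers trying to make sense of the final ValueError in the traceback above, here is a minimal NumPy-only sketch that triggers the same kind of broadcast failure. It does not use vaex and is not a reproduction of the bug itself. The buffer `out` and its shape (10, 0) are assumptions chosen purely for illustration: the idea is that a 1-D block of 5 values cannot be written into a slice whose second dimension is empty. The exact formatting of the shapes in the message may differ between NumPy versions.

```python
import numpy as np

# Hypothetical output buffer with an empty second dimension (shape (10, 0)).
# This stands in for the target array in the traceback; it is an assumption,
# not the actual array vaex allocates.
out = np.empty((10, 0))

# A 1-D block of 5 values, like the 5-row block in the traceback.
block = np.arange(5)

try:
    # The target slice has shape (5, 0); the input has shape (5,).
    out[0:5] = block
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

If the target had been a plain 1-D array of length 10, the same assignment would succeed, so the unexpected part is the shape of the target rather than the block being written.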
Hi,
I have a dataset with ~1 million rows and I want to run an apply function on a column (not the best thing to do, but let's say we just want to do it). I am doing this the following way:
When I try to access data.hash on the new file, I get the following error:
Error evaluating: ValueError('array is of length 262144, while the length of the DataFrame is 1179500')
The same steps work fine on datasets with far fewer rows, say ~100k.
The only reference to this problem that I could find was at https://github.com/vaexio/vaex/issues/355, but there it is said that the problem occurs for datasets larger than the machine's memory, which I don't think is the situation in my case.
Am I doing something wrong here?
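As a side note on the numbers in the error message, the short sketch below does some plain-Python arithmetic on them. It only uses the two lengths quoted in the error and the ~100k figure mentioned above. Reading 262144 as a chunk size is an assumption made here for illustration; the thread does not confirm it.

```python
# Lengths quoted in the error message.
ARRAY_LENGTH = 262_144
DATAFRAME_LENGTH = 1_179_500

# Approximate size of the smaller datasets that reportedly work.
SMALL_DATASET = 100_000

# 262144 is an exact power of two (2**18).
is_power_of_two = ARRAY_LENGTH & (ARRAY_LENGTH - 1) == 0
exponent = ARRAY_LENGTH.bit_length() - 1

# ASSUMPTION: treat 262144 as a chunk size and see how the full
# DataFrame would split into chunks of that length.
full_chunks, remainder = divmod(DATAFRAME_LENGTH, ARRAY_LENGTH)

# A ~100k-row dataset would fit inside a single chunk of that length.
fits_in_one_chunk = SMALL_DATASET <= ARRAY_LENGTH

print(is_power_of_two, exponent)   # True 18
print(full_chunks, remainder)      # 4 130924
print(fits_in_one_chunk)           # True
```

If that reading is right, it would be consistent with the smaller datasets working: they never span more than one chunk. This is a guess from the numbers alone, not a diagnosis.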