vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Certain characters cannot be used for virtual column names #1559

Open SyureNyanko opened 3 years ago

SyureNyanko commented 3 years ago

Hi. 👋

Certain characters cannot be used in virtual column names, and some multi-byte characters appear to be among them, for example "１", "２", "Ⅰ", and "Ⅱ".

import vaex

data = {'A':[1,2,3],'B':['a','b','c']}
df = vaex.from_dict(data)
df["Ⅰ"] = df['A']
print(df)

The error message below says "RuntimeError: Oops, requesting column I from dataset, but it does not exist", but I used "Ⅰ" (the Roman numeral character), not "I" (the Latin letter).

(base) PS C:\Users\heya\vaex-test> python .\main.py
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\execution.py", line 250, in execute_async
    raise RuntimeError(f'Oops, requesting column {column} from dataset, but it does not exist')
RuntimeError: Oops, requesting column I from dataset, but it does not exist
ERROR:MainThread:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 3815, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 2850, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 6069, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 402, in map_reduce
    return self._delay(delay, task)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 1521, in _delay
    self.execute()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 385, in execute
    just_run(self.execute_async())
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\heya\miniconda3\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "C:\Users\heya\miniconda3\lib\asyncio\futures.py", line 201, in result
    raise self._exception
  File "C:\Users\heya\miniconda3\lib\asyncio\tasks.py", line 256, in __step
    result = coro.send(None)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 389, in execute_async
    await self.executor.execute_async()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\execution.py", line 250, in execute_async
    raise RuntimeError(f'Oops, requesting column {column} from dataset, but it does not exist')
RuntimeError: Oops, requesting column I from dataset, but it does not exist

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\execution.py", line 250, in execute_async
    raise RuntimeError(f'Oops, requesting column {column} from dataset, but it does not exist')
RuntimeError: Oops, requesting column I from dataset, but it does not exist
ERROR:MainThread:vaex:error evaluating: Ⅰ at rows 0-3
Traceback (most recent call last):
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 3815, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 2850, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 6069, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 402, in map_reduce
    return self._delay(delay, task)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 1521, in _delay
    self.execute()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 385, in execute
    just_run(self.execute_async())
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\heya\miniconda3\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "C:\Users\heya\miniconda3\lib\asyncio\futures.py", line 201, in result
    raise self._exception
  File "C:\Users\heya\miniconda3\lib\asyncio\tasks.py", line 256, in __step
    result = coro.send(None)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 389, in execute_async
    await self.executor.execute_async()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\execution.py", line 250, in execute_async
    raise RuntimeError(f'Oops, requesting column {column} from dataset, but it does not exist')
RuntimeError: Oops, requesting column I from dataset, but it does not exist

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 3820, in table_part
    values[name] = df.evaluate(name)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 2850, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 6069, in _evaluate_implementation
    df.map_reduce(assign, lambda *_: None, expression_to_evaluate, ignore_filter=False, selection=selection, pre_filter=use_filter, info=True, to_numpy=False)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 402, in map_reduce
    return self._delay(delay, task)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 1521, in _delay
    self.execute()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 385, in execute
    just_run(self.execute_async())
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\asyncio.py", line 35, in just_run
    return loop.run_until_complete(coro)
  File "C:\Users\heya\miniconda3\lib\site-packages\nest_asyncio.py", line 70, in run_until_complete
    return f.result()
  File "C:\Users\heya\miniconda3\lib\asyncio\futures.py", line 201, in result
    raise self._exception
  File "C:\Users\heya\miniconda3\lib\asyncio\tasks.py", line 256, in __step
    result = coro.send(None)
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\dataframe.py", line 389, in execute_async
    await self.executor.execute_async()
  File "c:\users\heya\vaex-test\vaex\packages\vaex-core\vaex\execution.py", line 250, in execute_async
    raise RuntimeError(f'Oops, requesting column {column} from dataset, but it does not exist')
RuntimeError: Oops, requesting column I from dataset, but it does not exist
  #    A  B    Ⅰ
  0    1  a    error
  1    2  b    error
  2    3  c    error
(base) PS C:\Users\heya\vaex-test> 

If there is any information you need, please let me know. Thank you!

Software information

SyureNyanko commented 3 years ago

I've looked through the vaex source code, and here are some notes.
Vaex appears to use Python's ast module to parse query-like expressions such as df['a > 10'] in expresso.py (around line 634). As a result, when a virtual column is used, the characters allowed in column names seem to be limited to those allowed in Python variable names. For example, "😊😊" cannot be used as a column name:

import vaex
data = {'A':[1,2,3],'B':['a','b','c']}
df = vaex.from_dict(data)
df["😊😊"] = df['A']
print(df)
#  AttributeError: 'Subscript' object has no attribute 'id'
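
As a rough way to see which names Python itself would accept as variable names, the built-in str.isidentifier() can be used. This is only a sketch of the Python-side rule; the helper name is hypothetical, and whether vaex applies exactly this check internally is an assumption, not something confirmed from its source.

```python
def is_python_identifier(name: str) -> bool:
    """Return True if Python would accept `name` as a variable name (keywords aside)."""
    return name.isidentifier()


candidates = {
    "ascii letter": "A",
    "roman numeral one": "\u2160",          # the single character U+2160
    "var + fullwidth one": "var\uff11",     # "var" followed by U+FF11
    "two emoji": "\U0001F60A\U0001F60A",    # U+1F60A twice
}

# Map each label to whether the name is a valid Python identifier.
results = {label: is_python_identifier(name) for label, name in candidates.items()}

for label, ok in results.items():
    print(f"{label}: {ok}")
```

The emoji name is the only one rejected here, which fits the observation that it fails in a different way from the Roman numeral.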

Also, "1" and "１" (the full-width digit) are treated as the same character by Python's ast, which leads to inconsistent behavior in vaex:

(base) PS C:\Users\heya\vaex-test> python 
Python 3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> var1 = "hogehoge"
>>> print(var1)
hogehoge
>>> print(var１)
hogehoge
>>>
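
The behavior in the session above matches how Python handles identifiers: they are converted to Unicode normal form NFKC while parsing, so the name stored in the AST can differ from the text that was typed. The sketch below shows this with the standard ast and unicodedata modules only. The helper names (name_seen_by_parser, survives_parsing) are hypothetical, and using such a check to validate vaex column names is an assumption rather than anything vaex provides.

```python
import ast
import keyword
import unicodedata


def name_seen_by_parser(column_name: str) -> str:
    """Parse a bare name as an expression and return the identifier stored in the AST."""
    node = ast.parse(column_name, mode="eval").body
    if not isinstance(node, ast.Name):
        raise ValueError(f"{column_name!r} did not parse as a single name")
    return node.id


def survives_parsing(column_name: str) -> bool:
    """True if the name is a non-keyword identifier that NFKC leaves unchanged."""
    return (
        column_name.isidentifier()
        and not keyword.iskeyword(column_name)
        and unicodedata.normalize("NFKC", column_name) == column_name
    )


roman_one = "\u2160"      # ROMAN NUMERAL ONE, looks like the Latin letter I
fullwidth = "var\uff11"   # "var" followed by FULLWIDTH DIGIT ONE

print(name_seen_by_parser(roman_one))   # the parser reports the Latin letter "I"
print(name_seen_by_parser(fullwidth))   # the parser reports "var1"
print(survives_parsing(roman_one))      # False: the name changes under NFKC
print(survives_parsing("A"))            # True
```

A name for which survives_parsing returns False is one where the column name and the name produced by the parser can disagree, which would explain an error about column "I" when the column was created as "Ⅰ".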

Please tell me if I'm wrong. Thanks!