Open anubhav-nd opened 4 years ago
Thanks, the workaround is vaex.open('mock.arrow').as_numpy(), as indicated in #517.
I'll keep this open until it is fixed.
Hi,
The workaround above worked, but when I try to groupby on string/float columns I get the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
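For context, NumPy raises this exact message when a multi-element array is used where a single integer index is expected. The snippet below is a minimal sketch of that general situation, independent of vaex; it is an assumption that something similar happens inside groupby, not a confirmed diagnosis. The names values, index and picked are made up for illustration.

```python
import numpy as np

values = ["a", "b", "c"]   # a plain Python list
index = np.array([0, 2])   # a multi-element integer array

try:
    # A list accepts only a single integer or a slice as index,
    # so NumPy cannot convert the whole array to one scalar index.
    values[index]
except TypeError as exc:
    message = str(exc)
    print(message)

# Indexing a NumPy array instead does accept an array of indices.
picked = np.array(values)[index]
print(picked)
```

If the same pattern occurs internally, it would suggest that a non-NumPy container is being indexed with a NumPy array somewhere in the groupby path.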
Can you provide a code example that triggers this?
The minimum working example is as follows:
import vaex
import vaex.ml
import hashlib
import numpy as np
def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()
# Create a mock dataset
df_titanic = vaex.ml.datasets.load_titanic()
names = np.random.choice(df_titanic['name'].values, size=10_000_000)
survived = np.random.choice(df_titanic['survived'].values, size=10_000_000)
df_mock = vaex.from_arrays(names=names, survived=survived)
# Hashing
df_mock['hash'] = df_mock.apply(f=get_hash, arguments=['names'])
df_mock = df_mock.materialize('hash')
# Export
df_mock.export_arrow('mock.arrow')
# Open and test
data = vaex.open('mock.arrow').as_numpy()
data_f = data.filter(data.names.isin(['Fortune, Mr. Mark', 'Kreuchen, Miss. Emilie']))
# This line throws the error
group = data_f.groupby(data_f.names, agg={'counts': vaex.agg.count('survived')})
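For reference, the result this filter and groupby are expected to produce can be sketched in plain Python, without vaex. This is only an illustration of the intended semantics: the sample rows are made up (only the two filtered names come from the example above), and the variable names rows, wanted, filtered and counts are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows (name, survived); not real data from mock.arrow.
rows = [
    ("Fortune, Mr. Mark", 0),
    ("Kreuchen, Miss. Emilie", 1),
    ("Fortune, Mr. Mark", 0),
    ("Allen, Miss. Elisabeth Walton", 1),
]

wanted = {"Fortune, Mr. Mark", "Kreuchen, Miss. Emilie"}

# Rough equivalent of data.filter(data.names.isin([...]))
filtered = [(name, survived) for name, survived in rows if name in wanted]

# Rough equivalent of grouping by name and counting non-missing 'survived' values
counts = Counter(name for name, survived in filtered if survived is not None)

for name, n in sorted(counts.items()):
    print(name, n)
```

The groupby should yield one row per name that passes the filter, with the number of matching rows in the counts column.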
Hi,
I hope this was reproducible. Any idea whether this is a bug, or am I doing something wrong?
It now seems to work with that branch (I rebased it), but it will now give issues when using
data = vaex.open('mock.arrow', as_numpy=False)
Correction: that branch executes all tests with as_numpy=True, since we don't support computing with Arrow arrays yet. Once Arrow gets more compute kernels, we will try.
This is regarding PR: https://github.com/vaexio/vaex/pull/517
I moved to this branch as it resolves ISSUE: https://github.com/vaexio/vaex/issues/644
PyArrow version: 0.17.0
NumPy version: 1.18.3
df.groupby is failing with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'string_sequence'
Minimal reproducible example (mostly copied from issue https://github.com/vaexio/vaex/issues/644):
Should I share the full traceback of the error?
Also, when I ran this as part of my own code, it gave the following error, which differs from the one in the minimal example:
File "pyarrow/table.pxi", line 151, in pyarrow.lib.ChunkedArray.__getitem__
TypeError: key must either be a slice or integer