vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

group by not working in PR: 517 #694

Open anubhav-nd opened 4 years ago

anubhav-nd commented 4 years ago

This is regarding PR: https://github.com/vaexio/vaex/pull/517

I moved to this branch as it resolves ISSUE: https://github.com/vaexio/vaex/issues/644

PyArrow version: 0.17.0
NumPy version: 1.18.3

df.groupby fails with the following error:

AttributeError: 'numpy.ndarray' object has no attribute 'string_sequence'

Minimal reproducible example (mostly copied from issue https://github.com/vaexio/vaex/issues/644):

import vaex
import vaex.ml
import hashlib
import numpy as np

def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()

# Create a mock dataset
df_titanic = vaex.ml.datasets.load_titanic()

names = np.random.choice(df_titanic['name'].values, size=10_000_000)
survived = np.random.choice(df_titanic['survived'].values, size=10_000_000)
df_mock = vaex.from_arrays(names=names, survived=survived)

# Hashing
df_mock['hash'] = df_mock.apply(f=get_hash, arguments=['names'])
df_mock = df_mock.materialize('hash')

# Export
df_mock.export_arrow('mock.arrow')

# Open and test
data = vaex.open('mock.arrow')
group = data.groupby(data.survived, agg={'counts': vaex.agg.count('survived')})

Should I share the full traceback of the error?

Also, when I ran this as part of my own code, it raised a different error from the one in the minimal example:

File "pyarrow/table.pxi", line 151, in pyarrow.lib.ChunkedArray.__getitem__
TypeError: key must either be a slice or integer

maartenbreddels commented 4 years ago

Thanks, the workaround is vaex.open('mock.arrow').as_numpy(), as indicated in #517.

I'll keep this open until it is fixed.

anubhav-nd commented 4 years ago

Hi,

The workaround above worked, but when I try to group by string or float columns I get the following error: TypeError: only integer scalar arrays can be converted to a scalar index.
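For context, this message comes from NumPy itself: it is raised whenever an array that is not a single integer is used where one integer index is expected. The sketch below reproduces the message with plain NumPy and a Python list. It is not a diagnosis of the vaex failure, and the helper name scalar_index_error is made up for illustration.

```python
import numpy as np


def scalar_index_error(container, index):
    """Return the TypeError message raised by container[index], or None.

    Hypothetical helper for illustration only; not part of vaex or NumPy.
    """
    try:
        container[index]
    except TypeError as exc:
        return str(exc)
    return None


values = ["a", "b", "c"]

# A 0-d integer array converts cleanly to one scalar index, so no error.
no_error = scalar_index_error(values, np.array(1))

# A 1-d array with several entries cannot be converted to one scalar index.
message = scalar_index_error(values, np.array([0, 2]))
print(message)
```

Seeing this message in a traceback usually means that some code path received a whole array where it expected one integer position, so the frames just above the failing line in the full traceback are the place to look.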

maartenbreddels commented 4 years ago

Can you provide a code example that triggers this?

anubhav-nd commented 4 years ago

A minimal example that triggers the error:

import vaex
import vaex.ml
import hashlib
import numpy as np

def get_hash(x):
    return hashlib.sha256(x.encode('utf-8')).hexdigest()

# Create a mock dataset
df_titanic = vaex.ml.datasets.load_titanic()

names = np.random.choice(df_titanic['name'].values, size=10_000_000)
survived = np.random.choice(df_titanic['survived'].values, size=10_000_000)
df_mock = vaex.from_arrays(names=names, survived=survived)

# Hashing
df_mock['hash'] = df_mock.apply(f=get_hash, arguments=['names'])
df_mock = df_mock.materialize('hash')

# Export
df_mock.export_arrow('mock.arrow')

# Open and test
data = vaex.open('mock.arrow').as_numpy()
data_f = data.filter(data.names.isin(['Fortune, Mr. Mark', 'Kreuchen, Miss. Emilie']))

# This line raises the error
group = data_f.groupby(data_f.names, agg={'counts': vaex.agg.count('survived')})

anubhav-nd commented 4 years ago

Hi,

I hope this was reproducible. Is this a bug, or am I doing something wrong?

maartenbreddels commented 4 years ago

It now seems to work with that branch (I rebased it), but it will still give issues when using

data = vaex.open('mock.arrow', as_numpy=False)

maartenbreddels commented 4 years ago

Correction: that branch runs all tests with as_numpy=True, since we don't support computing with Arrow arrays yet. Once Arrow gets more compute kernels, we will try.