vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.26k stars 590 forks

Trouble marking columns as categorical #421

Open picoDoc opened 5 years ago

picoDoc commented 5 years ago

So first of all, great work! I've been playing with vaex for a while and it's a very powerful library; I'm looking forward to seeing it progress.

I work with a lot of time-series data, which is very well suited to vaex, and have been experimenting with it. One issue I've come across is the support for categoricals.

I've added a notebook gist with a detailed example of my issue. Basically the issue boils down to this:

# If I mark a column as a category
data = data.ordinal_encode('sample_categorical_column', inplace=True)

# and export out to hdf5 I get an error (see gist for details of the error)
data.export_hdf5('data/test_with_categoricals.hdf5')

I also tried marking columns as categorical after loading hdf5 data from disk, but the column then appears as numeric, and if I want to filter I have to do something clunky like this:

filter_pair = [i for i,x in enumerate(data_from_disk.category_labels(data_from_disk.pair)) if x == 'USD/GBP'][0]
data_from_disk[data_from_disk.pair == filter_pair]

Is there a better way to work with categorical data in vaex? I'm also very interested in the performance of operations on categorical columns, i.e. filtering and grouping. I was hoping that using categorical data for columns with a small number of unique values would be much faster than leaving those columns as strings.

For reference I'm using vaex 2.0.2. Any help appreciated!

JovanVeljanoski commented 5 years ago

Hi @picoDoc

Thank you for trying out vaex.

Could you share an example of the type of categorical data you are using?

I of course do not know your exact use-case, but maybe you could try something like this:

import vaex
import vaex.ml
df = vaex.ml.datasets.load_titanic()
encode = vaex.ml.LabelEncoder(features=['embarked', 'sex'], prefix='enc_')
df = encode.fit_transform(df)
print(df[['sex', 'embarked', 'enc_sex', 'enc_embarked']])

Following this, 'enc_sex' and 'enc_embarked' will behave as standard virtual columns, and you can use them like any other.
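
For example, a minimal follow-on sketch (the encoded values used below are assumptions; inspect the mapping first):

df_subset = df[df.enc_sex == 0]          # filter on an encoded value like any int column
print(df_subset.count())                 # number of matching rows
print(df.enc_embarked.value_counts())    # frequency of each encoded value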

Let me know if this is helpful. Cheers, Jovan.

picoDoc commented 5 years ago

Hey Jovan,

So the data I used in my sample notebook looks something like:

datetime,pair,bid,ask
2018-12-02D17:00:41.143000000,USD/JPY,113.632,113.702
2018-12-02D17:00:45.393000000,USD/JPY,113.631,113.701
2018-12-02D17:00:47.893000000,USD/JPY,113.642,113.691
2018-12-02D17:00:48.143000000,USD/JPY,113.64,113.693
2018-12-02D17:00:49.893000000,USD/JPY,113.639,113.692

This is fairly typical of the kind of thing I'm interested in, where there's something like an "id" column (in this case it's the currency pair, but it's not so important what it is).

Really what I'm interested in is saving my data in a format such that it's as efficient as possible to carry out operations that filter or group on this id column - for which there might only be a small number of unique values - while using minimal memory. For example if my data is very large I want to be able to filter down quickly to just rows matching a given id. I was thinking it might behave similarly to the pandas Categorical type, and might be more efficient for memory and speed than simply storing strings. What do you think is the best approach here?

I tried your example above; however, the added virtual columns don't persist if I call export_hdf5, so I imagine they won't have much performance impact, as the data is still fundamentally stored as strings on disk?

JovanVeljanoski commented 5 years ago

Hey,

So the new columns you are creating with the approach I suggested are virtual. For them to persist to disk while exporting, set virtual=False in export_hdf5.
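
A minimal sketch of that (the file name is hypothetical; the virtual keyword is as described here, and its default may differ between vaex versions):

df = encode.fit_transform(df)  # from the earlier snippet
df.export_hdf5('data/test_with_encoded.hdf5', virtual=False)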

If you can afford RAM for this one column, you could instead do:

df.materialize(virtual_column=['enc_bid'], inplace=True)

and the virtual column (in this case I assume that to be 'enc_bid') will not be calculated on the fly but be stored in memory, so all operations will be much faster.

Hope this helps! Jovan.

maartenbreddels commented 5 years ago

Hi Matt,

So first of all, great work!

Thanks :)

I work with a lot of time-series data which is very well suited to vaex

Awesome, we could also use some feedback/help in that area. At EuroScipy, together with @joy-rosie, we found some issues regarding datetime, see e.g. #407 #408. In any case, feedback from users would be welcome, contributions as well (we can guide you a bit). One feature @joy-rosie was missing was good support for resampling (aggregating per day/week/month/year, etc.). Although it can be done, it's sometimes clunky.

You're actually hitting some issues that I've been thinking about recently.

So, thanks for opening this issue; I'll keep it open so we can fix some of this.

cheers,

Maarten

picoDoc commented 5 years ago

Hey Maarten and Jovan,

Ok so I got the encoded columns saving down as Jovan suggested. Here's another gist showing what I've been playing about with. Basically, filtering and grouping on the encoded int column is a decent amount faster than on the string column, which makes sense (although I'm sure this will depend a lot on the details of the particular data and operations). But it would be much nicer to get this performance improvement while also keeping the ease of use of strings, i.e. like you say Maarten, it should behave like the original column, with the labels stored as a dictionary.

I would be very interested in helping out/contributing in this area if you guys are willing to point me in roughly the right direction?

I would also be very interested in more support for timeseries resampling, so it's cool to hear that's being looked at too :-)

JovanVeljanoski commented 5 years ago

Hi Matt,

I'm glad that you got your thing working :).

I don't know if your end goal is to build a ML pipeline (for which using transformers is the standard thing to do), or whether you are doing an exploration / analysis of past events (in which case I understand why you would not necessarily want to use a transformer, although that is my personally preferred method).

If the use case is the latter, you can consider building your own dictionary of unique elements, and then using the map method to create a virtual column with the "encoded" values. This is essentially the same thing that the LabelEncoder does: it builds a dict of unique values, and uses map to apply them.
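
A minimal sketch of that approach, assuming a string column named pair as in the sample data above (the column names are illustrative):

# build a dict mapping each unique value to an integer code
codes = {value: code for code, value in enumerate(df.pair.unique())}
# map creates a virtual column holding the encoded values
df['pair_encoded'] = df.pair.map(codes)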

You are more than welcome to contribute, I'm sure @maartenbreddels would be happy to give you some pointers, depending on what kind of things you'd be interested in doing.

lfkpoa commented 4 years ago

Hi,

I'm interested in this too. My use case is exploration. Marking a string column as category greatly reduces the memory and disk space used. In pandas we can continue to filter by the string and view the column as if it were still a string, but in vaex the category columns appear as int8/int16 dtype, all operations must be done using the category int value, and the value shown is also the int value, which is not very intuitive. I wonder if there is an easy way to work with the category label in vaex. I managed to do this with something like:

df.select(df.cat_column == df.category_labels('cat_column').index('Some Category'))

But it would be nice to be able to do something like:

df.select(df.cat_column.label == 'Some Category')

JovanVeljanoski commented 4 years ago

Hi @lfkpoa

If I understand your need.. there is nothing preventing you from simply:

df.select(df.cat_column == "Some_category")

or

df[(df.cat_column == "category_1") | (df.cat_column == "category_3")]

for example. No need to encode anything for this part.

I believe that in pandas, categories are used to save on memory. Vaex does not have this problem to begin with; here we use categorization to convert categorical columns (that are strings, for example) to integers so that one can do mathematical operations, such as binning, which may lead to a heatmap plot for example.
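
For the binning use case, a minimal sketch (enc_col and n_cat are hypothetical: an ordinal-encoded column and its number of categories):

# count rows per category by binning on the integer codes 0..n_cat-1
counts = df.count(binby=df.enc_col, limits=[0, n_cat], shape=n_cat)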

It may be worth keeping in mind that not every method of vaex corresponds directly to the methods in pandas. While we try to do this for the most frequently used methods, given how pandas and vaex work internally, some methods just need to be different.

lfkpoa commented 4 years ago

Thanks for the quick reply. I know vaex can handle big files and that it is very fast. I was really impressed. For comparison:

- My CSV file has 1.5 GB.
- The converted HDF5 file has 1.6 GB.
- The converted ARROW file has 1.6 GB without categorical columns.
- The converted ARROW file with categorical columns has 600 MB (used pandas + pyarrow).
- The converted PARQUET file has 277 MB (without categorical columns).
- The converted PARQUET file with categorical columns has 277 MB.

That's a huge difference. And I need to share the file and use it with other packages and applications. We are also currently using arrowjs to load data to the browser, and size matters. The use of a categorical column in mathematical operations is great, but I think during selections the column should behave like a string column, just more efficiently: it should check the dictionary and apply the int selection automatically, like pandas and dask do. That's much more intuitive than selecting by the int value. If it then cannot be used directly in mathematical operations (dealing with it internally), there could be a column property that returns an expression with the numerical value, or an easy way to get the numerical value. If that's not possible because it would break compatibility with previous versions, then I suggest making available a property to treat the column as a string easily. And I know, I could just use parquet, but arrowjs expects arrow buffers and some functions like countBy expect the columns to be categorical. Anyway, this is just a suggestion. Vaex is really great. Thank you.

maartenbreddels commented 4 years ago

I fully agree, and thanks for those numbers. When we have the new arrow PR in we should review this. I'd like this to change for v3. Cheers

maartenbreddels commented 4 years ago

Note that vaex does support reading dictionary encoded columns.

lfkpoa commented 4 years ago

Yes, I managed to read it from arrow and parquet files. The problem was selecting and viewing the category values (strings), because vaex treats the column as an int8/int16 column. But I managed to get the column's labels using df.category_labels. Thank you.
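
A minimal sketch of that lookup (the column and label names are hypothetical):

labels = df.category_labels('cat_column')   # list of label strings for the column
code = labels.index('Some Category')        # integer code for a given label
df_selected = df[df.cat_column == code]     # filter on the underlying int values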

malthe commented 3 years ago

@maartenbreddels is there a feature proposal on adding support for dictionary-encoded columns?

maartenbreddels commented 3 years ago

Dictionary support is still weak, but I think we should make it a higher priority after the 4.0 release. If we can add in more support without breaking things, a 4.x might have better support, otherwise 5.0.

We're already getting bitten by this with the upcoming support for partitioned datasets #1094, so we have no way out now :)

Code4SAFrankie commented 2 years ago

Vaex does not correctly read categoricals from a parquet file. I have a parquet file where all the columns are categoricals, written out via dask. When I read it using pandas pd.read_parquet, the dtypes correctly show as category; however, when reading with vaex.open, the dtypes show as dictionary. I then have to do:

dfSchema['class1'] = dfSchema['class1'].astype('str')

in order to use the column in filters; otherwise it gives me a very long error ending in ValueError: cannot create an OBJECT array from memory buffer. The categorical column values do, however, display correctly. Is there a way to do this correctly in vaex and keep the columns as categoricals?

Code4SAFrankie commented 2 years ago

Also, dfSchema.is_category('statement') shows the categorical column as not being categorical (False).

JovanVeljanoski commented 2 years ago

I am not a frequent user of dask, but is category something that is dask-specific rather than a data type? Probably dask is doing dictionary encoding, which is being exported to the parquet files and which dask interprets in a certain way. Vaex actually does something similar when you categorize an expression, but when you export, you export either floats, strings, dicts... no metadata is exported (from either side, I guess) on how those are to be interpreted.

These are just my assumptions, i could be wrong here. @maartenbreddels will know much better.

Code4SAFrankie commented 2 years ago

It was the dask categoricals. If I write out the pandas categoricals to hdf5 using:

dfp.to_hdf(path + 'yagoSchema.hdf5', key='df', mode='w', format='table')

then I cannot read the hdf5 with dfSchema = vaex.open(path + "yagoSchema.hdf5"), since it gives me OSError: Cannot open D:\Downloads\Wikipedia\Yago\yagoSchema.hdf5 nobody knows how to read it.

Code4SAFrankie commented 2 years ago

dfSchema['class1'] = dfSchema['class1'].astype('category') gives ValueError: No type alias for category, although this seems to be valid in pandas.

JovanVeljanoski commented 2 years ago

See the FAQ in the docs about why the pandas export to hdf5 does not work for you.

As for the categoricals.. I think I answered that above. There is no data type "category"; it is something that dask and/or pandas implement with a particular strategy, and it is specific to them. Vaex also has it, with its own strategy. When you export to parquet or arrow, what you end up exporting is the raw underlying data, which is probably an arrow dictionary or something.

Code4SAFrankie commented 2 years ago

Writing a parquet file with pandas has the same issue:


import pandas as pd
import vaex

dfc = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dfc["B"] = dfc["A"].astype("category")
dfc.to_parquet('dfc.parquet')
dfc = vaex.open('dfc.parquet')
print(dfc.dtypes)

Code4SAFrankie commented 2 years ago

So there is no way to get the categorical back in vaex?

JovanVeljanoski commented 2 years ago

Please read carefully what I wrote.. In fact.. you are getting the categoricals back in vaex! But you are getting them in the original format.

Let me try again

In your example, do print(type(dfc.B.values)) and you will see pandas.core.arrays.categorical.Categorical. It is something that pandas defines, and it exists only in pandas. I think (not sure) it is their way of handling/using arrow dictionaries.

So when you export that to disk in parquet format, you save it in a way that everyone can understand, not just pandas, since parquet can be read by many libraries (outside of python too). So this pandas-specific type is stored as something like an arrow dictionary.

So when you open the parquet file with vaex, the dtype is telling you that the raw data underneath is an arrow dictionary. You can check this out:

[screenshot omitted: the dictionary-encoded column, with each value's integer index shown alongside the string dictionary]
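
For illustration, a small pyarrow sketch of the same idea (this is pyarrow's own API, separate from vaex):

import pyarrow as pa

arr = pa.array(["a", "b", "c", "a"]).dictionary_encode()
print(arr.indices)     # the integer codes: [0, 1, 2, 0]
print(arr.dictionary)  # the unique values: ["a", "b", "c"]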

So you see above that each original sample has an associated index, which is the trick: instead of handling strings you are handling ints mapped to a string, for faster operations (like binning, groupby etc). Vaex will do the same thing in v5, but will not define a new "category" type like pandas; it will tell you what is really happening under the hood.

So, you do get the categorized stuff; it's just that when you do dtype you don't see "category", but the true type of the underlying data.

Pandas and vaex are not the same. The API of vaex has followed that of pandas as much as possible, but since the implementations are vastly different, some differences will appear. A completely interchangeable API will likely never happen, so you will just have to learn each technology if you really need to know the nitty-gritty details. For the basics (common things), I think it is fairly uniform.

Final note: vaex is memory efficient, so here the reason to categorize things is just for efficiency purposes (unlike in pandas, where you also get memory benefits).

Code4SAFrankie commented 2 years ago

Thanks! Makes sense, but is there an easy way of doing df[df['B']=='a'] without looking up the index first?

I tried all of these, but they all error out:

dfc[dfc['B']=='a']
dfc[dfc['B']==0]
dfc.select(dfc.B==dfc.category_labels('B').index('a'))
dfc.select(dfc.B==dfc.category_labels('B').index(0))

JovanVeljanoski commented 2 years ago

Support for that is part of this PR: https://github.com/vaexio/vaex/pull/1544

Dunno why it has not been merged, will try to get it in next week.

Thanks!

Code4SAFrankie commented 2 years ago

Thanks! And thanks for a truly amazing library and always being so quick to help!

JovanVeljanoski commented 2 years ago

You are very welcome. I am making notes on all the questions.. hopefully we can address all of these concerns in the docs, but it takes more time than one might think.. :S.