Open picoDoc opened 5 years ago
Hi @picoDoc
Thank you for trying out vaex.
May I ask if you can offer some example of the type of categorical data you are using?
I of course do not know your exact use-case, but maybe you could try something like this:
import vaex
import vaex.ml
df = vaex.ml.datasets.load_titanic()
encode = vaex.ml.LabelEncoder(features=['embarked', 'sex'], prefix='enc_')
df = encode.fit_transform(df)
print(df[['sex', 'embarked', 'enc_sex', 'enc_embarked']])
Following this, 'enc_sex' and 'enc_embarked' will behave as standard virtual columns, and you can use them like any other.
Let me know if this is helpful. Cheers, Jovan.
Hey Jovan,
So the data I used in my sample notebook looks something like:
datetime,pair,bid,ask
2018-12-02D17:00:41.143000000,USD/JPY,113.632,113.702
2018-12-02D17:00:45.393000000,USD/JPY,113.631,113.701
2018-12-02D17:00:47.893000000,USD/JPY,113.642,113.691
2018-12-02D17:00:48.143000000,USD/JPY,113.64,113.693
2018-12-02D17:00:49.893000000,USD/JPY,113.639,113.692
This is fairly typical of the kind of thing I'm interested in, where there's something like an "id" column (in this case it's the currency pair, but it's not so important what it is).
Really what I'm interested in is saving my data in a format such that it's as efficient as possible to carry out operations that filter or group on this id column - for which there might only be a small number of unique values - while using minimal memory. For example if my data is very large I want to be able to filter down quickly to just rows matching a given id. I was thinking it might behave similarly to the pandas Categorical type, and might be more efficient for memory and speed than simply storing strings. What do you think is the best approach here?
I tried your example above, however the added virtual columns don't persist if I call export_hdf5, so I imagine they won't have much performance impact, as the data is still fundamentally stored as strings on disk?
Hey,
So the new columns you are creating with the approach I suggested are virtual. For them to persist to disk while exporting, set virtual=False in export_hdf5.
If you can afford the RAM for this one column, you could instead do df.materialize(virtual_column=['enc_bid'], inplace=True) and the virtual column (in this case I assume that to be 'enc_bid') will not be calculated on the fly but stored in memory, so all operations on it will be much faster.
Hope this helps! Jovan.
Hi Matt,
So first of all great work!
Thanks :)
I work with a lot of time-series data which is very well suited to vaex
Awesome, we could also use some feedback/help in that area. At EuroScipy, together with @joy-rosie, we found some issues regarding datetimes, see e.g. #407 #408. In any case, feedback from users would be welcome, contributions as well (we can guide you a bit). One feature @joy-rosie missed was good support for resampling (aggregating per day/week/month/year etc.). Although it can be done, it's sometimes clunky.
You're actually hitting some issues that I've been thinking about recently.
So, great for opening this issue, I'll keep this open so we can fix some of this.
cheers,
Maarten
Hey Maarten and Jovan,
Ok so I got the encoded columns saving down as Jovan suggested. Here's another gist showing what I've been playing about with. Basically, filtering and grouping on the encoded int column is a decent amount faster than on the string column, which makes sense (although I'm sure this will depend a lot on the details of the particular data and operations). But it would be much nicer to get this performance improvement while also keeping the ease of use of strings, i.e. like you say Maarten, it should behave like the original column, with the labels stored as a dictionary.
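A rough illustration of why the integer codes are faster to group on (this is not vaex's internal implementation, just the general idea): with small integer codes, a group-by count becomes a direct indexed accumulation instead of string hashing. The currency-pair labels are made up:

```python
import numpy as np

# Illustration only (not vaex internals): grouping over integer codes can
# use bincount-style direct indexing instead of hashing strings.
codes = np.array([0, 1, 0, 2, 1, 0])
labels = ['USD/JPY', 'EUR/USD', 'GBP/USD']  # hypothetical label dictionary
counts = np.bincount(codes)  # counts per category code
print(dict(zip(labels, counts.tolist())))  # {'USD/JPY': 3, 'EUR/USD': 2, 'GBP/USD': 1}
```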
I would be very interested in helping out/contributing in this area if you guys are willing to point me in roughly the right direction?
I would also be very interested in more support for timeseries resampling, so it's cool to hear that's being looked at too :-)
Hi Matt,
I'm glad that you got your thing working :).
I don't know if your end goal is to build an ML pipeline (for which using transformers is the standard thing to do), or whether you are doing exploration / analysis of past events (in which case I understand why you would not necessarily want to use a transformer, although that is my personal preferred method).
If the use case is the latter, you can consider building your own dictionary of unique elements, and then use the map method to create a virtual column with the "encoded" values. This is essentially the same thing that the LabelEncoder does: it builds a dict of unique values and uses map to apply them.
You are more than welcome to contribute, I'm sure @maartenbreddels would be happy to give you some pointers, depending on what kind of things you'd be interested in doing.
Hi,
I'm interested on this too.
My use case is for exploration.
Converting a string column to a category greatly reduces the memory and disk space used.
In pandas we can continue to filter by the string and view the column as if it were still a string.
But in vaex the category columns appear as int8/int16 dtype, all operations must be done using the category's int value, and the value shown is also the int value, which is not very intuitive.
I wonder if there is an easy way to work with the category label in vaex.
I managed to do this with something like:
df.select(df.cat_column==df.category_labels('cat_column').index('Some Category'))
But it would be nice to be able to do something like:
df.select(df.cat_column.label=='Some Category')
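For reference, the pandas behaviour being described here, where a categorical column is filtered by its string label directly (column and values are illustrative):

```python
import pandas as pd

# In pandas, a categorical column can be compared against the string label
# directly; the lookup to integer codes happens under the hood.
df = pd.DataFrame({'cat_column': pd.Categorical(['a', 'b', 'a', 'c'])})
subset = df[df['cat_column'] == 'a']
print(len(subset))  # 2
```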
Hi @lfkpoa
If I understand your need.. there is nothing preventing you from simply:
df.select(df.cat_column == "Some_category")
or
df[(df.cat_column == "category_1") | (df.cat_column == "category_3")]
for example. No need to encode anything for this part.
I believe that in pandas categories are used to save on memory. Vaex does not have this problem to begin with; here we use it to convert categorical columns (that are strings, for example) to integers so that one can do mathematical operations, such as binning, which may lead to a heatmap plot for example.
It may be worth keeping in mind that not every method of vaex corresponds directly to the methods in pandas. While we try to do this for the most frequently used methods, given how pandas and vaex work internally, some methods just need to be different.
Thanks for the quick reply. I know vaex can handle big files and that it is very fast. I was really impressed.

My CSV file is 1.5 GB. The converted HDF5 file is 1.6 GB. The converted ARROW file is 1.6 GB without categorical columns, and 600 MB with categorical columns (using pandas + pyarrow). The converted PARQUET file is 277 MB both with and without categorical columns. That's a huge difference. And I need to share the file and use it with other packages and applications. We are also currently using arrowjs to load data into the browser, and size matters.

The use of a categorical column in mathematical operations is great, but I think during selections the column should behave like a string column, just more efficiently: it should check the dictionary and apply the int selection automatically, like pandas and dask do. That's much more intuitive than selecting by the int value. If the column then cannot be used directly in mathematical operations (with vaex dealing with it internally), there could be a column property that returns an expression with the numerical value, or some other easy way to get the numerical value. If that's not possible because it would break compatibility with previous versions, then I suggest making available a property to treat the column as a string easily.

And I know, I could just use parquet, but arrowjs expects arrow buffers and some functions like countBy expect the columns to be categorical. Anyway, this is just a suggestion. Vaex is really great. Thank you.
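The size savings come from dictionary encoding: the column stores small integer codes plus a single copy of each unique label. A quick pandas sketch of the memory effect (values illustrative, exact sizes will vary by platform and pandas version):

```python
import pandas as pd

# A low-cardinality string column vs its categorical version: the categorical
# stores small integer codes plus one copy of each label, so it is much smaller.
s = pd.Series(['USD/JPY', 'EUR/USD'] * 50_000)
cat = s.astype('category')
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```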
I fully agree, and thanks for those numbers. When we have the new arrow PR in we should review this. I'd like this to change for v3. Cheers
Note that vaex does support reading dictionary encoded columns.
Yes, I managed to read it from arrow and parquet files. The problem was selecting and viewing the category values (strings), because it treats it as an int8/int16 column. But I managed to get the column's labels using df.category_labels. Thank you.
@maartenbreddels is there a feature proposal on adding support for dictionary-encoded columns?
Dictionary support is still weak, but I think we should make it a higher priority after the 4.0 release. If we can add in more support without breaking things, a 4.x might have better support, otherwise 5.0.
We're already getting bitten by the upcoming support for partitioned datasets #1094 so we have no way out now :)
Vaex does not correctly read categoricals from a parquet file. I have a parquet file where all the columns are categoricals, written out via dask. When I read it using pandas pd.read_parquet the dtypes correctly show as category, however when reading with vaex.open the dtypes show as dictionary. I have to convert with dfSchema['class1'] = dfSchema['class1'].astype('str') in order to use the column in filters, otherwise it gives me a very long error ending in ValueError: cannot create an OBJECT array from memory buffer. The categorical column values do however display correctly. Is there a way to do this correctly in vaex and keep the columns as categoricals?
Also, dfSchema.is_category('statement') shows the categorical column as not being categorical (False).
I am not a frequent user of dask, but is category something that is dask-specific rather than a data type? Probably dask is doing dictionary encoding, and that is being exported to the parquet files, which dask interprets in a certain way. Vaex actually does something similar when you categorize an expression, but when you export, you export either floats, strings, dicts... no metadata is exported (from either side, I guess) on how those are to be interpreted.
These are just my assumptions, i could be wrong here. @maartenbreddels will know much better.
It was the dask categoricals. If I write out the pandas categoricals to hdf5 using dfp.to_hdf(path + 'yagoSchema.hdf5', key='df', mode='w', format='table'), I then cannot read the hdf5 using dfSchema = vaex.open(path + "yagoSchema.hdf5"), since it gives me OSError: Cannot open D:\Downloads\Wikipedia\Yago\yagoSchema.hdf5 nobody knows how to read it.
And dfSchema['class1'] = dfSchema['class1'].astype('category') gives ValueError: No type alias for category, although this seems to be valid for pandas.
See the FAQ in the docs about why the pandas export to hdf5 does not work for you.
As for the categoricals.. I think I answered that above. There is no data type category. It is something that dask and/or pandas implement with a particular strategy; it is specific to them. Vaex also has it, with its own strategy. When you export to parquet or arrow, what you end up exporting is the raw underlying data, which is probably an arrow dictionary or something.
Writing a parquet file with pandas has the same issue:
import pandas as pd
import vaex

dfc = pd.DataFrame({"A": ["a", "b", "c", "a"]})
dfc["B"] = dfc["A"].astype("category")
dfc.to_parquet('dfc.parquet')
dfc = vaex.open('dfc.parquet')
dfc.dtypes
So there is no way to get the categorical back in vaex?
Please read carefully what I wrote.. In fact.. you are getting the categoricals back in vaex! But you are getting them in the original format.
Let me try again.
In your example, do print(type(dfc.B.values)) and you will see pandas.core.arrays.categorical.Categorical. It is something that pandas defines, and exists only in pandas. I think (not sure) it is their way of handling/using arrow dictionaries.
So when you export that to disk in the parquet format, you save it in a way that everyone can understand, not just pandas, since parquet can be read by many libraries (outside of python also). So this pandas-specific type is stored as something like an arrow dictionary.
So when you open the parquet file with vaex, the dtype is telling you that the raw data underneath is an arrow dictionary. You can check this out:
So you see above that each original sample has an associated index, which is the trick: instead of handling strings you are handling ints mapped to strings, for faster operations (like binning, groupby etc.). Vaex will do the same thing in v5, but will not define a new "type" category like pandas does; it will tell you what is really happening under the hood.
So you do get the categorized stuff, it's just that when you do dtype you don't get to see "category", but the true type of the underlying data.
Pandas and vaex are not the same. The API of vaex has followed that of pandas as much as possible, but since the implementations are vastly different, some differences will appear. A completely interchangeable API will likely never happen, so you will just have to learn each technology if you really need to know the nitty-gritty details. For the basics (common things), I think it is fairly uniform.
Final note: vaex is already memory efficient, so in vaex the reason to categorize things is just for computational efficiency (unlike pandas, where you also get some memory benefits).
Thanks! Makes sense, but is there an easy way of doing df[df['B']=='a'] without looking up the index first?
I tried all of these, but they all error out:
dfc[dfc['B']=='a']
dfc[dfc['B']==0]
dfc.select(dfc.B==dfc.category_labels('B').index('a'))
dfc.select(dfc.B==dfc.category_labels('B').index(0))
Support for that is part of this PR: https://github.com/vaexio/vaex/pull/1544
Dunno why it has not been merged, will try to get it in next week.
Thanks!
Thanks! And thanks for a truly amazing library and always being so quick to help!
You are very welcome. I am making notes on all the questions.. hopefully we can address all of these concerns in the docs, but it takes more time than one might think.. :S.
So first of all great work! I've been playing with vaex for a while and it's a very powerful library, looking forward to seeing it progress.
I work with a lot of time-series data which is very well suited to vaex, and have been experimenting. One issue I've come across is the support for categoricals.
I've added a notebook gist with a detailed example of my issue. Basically the issue boils down to this:
I also tried marking columns as categorical after loading hdf5 data from disk, but the column then appears as numeric, and if I want to filter I have to do something clunky like this:
Is there a better way to work with categorical data in vaex? I'm also very interested in the performance of operations on categorical columns, i.e. filtering and grouping. I was hoping that using categorical data for columns where I have a small number of unique values would be much faster than leaving these columns as strings.
For reference I'm using vaex 2.0.2. Any help appreciated!