vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Supported column datatypes, and whether stateful ML prediction workflow is possible #952

Closed: mujina93 closed this issue 2 years ago

mujina93 commented 4 years ago

Hi!

What are the basic datatypes that vaex supports?

I couldn't find that information anywhere in the documentation.

I know you can have numeric and string columns in a vaex dataframe, but are more general containers (like columns of lists) or objects (general serializable objects, for example datetimes, which are really important) supported?

In particular, I am trying to understand if vaex is suited for supporting a workflow in which one can process columns of lists and output other lists.

E.g.

df = ...  # my vaex DataFrame

def very_complex_func(row):
    l1, l2, l3, l4, l5 = do_something_complex_on_entire(row)
    # returns 5 lists of lists (each is a list of 3 lists of ints)
    return l1, l2, l3, l4, l5

df.apply(very_complex_func, arguments=?all_columns?)
# Is it possible to generate multiple columns from `apply`?
# Or is my only way serializing the nested lists as strings, generating
# a string column as output, and then deserializing when I need them
# again?

Thanks!

JovanVeljanoski commented 4 years ago

Hi,

Maybe I am wrong on this, but I thought lists / dicts are not primitive types, since they are collections of other things.

Having said that, as vaex moves to depend on Arrow, it will get more support for list / struct types, and perhaps other more complex structures in the future. As we gain support for these kinds of data structures, we will try to implement fast/efficient methods to work with them. There is some support for working with lists now, but it is experimental (I am not sure if it is released, in master, or in a PR).

As for what works with vaex: the dtypes of numpy are fully supported, and as we move toward proper Arrow support, the dtypes of Arrow will be supported as well.

In general we try to avoid methods like .apply, using them only as a last resort, since one loses most if not all of the performance benefits vaex gives you.

The ideal workflow we envision (for the time being; this may change in the future) is to use tools like pandas to get the data into the right format (tools that give you maximum flexibility, in-memory manipulations, and so on), and export (via vaex or arrow) to .arrow or .hdf5 (memory-mappable file formats). Then use vaex to do the analysis etc. This scales to 100s of GB or billions of rows.
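
A minimal sketch of that workflow (the file names and columns here are hypothetical):

import pandas as pd
import vaex

# Flexible, in-memory wrangling with pandas first
pdf = pd.read_csv("data.csv")
pdf["total"] = pdf["price"] * pdf["quantity"]

# Convert and export to a memory-mappable format
df = vaex.from_pandas(pdf)
df.export_hdf5("data.hdf5")

# Reopen with vaex: the file is memory-mapped, so analysis scales to very large data
df = vaex.open("data.hdf5")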

I hope this helps

mujina93 commented 4 years ago

Thank you @JovanVeljanoski, very helpful!

You are right regarding the inappropriate use of "primitive" for composite types. Perhaps a better title for my issue would be "column data types".

Anyway, just to understand better whether vaex suits my needs: I have the typical machine learning setup, in which my dataframe (a vaex DataFrame) is X (the data matrix of regressors/inputs). I need to apply a transform/predict procedure (call it what you like) to each row of X to get the ys. The "procedure" is not a pure function, but a stateful and complex thing, which you can imagine as proc = lambda inputs: model.transform(inputs), where model is some stateful model/object. I need to apply the transformations sequentially, one row after the other, because my problem and my model are inherently sequential.

Therefore my requirements are: sequential (row-after-row) application of a stateful transformation, while keeping memory usage low.

Does vaex support such a workflow? (It's a pretty standard data science dataflow, except maybe for the constraint of being a sequential setting, which is less common than settings with independent samples/rows.)

If yes, how? And if the answer is "no", or "almost", what kind of hacks could one employ to achieve such a thing?

I can already tell you my current workarounds. For the returned nested structure, I'm serializing it with pickle.dumps to a bytestring that I'll deserialize when I need the Y again. And to work around the necessity of passing all arguments explicitly to apply, I created a procedure with a generic (*args) interface, and I'm passing the list of all columns from the dataframe dynamically, like:

def func(*args):
    # do something with args, knowing that the order of features is fixed
    return somethings

df.apply(func, arguments=list(df.columns.keys()))
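
Concretely, the pickle workaround looks roughly like this (a sketch; func_serialized and the y_blob column are hypothetical names):

import pickle

def func_serialized(*args):
    nested = func(*args)          # the nested lists-of-lists result
    return pickle.dumps(nested)   # store it as a bytestring

df["y_blob"] = df.apply(func_serialized, arguments=list(df.columns.keys()))
# later, when the Y is needed again:
y = pickle.loads(df["y_blob"].values[0])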

But maybe you can suggest better workarounds, or can suggest clean solutions that your API already implements.

My biggest problem so far is in understanding whether it's possible to use apply sequentially (I get errors due to what seems to be multithreaded, concurrent application of the procedure to the dataframe rows; maybe I'm wrong). And if not, whether there is another way to achieve the same result.

maartenbreddels commented 4 years ago

Perhaps a better title for my issue would be "column data types"

Basically all ndarray (numpy) types are supported, except for dtype=object, which has limited support (it cannot be serialized to disk). Apache Arrow types are supported as well (not yet against the latest Arrow release, but an alpha is coming soon).
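
For instance, a quick check of numpy-backed columns, including the datetimes asked about above (a sketch):

import numpy as np
import vaex

df = vaex.from_arrays(
    x=np.arange(5, dtype=np.float64),         # numeric numpy dtypes: fully supported
    t=np.arange(5).astype("datetime64[s]"),   # datetime64 columns work as well
)
print(df.x.dtype, df.t.dtype)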

I'm wondering why you need Vaex to do this, since all operations will be done outside of Vaex (your model code); Vaex will not speed up anything. The only thing you can gain from this is saving memory in the pre-processing or the auto pipelines.

In that case, I'd approach it similar to what we've done with wrapping incremental predictors for sklearn: https://github.com/vaexio/vaex/blob/48531b5d0ff3b8010809dc422f7e67555f0ad79b/packages/vaex-ml/vaex/ml/sklearn.py#L214

Basically, you get batches of arrays using evaluate_iterator. You can then feed these batches (or individual rows, if you process them one by one) to your model.
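
A minimal sketch of that pattern, assuming evaluate_iterator is passed a list of feature expressions and yields a list of arrays per chunk (the file, the feature columns, and the model object are hypothetical):

import numpy as np
import vaex

df = vaex.open("data.hdf5")     # hypothetical memory-mapped file
features = ["x", "y", "z"]      # hypothetical feature columns
model = ...                     # your stateful model object

predictions = []
# evaluate_iterator yields (i1, i2, chunks): the row range and the evaluated arrays,
# preparing the next chunk in the background while you work on the current one
for i1, i2, chunks in df.evaluate_iterator(features, chunk_size=10_000):
    batch = np.stack(chunks, axis=1)   # shape: (rows_in_chunk, n_features)
    for row in batch:                  # sequential, row by row, for a stateful model
        predictions.append(model.transform(row))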

I hope this clarifies things a bit, and I hope the sklearn wrapper helps you wrap your own model.

Regards,

Maarten

mujina93 commented 4 years ago

Thank you @maartenbreddels!

The question about types is answered, thank you.

I'm wondering why you need Vaex to do this, since all operations will be done outside of Vaex (your model code), Vaex will not speed up anything. The only thing you can gain from this is saving memory in the pre-processing or the auto pipelines.

Yes, you are right. I didn't specify that my requirement is exactly to be able to scale up in terms of memory, at the cost of increased processing time. And I am trying several solutions to keep this processing time as low as possible.

In my application I access data with different patterns. Sometimes I need to work on columns (think preprocessing columns), other times I need random access of rows (think training on random horizontal chunks of the dataset - the rows are consecutive in a chunk, but the chunk may come from anywhere in the dataset).

It's difficult to find a solution that behaves optimally for such different access patterns while still keeping memory usage down. In the end I developed my own solution, but yesterday I found out about vaex, and I wanted to give it a try since it seems promising. (The solutions you employ are similar to the ones I used, but I am much more comfortable relying on a well-developed and supported project like yours. That's why I'd like to see if it's possible to use vaex for my use cases.)

Regarding my need to go over rows in random chunks, it seems that the only performant way would be to bring the chunk into memory ("materialize it"; I'm not sure if that's the right term within your framework) and then iterate over its rows. I'd try something like row slicing to access a chunk, and then bring the chunk into memory somehow.

How fast would random slicing be compared to iterating over sequential chunks? I.e. doing df[start:end] versus using evaluate_iterator. I assume random access is as fast as it can get, since you are memory-mapping files under the hood.

And what would be the advised approach to materialize a full dataframe (chunk)? I'm a bit confused as to what I should use among the different DataFrame classes (DataFrame, DataFrameLocal, and so on).

Thank you again for your quick answers!

(Perhaps I should have split my different questions into different issues. I'm sorry if this seems a bit of a potpourri)

maartenbreddels commented 4 years ago

evaluate_iterator is a bit special because it will prepare the next chunk in parallel before returning you the current chunk. This is basically how we managed to do https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385: sklearn is learning on a chunk while vaex prepares the next chunk in a separate thread.

df[i1:i2] only makes a shallow copy, so it doesn't do much besides some bookkeeping. df.materialize can be useful to get faster random access. After materializing, accessing the data like df['x'].values is free; it will just give you a reference to the numpy or arrow data.
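
For example (file and column names hypothetical):

import vaex

df = vaex.open("data.hdf5")    # hypothetical memory-mapped file
start, end = 1_000, 2_000      # bounds of a random chunk

chunk = df[start:end]          # shallow copy: cheap, just bookkeeping
chunk = chunk.materialize()    # evaluate the columns into in-memory arrays
x = chunk["x"].values          # free: a reference to the numpy/arrow data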

DataFrame is the base class of both DataFrameLocal and DataFrameRemote. DataFrameLocal is the equivalent of pandas' DataFrame.

(Perhaps I should have split my different questions into different issues. I'm sorry if this seems a bit of a potpourri)

No worries, I hope this sheds some light on Vaex, feel free to ask for more clarifications, as I may have missed some of your questions.

JovanVeljanoski commented 2 years ago

Closing as stale. Please re-open if needed.