pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License
1.37k stars 171 forks

Auto-wrapper for vaex? #1026

Open ericmjl opened 2 years ago

ericmjl commented 2 years ago

Brief Description

I recently read the vaex docs and it looks quite promising for highly scalable dataframe computation. I'd like to kickstart a discussion on what it might take to support vaex with pyjanitor.

I also want to make explicit that we don't have to decide yes or no on this idea!

The vaex docs are available here. Because a lot of our underlying codebase operates on the pandas API, and vaex is supposed to be dataframe API compatible, it appears to me that we should be able to automagically wrap functions in our top-level API and have them "just work" under the df.func namespace.
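For illustration, here is a minimal sketch of what "auto-wrapping" could look like, assuming plain attribute assignment is enough to expose a function as a method. `ToyFrame` and `auto_wrap` are hypothetical stand-ins, not vaex, pandas, or pyjanitor APIs, and whether a real vaex DataFrame class tolerates this kind of patching is untested here.

```python
from typing import Callable, Iterable


def auto_wrap(frame_type: type, functions: Iterable[Callable]) -> None:
    """Attach each function as a method on frame_type (hypothetical helper)."""
    for func in functions:
        setattr(frame_type, func.__name__, func)


class ToyFrame:
    """Stand-in for a dataframe class; not a vaex or pandas type."""

    def __init__(self, columns: dict):
        self.columns = dict(columns)


def clean_names(df: ToyFrame) -> ToyFrame:
    """Lowercase column names and replace spaces with underscores."""
    return ToyFrame(
        {name.strip().lower().replace(" ", "_"): values
         for name, values in df.columns.items()}
    )


def then(df, func):
    """Apply func to the frame and return its result, so it can sit in a chain."""
    return func(df)


# Register every top-level function on the frame class in one pass.
auto_wrap(ToyFrame, [clean_names, then])

df = ToyFrame({"First Name": ["a"], "AGE": [1]})
result = df.clean_names().then(
    lambda d: ToyFrame({k: v for k, v in d.columns.items() if k != "age"})
)
```

The point of the sketch is only that the functions themselves stay backend-agnostic; the open question in this thread is whether their bodies would run unchanged against vaex.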

thatlittleboy commented 2 years ago

I'm no expert on Vaex (I've had it in my sights for a while now but never had the time to explore it much). On a general note, there are a couple of things I see as pertinent to making pyjanitor work with vaex:

Let me know if I'm off-course on this! Overall, at first glance, I'm not opposed to the idea; I just want to flesh out first how well this change will gel with pyjanitor's existing functionality.

A nice first step, I think, would be to have a working POC for a very simple pyjanitor function (say, shuffle? or also/then?) implemented for a generic Vaex df. The ideal scenario would be for the pyjanitor API to be identical across the two dataframe types, with only a subset of functions supported for Vaex dfs.

ericmjl commented 2 years ago

All-round great pointers, @thatlittleboy!

Yes, I agree a small prototype might be a good starting point. I might take my time on this one, as it is fairly low-priority in the grand scheme of things; our effort on #972 is currently more important.

On the specific questions you raised:

We need a way to add accessors / methods on the Vaex DataFrame in the same way that pandas_flavor does for the pandas DataFrame, so that the end-user can do df.clean_names().remove_empty() and chain pyjanitor methods, regardless of whether they are using a Vaex or pandas df. There is this section on "Extending Vaex" in the docs, but it doesn't seem relevant (?)

vaex's extension API will register functions under the df.func namespace, rather than the df namespace. ("namespace" is a rather generic term here, I guess.) If we don't like the df.func.<some_function> API, we may need to implement our own wrapper, like what's done for the Spark API.

How about the completeness of the Vaex API compared to pandas? I'm not sure if this is covered somewhere in the docs (e.g. I know libraries like dask/modin/koalas have some notion of "API coverage"; I wonder what the number is for vaex). I ask this because for some pyjanitor functions, we rely heavily on specific pandas functions (factorize, cut etc.).

Not sure here either. I guess a prototype built on the most idiosyncratic pandas functions would be the way to find out!

thatlittleboy commented 2 years ago

From what I see, functions registered under df.func.<some_function> via @register_function seem to accept only expressions / arrays, which seems overly restrictive for what we're trying to do in pyjanitor, e.g. for pyjanitor functions that work on a group of user-defined columns. Can Vaex's register_function decorator accept a function like def func(*args)?

Ah, maybe the dataframe accessors might work. Worth a shot regardless; I can't quite tell just by looking 😄
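To make the accessor idea concrete, here is a rough sketch of a namespace accessor built with a plain Python descriptor. This is not vaex's own accessor API; `register_accessor`, `ToyFrame`, and `JanitorAccessor` are hypothetical names, and the toy methods only imitate the shape of pyjanitor functions. The relevant property is that accessor methods receive the whole frame, so a function taking a group of user-defined columns via `*names` is not a problem.

```python
class _AccessorDescriptor:
    """Build a fresh accessor bound to the frame on each attribute access."""

    def __init__(self, accessor_cls):
        self._accessor_cls = accessor_cls

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: expose the accessor class.
            return self._accessor_cls
        return self._accessor_cls(instance)


def register_accessor(frame_type, name):
    """Expose the decorated class as frame_type.<name> (hypothetical helper)."""
    def decorator(accessor_cls):
        setattr(frame_type, name, _AccessorDescriptor(accessor_cls))
        return accessor_cls
    return decorator


class ToyFrame:
    """Stand-in for a dataframe class; not a vaex or pandas type."""

    def __init__(self, columns):
        self.columns = dict(columns)


@register_accessor(ToyFrame, "janitor")
class JanitorAccessor:
    def __init__(self, df):
        self._df = df

    def remove_empty(self):
        """Drop columns whose values are all None."""
        kept = {
            name: values
            for name, values in self._df.columns.items()
            if any(value is not None for value in values)
        }
        return ToyFrame(kept)

    def select(self, *names):
        """Keep only the named columns, in the order given."""
        return ToyFrame({name: self._df.columns[name] for name in names})


df = ToyFrame({"a": [1, None], "b": [None, None], "c": [2, 3]})
cleaned = df.janitor.remove_empty().janitor.select("c", "a")
```

The cost is the extra `.janitor` hop in every chain step, which differs from the flat `df.clean_names()` style pyjanitor offers on pandas.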

Agreed on the need to wrap, if we somehow get the Vaex extensions to work. I'm strongly of the opinion that we need to keep the internal (pyjanitor) API consistent, regardless of the DataFrame type.
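One way to keep a single API across DataFrame types, sketched under the assumption that dispatching on the frame's class is acceptable, is `functools.singledispatch` from the standard library. `PandasLikeFrame`, `VaexLikeFrame`, and the two toy functions below are hypothetical stand-ins rather than real pandas, vaex, or pyjanitor objects. A function with no implementation for a given backend fails loudly, which matches the "only a subset supported" scenario.

```python
from functools import singledispatch


class PandasLikeFrame:
    """Stand-in for a pandas DataFrame."""

    def __init__(self, columns):
        self.columns = dict(columns)


class VaexLikeFrame:
    """Stand-in for a vaex DataFrame."""

    def __init__(self, columns):
        self.columns = dict(columns)


@singledispatch
def rename_column(df, old, new):
    raise NotImplementedError(
        f"rename_column is not supported for {type(df).__name__}"
    )


# One implementation shared by both backends.
@rename_column.register(PandasLikeFrame)
@rename_column.register(VaexLikeFrame)
def _rename(df, old, new):
    renamed = {new if key == old else key: values
               for key, values in df.columns.items()}
    return type(df)(renamed)


@singledispatch
def factorize_column(df, name):
    raise NotImplementedError(
        f"factorize_column is not supported for {type(df).__name__}"
    )


# Registered for the pandas-like backend only.
@factorize_column.register(PandasLikeFrame)
def _factorize(df, name):
    codes = {}
    # Assign each distinct value the next integer code, in first-seen order.
    encoded = [codes.setdefault(value, len(codes)) for value in df.columns[name]]
    return PandasLikeFrame({**df.columns, f"{name}_enc": encoded})


pandas_like = factorize_column(PandasLikeFrame({"x": ["b", "a", "b"]}), "x")
vaex_like = rename_column(VaexLikeFrame({"x": [1]}), "x", "y")
```

The user-facing names stay the same for both frame types, and the per-backend differences live in the registered implementations.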