pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of the R package janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License
1.35k stars 169 forks

[INF] Tracking progress of pyspark submodule #547

Open · ericmjl opened this issue 5 years ago

ericmjl commented 5 years ago

@zjpoh, I wanted to check with you: to what extent would you like to build out the spark submodule before you'd be comfortable with a release going out?

zjpoh commented 5 years ago

I would prefer to go through the functions in janitor.functions and list all the functions that make sense to add to the pyspark submodule. Then we can decide which ones should be in the next release.

@ericmjl What do you think about this?

@anzelpwj please let me know what you think about this as well. Thanks~
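
For context, here is a minimal sketch of what porting one janitor.functions-style method to pyspark might look like. This is a hypothetical illustration, not the actual submodule code: it assumes we simply attach the function to pyspark.sql.DataFrame, and the clean_names shown here is only a bare-bones column-renaming example.

```python
# Hypothetical sketch of porting a janitor.functions-style method to pyspark.
# Names and the registration strategy are assumptions for illustration only.
import re

from pyspark.sql import DataFrame


def clean_names(df: DataFrame) -> DataFrame:
    """Minimal pyspark take on janitor's clean_names: lowercase column
    names and replace non-alphanumeric characters with underscores."""
    for old in df.columns:
        new = re.sub(r"\W+", "_", old.strip().lower())
        df = df.withColumnRenamed(old, new)
    return df


# Attach the function as a DataFrame method so it can be chained,
# mirroring how janitor registers methods on pandas DataFrames.
DataFrame.clean_names = clean_names
```

With something like that in place, spark_df.clean_names() would chain the same way the pandas version does; a real submodule would presumably use a proper registration decorator rather than direct attribute assignment.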

ericmjl commented 5 years ago

This makes sense to me! Let's go with the plan. :smile:

zjpoh commented 5 years ago

Here are the functions that I think are reasonable to add. Items marked with ?? are ones whose purpose I can't quite work out from the docs alone.

For the next version, I do not have a strong preference about what needs to be in it. I'm okay with just shipping the current functionality plus making pyspark an optional dependency. However, I do think that `then` is very useful, because it lets users chain arbitrary functions, but I'm not sure whether it will work in a distributed setting (see the sketch after the list below).

Modify columns

Modify values

Filtering

Preprocessing

Other
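
On the "making pyspark optional" point above, here is a minimal sketch of the usual optional-dependency pattern together with a `then`-style helper for chaining arbitrary functions on a pyspark DataFrame. This is an assumption about how it could be done, not how the submodule actually does it.

```python
# Hypothetical sketch: treat pyspark as an optional dependency and provide a
# then()-style chaining helper. Names here are illustrative assumptions.
try:
    from pyspark.sql import DataFrame
except ImportError:  # pyspark not installed; the spark submodule stays inert
    DataFrame = None


def then(df, func):
    """Apply an arbitrary function to the DataFrame and return the result,
    so custom steps can slot into a method chain."""
    return func(df)


if DataFrame is not None:
    # Register only when pyspark is available.
    DataFrame.then = then
```

In principle, chaining via `then` is just driver-side function composition, and pyspark DataFrame transformations stay lazy inside Spark's plan, so the chaining mechanism itself should be fine in a distributed setting; whether a given chained function behaves well there depends on what that function does.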

ericmjl commented 5 years ago

@zjpoh at the request of a colleague, I'm going to release 0.18.2 soon (nothing seems to be backwards-incompatible, so it's just a patch release).

I'm going to rename this issue to track progress on the pyspark submodule. Having thought about it a bit more, I think it's okay to keep releasing the pyspark additions gradually and then, at some release point, make a "big splash" announcement on Twitter and in the README. Hope you're okay with that?