Open ericmjl opened 5 years ago
I would prefer to go through the functions in janitor.functions, then list all the functions that make sense to add to the pyspark submodule. Then we can decide which ones should go into the next release.
@ericmjl What do you think about this?
@anzelpwj please let me know what you think about this as well. Thanks~
This makes sense to me! Let's go with the plan. :smile:
Here are the functions that I think are reasonable to add. Items marked ?? are ones where I can't tell exactly what the function does from the docs alone.
For the next version, I do not have a strong preference on what needs to be in it. I'm okay with just having the current version plus making pyspark optional. However, I do think that then is very useful, because with it users can chain any method, but I'm not sure whether it will work in a distributed system.
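On the question of whether then would work in a distributed setting, here is a minimal sketch, assuming then does nothing more than apply a user-supplied function to the DataFrame and return the result. Under that assumption the helper itself never touches the data, so with Spark it would only add to the lazy query plan. The type variable and signature below are illustrative, not the pyjanitor implementation.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def then(df: T, func: Callable[[T], T]) -> T:
    """Apply func to df and return the result so it can sit inside a chain.

    Nothing here is Spark-specific: if func only calls DataFrame
    transformations (filter, select, ...), no data is processed until
    an action such as .count() or .collect() is triggered.
    """
    return func(df)


# Usage with a hypothetical Spark DataFrame `spark_df`:
#   result = then(spark_df, lambda d: d.filter("x > 0"))
```

PySpark 3.0 and later also ship DataFrame.transform(func), which covers the same use case natively.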
expand_column
deconcatenate_column
limit_column_characters: ??
row_to_names: ??
clean_names
join_apply: ??
find_replace: Not needed. Can use .selectExpr("REPLACE(...)") or .selectExpr("REGEXP_REPLACE(...)") directly.
round_to_fraction: Use .selectExpr("ROUND(...)") if rounding to a decimal place; otherwise it needs to be implemented.
update_where: Already method-chainable with .selectExpr("CASE WHEN cond THEN val ELSE col END AS col"), but it might make sense to add a function that generates the above string when cond, val, and col are given.
dropnotnull
get_dupes
bin_numeric
encode_categorical
impute
label_encode
min_max_scale
get_features_targets
then
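For the update_where and round_to_fraction notes in the list above, here is a rough sketch of what such string-generating helpers might look like. The names case_when_expr and round_to_fraction_expr are hypothetical and not part of pyjanitor, and the sketch assumes cond, val, and col are trusted SQL fragments that get pasted in verbatim (a string literal for val would need its own quotes).

```python
def case_when_expr(cond: str, val: str, col: str) -> str:
    """Build the CASE WHEN string for a conditional update of `col`.

    Where `cond` holds, the column takes `val`; otherwise it keeps
    its current value. The result is meant to be passed to selectExpr.
    """
    return f"CASE WHEN {cond} THEN {val} ELSE {col} END AS {col}"


def round_to_fraction_expr(col: str, denominator: int) -> str:
    """Build an expression that rounds `col` to the nearest 1/denominator.

    Assumption: scaling by the denominator, rounding to an integer,
    and scaling back is an acceptable way to round to a fraction.
    """
    if denominator <= 0:
        raise ValueError("denominator must be a positive integer")
    return f"ROUND({col} * {denominator}) / {denominator} AS {col}"


# Usage with a hypothetical Spark DataFrame `df` that has columns a and b:
#   df.selectExpr("a", case_when_expr("a > 1", "0", "b"))
#   df.selectExpr("a", round_to_fraction_expr("b", 4))
```

Because the helpers only return strings, they can be unit-tested without a Spark session, and the actual work stays inside selectExpr.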
@zjpoh on request from a colleague, I'm going to release 0.18.2 soon (nothing seems to be backwards-incompatible, so it's just a patch release).
I'm going to rename this issue to track the pyspark module progress. Having thought a bit more, I think it's okay to keep gradually releasing the pyspark additions and then, at some release point, just make a "big splash" announcement on Twitter and the README. Hope you're ok with that?
@zjpoh, wanted to check with you, to what extent would you like to build out the spark submodule before you'd be comfortable with a release being put out?