Open bcottman opened 3 years ago
@bcottman thank you for chiming in!
I had a quick scan through your article. It seems like there might be existing APIs that support some of the things you're thinking of. For example, changing types is available under the .change_type() function.
Reviewing PRs takes some time and headspace, so the smaller the PR, the easier it is to review. From what you've written, it sounds like there might be a large number of PRs coming our way. Would you be kind enough to help us out a bit: could you comment below on what you're planning to PR, in order of priority, and possibly identify whether there's an existing function that could be improved in lieu of adding a new subpackage? In terms of priority, I would suggest starting with something that truly doesn't exist yet.
I also noticed the use of camel case in your article. Please watch out for those - we stick to snake case in pyjanitor!
Yes. I will review the existing pyjanitor API (I did already, but I must have missed it) and write a PR for each function.
Before I write N (where N > 20) PRs for some additions, I would like to get feedback on the changes I have identified as necessary to fit into your package.
Need to transform from -> to
========================
camel case -> snake case
docstring format: Google -> original style (not sure what to call this style; original Sphinx?)
package error janitor_Error -> JanitorError
(YEAH!) pytest -> pytest
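The camel -> snake change in the list above is mechanical enough to script. Here is a minimal sketch using only the standard library; the helper name camel_to_snake is hypothetical and not part of pyjanitor, and the inputs are the function names from the article as quoted in this thread.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase or PascalCase identifier to snake_case.

    Hypothetical helper for illustration only; not a pyjanitor function.
    """
    # Put an underscore between a lowercase letter (or digit) and the
    # uppercase letter that follows it: "toCategory" -> "to_Category".
    step = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalized word:
    # "HTMLParser" -> "HTML_Parser".
    step = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step)
    return step.lower()


old_names = ["toDatetimeComponents", "toContinuousCategory", "toCategory"]
new_names = [camel_to_snake(n) for n in old_names]
print(new_names)
```

Names that are already snake case, such as binary_value_to_integer, pass through unchanged.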
from ericmji: "improved in lieu of adding a new subpackage?"
reply: bcottman: I think you may want more than one new subpackage, for example Spark. Most of my functions fall into one of four categories:
(1) cleaners, which probably integrate with pyjanitor functions. I will look.
(2) changing any data type into categorical (sd == structured data). Why? To get ready for entity embedding encoding.
(3) scalers, arguably part of scikit-learn.
(4) encoders, arguably part of scikit-learn.
P.S. Willing to create a series for pyjanitor on Medium.
Thanks, @bcottman!
With respect to the style, yes, I think you've identified the most important ones. If there are other smaller ones that are either automatically detected by our continuous integration checks or are uncovered when we do code review, I hope you don't mind changing them too :smile:.
I looked again at your article on TDS, and I think there may be some good functions to prioritize. If you're okay with it, I'd like to propose the following order of PRs:
1. datetime_components (a proposed name for toDatetimeComponents) inside functions.py. This one is not yet covered by pyjanitor and I think would be valuable!
2. bin_numeric such that one can bin a numeric column into arbitrary percentiles. Providing that general case would provide a base for later adding on keywords for binning by decile, quartile, and quintile.

The following look like they have overlaps inside the pyjanitor library already, which means I think you might not need to submit a full PR for a new function.

- bin_numeric looks like it maps to toContinuousCategory, but only naively. As mentioned above, I think there could be an improvement PR there.
- change_type(column_name, pd.Categorical), I think, does what toCategory intends to do.
- change_type(column_name, int) should be able to do binary_value_to_integer.
- pd.DataFrame constructor should be able to cover to_DataFrame, unless there's a use case I wasn't thinking of.
- clean_names with the truncate parameter, I think, covers toColumnNameFixedLen.
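To make the overlaps above concrete, here is a minimal pandas-only sketch of the same conversions. It assumes a toy DataFrame with made-up column names, and it calls astype and pd.qcut directly rather than pyjanitor's change_type or bin_numeric, so it illustrates the behaviour under discussion and not the library's implementation.

```python
import pandas as pd

# Toy data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "age": [21, 35, 47, 52, 68, 73, 29, 40],
        "flag": [True, False, True, True, False, False, True, False],
        "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
    }
)

# toCategory-style conversion: turn a column into a categorical dtype.
df["city"] = df["city"].astype("category")

# binary_value_to_integer-style conversion: booleans become 0/1 integers.
df["flag"] = df["flag"].astype(int)

# Binning by quartile: four equal-frequency bins, returned as integer codes.
df["age_quartile"] = pd.qcut(df["age"], q=4, labels=False)

# The general case: bin edges given as arbitrary percentiles.
df["age_bin"] = pd.qcut(df["age"], q=[0, 0.1, 0.5, 0.9, 1.0], labels=False)

print(df)
```

Passing q as a list of quantiles is the general case; decile, quartile, and quintile binning are the special cases q=10, q=4, and q=5.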
That said, if you see improvements that could be made, we should talk about them.
What are your thoughts on my proposal above? If you're agreeable to it, I'd love to review the two PRs, one for datetime_components and one for improvements to bin_numeric.
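As a starting point for discussion, datetime_components might look something like the sketch below. The signature, the components parameter, and the output column naming are all assumptions made for illustration; nothing here is an existing pyjanitor API.

```python
import pandas as pd


def datetime_components(df, column_name, components=("year", "month", "day")):
    """Add one column per requested datetime component.

    Hypothetical sketch: the signature and the "<column>_<component>"
    naming scheme are assumptions, not an agreed design.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    values = pd.to_datetime(out[column_name])
    for part in components:
        # Each component is read from the pandas .dt accessor,
        # e.g. values.dt.year or values.dt.hour.
        out[f"{column_name}_{part}"] = getattr(values.dt, part)
    return out


df = pd.DataFrame({"ts": ["2019-01-15 08:30:00", "2020-12-31 23:59:00"]})
result = datetime_components(df, "ts", components=("year", "month", "day", "hour"))
print(result)
```

Returning a copy keeps the function safe to chain; whether the real function should mutate in place instead is a design choice for the PR review.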
At face value, the proposal looks fine. If I run into any problems, I will raise them on this thread.
Brief Description
I would like to propose... contains also # Example API
https://towardsdatascience.com/six-datatype-transformer-functions-for-data-pre-processing-for-machine-learning-eb9abcce68cd
I will be running the contrib gauntlet. I have about 10 more. I will check periodically to see whether you have merged. Send me feedback before that, if you want.