Open thatlittleboy opened 2 years ago
@thatlittleboy your thoughts on encapsulation to enforce order sound like the right thing to do.
I'd admit I'm not so well-versed in the `adorn_*` family of functions in `janitor`, so I'll hold off on commenting on their specific behaviour. That said, I am in favour of adding in `janitor` functionality into `pyjanitor`, and I'm also in favour of your way of thinking about how to organize the functions in a sane fashion too. :smile:
Great, thanks for the affirmation @ericmjl . I'll have a think about the desired API and propose something in a PR when I'm ready. :)
Hello, my name is Sabrina, and I’m excited about the opportunity to contribute to the pyjanitor project. I have been exploring it and found several issues that align with my skills. I would love to be assigned to one or more issues, starting with this one. Please let me know how I can help.
Thank you
Hi @Sabrina-Hassaim, welcome! I am going to tag @samukweku, he’s been super active here as a core contributor to pyjanitor and has more context than I. Meanwhile, can I ask, what are your goals for contributing? Want to see how we can best support you as you make your contributions!
Hello @ericmjl, thank you for your response. I’m currently working on an academic project where I need to contribute to open-source projects by resolving issues. Given my background in data analysis and my experience with libraries like Pandas, I found it fitting to contribute to PyJanitor, as it aligns with my skill set.
hi @Sabrina-Hassaim please feel free to contribute; i suggest you have a look at the development guide. looking forward to your PR.
Brief Description

There are a few `adorn_*` functions from R's janitor that are not yet ported over to pyjanitor. Janitor docs here.

I'm specifically looking at:

- `adorn_totals`: adds a "total" row, a "total" column, or both
- `adorn_percentages`: converts the cell values into percentages, calculated along either axis or over the entire dataframe. In the R formulation, these are floats between 0 and 1, not the 0-100 percentages.
- `adorn_pct_formatting`: formats the 0 to 1 values into the 0 to 100 percentage values, with rounding/formatting options
- `adorn_ns`: adds the raw counts back into the cell values (meant to be run after `adorn_percentages`), so each cell has both percentage & count info, like "56 (24.3%)" for example.

I imagine these might be particularly useful for those doing data reporting. These should go into the `functions` module.

Example API
In pyjanitor, I don't think having four separate functions works (how would we enforce that `adorn_ns` comes after `adorn_percentages`? And where would we get the counts required for `adorn_ns`? etc.).

Perhaps we could just do an `adorn_totals` and an `adorn_percentages` (which encapsulates the behaviour of `adorn_pct_formatting` and `adorn_ns` as well, controlled via function parameters).

adorn_totals
This function should mirror the R function almost 1-1.

A few points I disagree(?) with the R implementation:

- `na.rm` parameter for this, but I somehow feel this isn't necessary.
- `where` parameter, as defined by the R implementation, is to dictate whether to add a Totals "row" or "col"; as opposed to doing the summation over "row"/"col". In the latter case, `where="row"` would add a new column containing the Totals across the rows (which to me is more natural). I'm calling this parameter `axis` here btw.

adorn_percentages
TBD. Let me have a little think about this over the weekend, I decided against my own implementation idea while writing out the example API.. ><
Original idea
```python
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_percentages(
...     subset=None,  # similar to `adorn_totals`
...     axis='col',  # similar to `adorn_totals`
...     adorn_count=True,
...     count_position='front',  # ignored if adorn_count=False
...     count_format=0,  # ignored if adorn_count=False
...     percentage_format=2,
... )
            a  b
0  6 (70.59%)  x
1         nan  y
2  3 (29.41%)  z
```

Parameters:

- `count_position`: whether to put the count in front ("56 (23.4%)") or at the back ("23.4% (56)")
- `count_format` / `percentage_format`: if an int, the number of decimal places to round to; otherwise a string format specification like ':,.2f' or whatever.

I'm not that sold on this API yet. It doesn't look too clean / friendly to use. After all, it is an amalgamation of 3 different behaviours in 1 function 😅. Would be happy to hear comments / suggestions to improve, if any.
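To make the discussion a bit more concrete, here is a rough plain-pandas sketch of the two behaviours described in this issue. Everything in it is an assumption for illustration only: `totals_sketch`, `percentages_sketch`, and their parameter names (`count_decimals`, `pct_decimals`, the `"all"` axis option) are hypothetical and are not part of pyjanitor or R's janitor. It follows the `axis` reading proposed above, where `axis="row"` sums across each row and `axis="col"` sums down each column.

```python
import numpy as np
import pandas as pd


def totals_sketch(df, axis="row", name="Total"):
    """Hypothetical stand-in for the proposed adorn_totals."""
    numeric = df.select_dtypes("number")
    if axis == "row":
        # Sum across each row; the result becomes a new column.
        return df.assign(**{name: numeric.sum(axis=1)})
    if axis == "col":
        # Sum down each column; the result becomes a new row.
        # Non-numeric columns get a missing value in that row.
        total_row = numeric.sum(axis=0).to_frame(name).T
        return pd.concat([df, total_row])
    raise ValueError("axis must be 'row' or 'col'")


def percentages_sketch(df, axis="col", adorn_count=True,
                       count_position="front",
                       count_decimals=0, pct_decimals=2):
    """Hypothetical stand-in for the proposed adorn_percentages."""
    numeric = df.select_dtypes("number")
    # pandas sums skip NaN by default, so no na.rm-style switch is used here.
    if axis == "col":
        fractions = numeric.div(numeric.sum(axis=0), axis=1)
    elif axis == "row":
        fractions = numeric.div(numeric.sum(axis=1), axis=0)
    elif axis == "all":
        fractions = numeric / numeric.sum().sum()
    else:
        raise ValueError("axis must be 'row', 'col' or 'all'")

    def fmt(count, frac):
        if pd.isna(count):
            return np.nan  # leave missing cells missing
        pct = f"{frac:.{pct_decimals}%}"
        if not adorn_count:
            return pct
        # format() rounds ties to even, so a count of 2.5 renders as "2".
        cnt = f"{count:.{count_decimals}f}"
        if count_position == "front":
            return f"{cnt} ({pct})"
        return f"{pct} ({cnt})"

    out = df.copy()
    for col in numeric.columns:
        out[col] = [fmt(c, f) for c, f in zip(numeric[col], fractions[col])]
    return out


counts = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
with_row_totals = totals_sketch(counts, axis="row")  # adds a "Total" column
with_col_totals = totals_sketch(counts, axis="col")  # adds a "Total" row

df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
result = percentages_sketch(df, axis="col")
```

Keeping the raw counts and the fractions inside one function is what sidesteps the ordering problem: the counts are still available when the percentage strings are built, so nothing needs to be threaded between separate calls.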