okfn-brasil / serenata-toolbox

📦 pip module containing code shared across Serenata de Amor's projects | ** This repository does not receive frequent updates **
MIT License
154 stars 69 forks

Reimbursements cleaner tuning #191

Closed viniciusartur closed 6 years ago

viniciusartur commented 6 years ago

What is the purpose of this Pull Request?

Improve the performance of the reimbursements cleaning stage. Running this stage in Rosie for chamber_of_deputies took 7 minutes instead of 53 minutes.

What was done to achieve this purpose?

I changed the pandas function that aggregates the values, replacing pandas.core.groupby.GroupBy.apply(), which is generally slower than the other groupby methods, with pandas.core.groupby.DataFrameGroupBy.agg().
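
To illustrate the kind of change described above, here is a minimal sketch on toy data. The column names (`document_id`, `total_net_value`, `reimbursement_number`) and the aggregation functions are assumptions chosen for illustration; they are not taken from the toolbox's actual cleaner, whose diff is not shown here.

```python
import pandas as pd

# Hypothetical toy data; the schema is illustrative only.
df = pd.DataFrame({
    "document_id": [1, 1, 2, 3, 3, 3],
    "total_net_value": [10.0, 5.0, 7.5, 1.0, 2.0, 3.0],
    "reimbursement_number": ["a", "b", "c", "d", "e", "f"],
})


def aggregate_group(group):
    # Runs once per group in Python and builds a new Series each time,
    # which is the main reason apply() tends to be slow on many groups.
    return pd.Series({
        "total_net_value": group["total_net_value"].sum(),
        "reimbursement_numbers": ", ".join(group["reimbursement_number"]),
    })


# Before: one Python-level call per group via apply().
columns = ["total_net_value", "reimbursement_number"]
slow = df.groupby("document_id")[columns].apply(aggregate_group)

# After: agg() with one aggregation per column; built-in reducers such
# as "sum" can use pandas' optimized code paths.
fast = df.groupby("document_id").agg(
    total_net_value=("total_net_value", "sum"),
    reimbursement_numbers=("reimbursement_number", ", ".join),
)

# Both approaches should yield the same values per document.
same_values = (
    slow["total_net_value"].astype(float).tolist()
    == fast["total_net_value"].tolist()
    and slow["reimbursement_numbers"].tolist()
    == fast["reimbursement_numbers"].tolist()
)
```

The speed-up depends on the number of groups and on which aggregations are used, so the gain on real data should be measured rather than assumed.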

How to test if it really works?

Run Rosie (rosie.py run chamber_of_deputies) with each of the two pandas functions, compare the generated output (reimbursements-.csv files), and measure how long this stage takes.
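
The comparison described above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `timed` and `same_output` are hypothetical, the file paths are whatever each run produced, and the sort column must be one that exists in the generated files.

```python
import time

import pandas as pd
from pandas.testing import assert_frame_equal


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def same_output(path_a, path_b, sort_by):
    """Return True if two CSV files hold the same rows and columns.

    Rows are sorted by `sort_by` first, and column order is ignored,
    so only differences in the data itself are reported.
    """
    a = pd.read_csv(path_a).sort_values(sort_by).reset_index(drop=True)
    b = pd.read_csv(path_b).sort_values(sort_by).reset_index(drop=True)
    try:
        assert_frame_equal(a, b, check_like=True)
    except AssertionError:
        return False
    return True
```

For example, `timed` can wrap the cleaning step in each variant, and `same_output` can then be called on the two CSV files the runs produced, sorted by a column that uniquely identifies each row.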

Who can help reviewing it?

@cuducos @Irio @vitorbaptista

viniciusartur commented 6 years ago

I agree with your points. Thanks for reviewing.