jakevdp / PythonDataScienceHandbook

Python Data Science Handbook: full text in Jupyter Notebooks
http://jakevdp.github.io/PythonDataScienceHandbook
MIT License
43.34k stars, 17.96k forks

Should births_by_date be multiplied by 2? #83

Open pp611 opened 7 years ago

pp611 commented 7 years ago

In Chapter 3's "Pivot Tables" section and Chapter 4's "Text and Annotation" section, births by date are computed using:

  births_by_date = births.pivot_table('births', [births.index.month, births.index.day])

Should each value be multiplied by 2 since male and female births are counted on separate rows? So should it be:

  births_by_date = (births.pivot_table('births', [births.index.month, births.index.day])) * 2 

instead?

jakevdp commented 7 years ago

Thanks for the comment! I think you're right that it's off by a factor of 2, but a clearer way to fix that would be to use aggfunc='sum' rather than the default aggfunc='mean'.
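
For illustration, here is a minimal sketch of how the two aggfunc choices behave. It uses a made-up four-row frame (not the real births data) and groups on plain month/day columns instead of births.index.month and births.index.day, which is a simplification assumed here to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: one row per date and gender, as in the births table.
toy = pd.DataFrame({
    "year":   [2000, 2000, 2001, 2001],
    "month":  [1, 1, 1, 1],
    "day":    [1, 1, 1, 1],
    "gender": ["F", "M", "F", "M"],
    "births": [100, 110, 120, 130],
})

# Default aggfunc='mean' averages over rows, so each gender row counts separately.
mean_by_date = toy.pivot_table("births", ["month", "day"])

# aggfunc='sum' adds up every row, across genders and across years.
sum_by_date = toy.pivot_table("births", ["month", "day"], aggfunc="sum")

print(mean_by_date.loc[(1, 1), "births"])  # 115.0
print(sum_by_date.loc[(1, 1), "births"])   # 460
```

In this toy frame the actual daily totals for January 1 are 210 and 250, so the average births per day is 230. The default mean gives half of that, while the sum gives the total over all years rather than a per-day average.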

pp611 commented 7 years ago

Just changing aggfunc doesn't seem right either: it should group by date first, sum the two rows (male and female births) for each day, and then apply aggfunc='mean'. I could not figure out how to make a DataFrame out of births.groupby(['year', 'month', 'day']).sum() with the same year/month/day index. So multiplying the whole births_by_date by 2 seems the easiest way. Mathematically it is justifiable, though conceptually it could be better.
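
One possible way to do the two-step version described above is sketched here. It again uses a made-up frame rather than the real births data, and it assumes year, month and day are ordinary columns. Calling reset_index() after the groupby turns the group keys back into columns, so the result is a regular DataFrame that pivot_table can consume.

```python
import pandas as pd

# Hypothetical toy data: two years, two days, one row per gender.
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "month":  [1, 1, 1, 1, 1, 1, 1, 1],
    "day":    [1, 1, 2, 2, 1, 1, 2, 2],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "births": [100, 110, 200, 210, 120, 130, 220, 230],
})

# Step 1: collapse the gender rows into one total per calendar date.
# reset_index() turns the year/month/day group keys back into columns.
daily = toy.groupby(["year", "month", "day"])["births"].sum().reset_index()

# Step 2: average those daily totals across years for each (month, day).
births_by_date = daily.pivot_table("births", ["month", "day"])

# The shortcut from this thread: mean over the raw rows, times 2.
doubled = toy.pivot_table("births", ["month", "day"]) * 2

print(births_by_date.loc[(1, 1), "births"])  # 230.0
print(doubled.loc[(1, 1), "births"])         # 230.0
```

The two results agree here because every date has exactly two rows. If some dates had only one gender row, the shortcut and the two-step version would differ. If a datetime index is wanted on daily, pd.to_datetime(daily[['year', 'month', 'day']]) can build one from those columns, assuming every row holds a valid date.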