stefmolin / Hands-On-Data-Analysis-with-Pandas-2nd-edition

Materials for following along with Hands-On Data Analysis with Pandas – Second Edition
https://www.amazon.com/Hands-Data-Analysis-Pandas-visualization/dp/1800563450
MIT License
577 stars 1.38k forks

Potentially incorrect solution to Exercise 10.d of Chapter 4 #41

Closed mvirbicianskas closed 1 year ago

mvirbicianskas commented 1 year ago


Hi,

First of all, thank you for the brilliant book! I think I've come across a particularly tricky exercise solution. I've racked my brain over this, but the solution provided for the exercise seems to be incorrect.

The code:

exercise_solution = covid19\
    .pivot(columns='countriesAndTerritories', values='cases')\
    .drop(columns='China')\
    .apply(lambda x: x[x > 0].index.min())\
    .sort_index()\
    .rename(lambda x: x.replace('_', ' '))

reality_check = covid19[["countriesAndTerritories", "cases"]]\
    .query("countriesAndTerritories == 'China' and cases == 0")\
    .sort_index()

exercise_solution, reality_check.index

The result:

[screenshot of the output]

The interpretation

The exercise solution shows Afghanistan with 2020-02-25 as the first day on which China had 0 COVID cases, but running the "reality check" query yields no such date. I might be misinterpreting the exercise and the solution. Sorry, English is not my first language; please clarify if that is the case!

Running subsequent checks, i.e.:

covid19.loc["2020-02-25"].query("countriesAndTerritories == 'China' or countriesAndTerritories == 'Afghanistan'")

yields this result: [screenshot of the output], where we can see that China actually reported over 500 cases on that particular day.

The solution?

covid19[
    covid19.index.isin(covid19.query("countriesAndTerritories == 'China' and cases == 0").index)
    # parentheses are required here because & binds more tightly than >
    & (covid19.cases > 0)]\
        .loc[:, ["countriesAndTerritories"]]\
        .reset_index()\
        .groupby("countriesAndTerritories")\
        .date.min()

[screenshot of the output] A note about this solution: the result I get is wildly different from the one the book's solution provides. In a nutshell, I slice the dataset down to only the days on which China had 0 cases and other countries reported 1 or more cases. I'm pretty sure there's a more elegant solution; I'm just starting out with Python and pandas.
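One detail worth keeping in mind when combining conditions like this: in Python, `&` has higher precedence than `>`, so each comparison should be wrapped in parentheses before the masks are combined. Below is a minimal sketch of the slicing idea on made-up data. The `toy` frame and its values are assumptions for illustration only, not the book's dataset.

```python
import pandas as pd

# Hypothetical toy data (not the book's dataset), indexed by date like covid19
toy = pd.DataFrame(
    {
        "countriesAndTerritories": ["China", "Italy", "China", "Italy"],
        "cases": [0, 2, 5, 3],
    },
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ).rename("date"),
)

# Dates on which China reported zero cases
china_zero_dates = toy.query(
    "countriesAndTerritories == 'China' and cases == 0"
).index

# Each comparison is parenthesized before combining with &;
# without the parentheses, `a & b > 0` is parsed as `(a & b) > 0`
mask = toy.index.isin(china_zero_dates) & (toy.cases > 0)

# First such date per country
result = (
    toy[mask]
    .reset_index()
    .groupby("countriesAndTerritories")
    .date.min()
)
print(result)
```

On this toy data, only Italy has cases on a day when China reported none, so it is the only country left in `result`.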

Please let me know if I'm going crazy or whether I'm onto something. I'd be happy to spend some more time and open a PR with the correct solution, provided the maintainers and authors of the book approve it 🤚

stefmolin commented 1 year ago

Hi @mvirbicianskas - it appears that you have misinterpreted what the exercise was asking you to do. Rather than looking at the first time that other countries had cases while China didn't, the exercise is asking you to find the first time each country had a case (regardless of the number of cases China had that day) and then to exclude China from that result. So for example, we want the first day that Afghanistan had cases (2020-02-25), the first day that Albania had cases (2020-03-09), etc. In other words, there should be no dependence on the cases China reported for that day.
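To make the intended reading concrete, here is a minimal sketch on made-up data that follows the same pivot-and-apply pattern as the solution quoted earlier in this thread. The `toy` frame, its countries, and its values are assumptions for illustration only, not the book's dataset.

```python
import pandas as pd

# Hypothetical toy data (not the book's dataset): three dates, three countries
dates = pd.to_datetime(
    ["2020-01-01"] * 3 + ["2020-01-02"] * 3 + ["2020-01-03"] * 3
).rename("date")
toy = pd.DataFrame(
    {
        "countriesAndTerritories": ["China", "Italy", "Spain"] * 3,
        "cases": [3, 0, 0, 5, 2, 0, 0, 4, 1],
    },
    index=dates,
)

# One column per country, one row per date; then, for each country other
# than China, take the earliest date with a positive case count
first_case_dates = (
    toy.pivot(columns="countriesAndTerritories", values="cases")
    .drop(columns="China")
    .apply(lambda x: x[x > 0].index.min())
)
print(first_case_dates)
```

In this toy example, Italy's first date with cases is 2020-01-02, a day on which China reported 5 cases. China's own numbers never enter the calculation; the column is simply dropped from the result.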

mvirbicianskas commented 1 year ago

@stefmolin, I just re-read the exercise once more. I'm terribly sorry; I must have been tired or so hung up on my understanding of the exercise that I couldn't see past it. Thank you for clarifying!