swcarpentry / python-novice-gapminder

Plotting and Programming in Python
http://swcarpentry.github.io/python-novice-gapminder/
Other
162 stars 428 forks source link

Resolve error(s) in looping exercise solution. #670

Closed davidwilby closed 4 months ago

davidwilby commented 5 months ago

Hi, I think there is an error in the last exercise's solution in episode 14: Looping Datasets.

I don't think that this is associated with an open issue, sorry if I'm mistaken.

The last exercise "Comparing Data" has the following solution:

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`), 
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4] 
    dataframe.mean().plot(ax=ax, label=region)
plt.legend()
plt.show()

However, this results in an error at in the call to dataframe.mean(), eg.:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 dataframe.mean()

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11666, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11658 @doc(make_doc("mean", ndim=2))
  11659 def mean(
  11660     self,
   (...)
  11664     **kwargs,
  11665 ):
> 11666     result = super().mean(axis, skipna, numeric_only, **kwargs)
  11667     if isinstance(result, Series):
  11668         result = result.__finalize__(self, method="mean")

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12413, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12406 def mean(
  12407     self,
  12408     axis: Axis | None = 0,
   (...)
  12411     **kwargs,
  12412 ) -> Series | float:
> 12413     return self._stat_function(
  12414         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12415     )

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/generic.py:12370, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12366 nv.validate_func(name, (), kwargs)
  12368 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12370 return self._reduce(
  12371     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12372 )

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11535, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11531     df = df.T
  11533 # After possibly _get_data and transposing, we are now in the
  11534 #  simple case where we can use BlockManager.reduce
> 11535 res = df._mgr.reduce(blk_func)
  11536 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11537 if out_dtype is not None and out.dtype != "boolean":

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/managers.py:1501, in BlockManager.reduce(self, func)
   1499 res_blocks: list[Block] = []
   1500 for blk in self.blocks:
-> 1501     nbs = blk.reduce(func)
   1502     res_blocks.extend(nbs)
   1504 index = Index([None])  # placeholder

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/internals/blocks.py:404, in Block.reduce(self, func)
    398 @final
    399 def reduce(self, func) -> list[Block]:
    400     # We will apply the function and reshape the result into a single-row
    401     #  Block with the same mgr_locs; squeezing will be done at a higher level
    402     assert self.ndim == 2
--> 404     result = func(self.values)
    406     if self.values.ndim == 1:
    407         res_values = result

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/frame.py:11454, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11452         return np.array([result])
  11453 else:
> 11454     return op(values, axis=axis, skipna=skipna, **kwds)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):
    723     count = cast(np.ndarray, count)

File ~/miniconda3/envs/carp-py/lib/python3.11/site-packages/pandas/core/nanops.py:1686, in _ensure_numeric(x)
   1683 inferred = lib.infer_dtype(x)
   1684 if inferred in ["string", "mixed"]:
   1685     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1686     raise TypeError(f"Could not convert {x} to numeric")
   1687 try:
   1688     x = x.astype(np.complex128)

TypeError: Could not convert ['AustraliaNew Zealand'] to numeric

This should be corrected by using index_col="country" when reading the csv.

Secondly, there is another error caused due to the continent column being present in the gapminder_gdp_americas.csv (but not any of the others).

I appreciate that these may be deliberate mistakes, though I suspect not since this is within the solution.

github-actions[bot] commented 5 months ago

:ok: Pre-flight checks passed :smiley:

This pull request has been checked and contains no modified workflow files, spoofing, or invalid commits.

It should be safe to Approve and Run the workflows that need maintainer approval.

vahtras commented 5 months ago

Thank you! This used to be perfectly fine because pandas would silently ignore numeric operations, such as df.mean(), for columns where it did not make sense. Apparently, the approach is now to transform all columns to numeric format first, and then calculate the mean. I don't think this is a good idea, because when this fails, we get the traceback with TypeError we see.

Setting the index column to 'country' works for some files but one of the files (americas) also has the continent column (All values the same). This could be an oversight in the generation of the datafile, since the regional tables now do not follow the same pattern.

So what to do? Your suggestion together with cleaning up the data files would work. On the other hand life is full of unclean data so a more realistic approach could be to filter out the 'gdp' columns.

Something like

import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    region = filename.split('_')[-1][:-4]
    gdpdata = [c for c in dataframe.columns if c.startswith('gdp')]
    dataframe[gdpdata].mean().plot(ax=ax, label=region)
plt.legend()

which gives

gdp

Whoops, the x labels do not look good here either, they write into each other. We need to handle them somehow.

To summarise: great that you found this error! I myself haven't decided which solution is the best here and appreciate comments from other maintainers.

Olav

alee commented 5 months ago

Thank you for catching this @davidwilby and taking the time to document and submit a PR for this! Ditto @vahtras for the additional context.

I like the filtering approach and think it's useful. To muddy the waters even more here's two other ways we could do this:

  1. use dataframe.filter e.g.,

dataframe.filter(like="gdp").mean().plot(...)

to only compute the mean on columns with gdp in the name.

  1. set the numeric_only kwarg to True when we compute the mean, e.g.,

dataframe.mean(numeric_only=True).plot(...)

to only compute the mean on columns with numeric data

vahtras commented 5 months ago

Great options @alee! I did not know about filter...like. Both are fine with me, but I guess numeric_only is closest to the original solution, so I would prefer that in this case.

davidwilby commented 5 months ago

Glad to be of help!

Depending on how kind you want to be to learners, it may be worth including a note in the exercise description that there's a complication/gotcha/extra challenge in this one. Along with a comment in the solution pointing out the reason for this.

alee commented 4 months ago

closing this in favor of #674 - thanks!