vincentarelbundock / pycountrycode

GNU General Public License v3.0

Numpy and Pandas functionality for `codelist` #3

Closed frank113 closed 11 months ago

frank113 commented 1 year ago

Summary

One of the most compelling features of the R package countrycode is the ability to manually manipulate the codelist data frame for use in other projects. As presently constructed, the codelist variable exported from countrycode does not integrate with pandas or numpy. More specifically, the following script will fail when countrycode is installed via pip:

from countrycode import codelist

codelist.to_pandas()
codelist.to_numpy()

Potential Fixes

  1. Do not update dependencies, and update the documentation to indicate that numpy and pandas functionality requires installing those packages.
  2. Update the primary project dependencies in pyproject.toml to include pandas and numpy.
  3. Add a data dependencies section that includes numpy and pandas. With this approach a user can optionally install the additional packages:
pip install countrycode[data]
  4. Convert codelist to a pandas.DataFrame and substitute pandas for polars.

Of the potential options, I am partial to 3 and 4. Option 3 leaves the structure of the package untouched and leaves it to the user to install the additional dependencies if they wish to use the codelist data in a pandas or numpy environment. My predilection for option 4 stems from Python's reliance on pandas to manipulate data.
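A minimal sketch of how option 3 might look from inside the package, assuming the `data` extra named above. The helper `require_optional` is hypothetical and is not part of countrycode; it only illustrates failing with a message that points the user at the optional install.

```python
import importlib


def require_optional(module_name, extra="data"):
    """Import an optional dependency, or raise an error naming the extra.

    Hypothetical helper: both the function name and the default extra
    name are assumptions made for illustration.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install countrycode[{extra}]"
        ) from exc
```

A conversion method could then call `require_optional("pandas")` only when the user actually asks for a pandas object, so a plain install never needs pandas.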

Considerations

Other

Your statement in the last issue about "real work" resonated with me.

vincentarelbundock commented 1 year ago

That's a very good point!

A bit on my personal background: I basically quit doing statistical analysis in Python 10 years ago, mainly because I disliked pandas so much. The reason I'm doing these porting projects now is that I wanted an excuse to learn polars (which looks really great thus far).

All this to say that I would be very hesitant to rewrite the internals using pandas. However, documenting as you describe in option 1 and adding data dependencies as you describe in option 3 seem like excellent (and easy) ideas.

towr commented 1 year ago

If possible, I'd opt to not have a hard dependency on anything. The functionality of this package isn't so complex that it really needs dependencies like polars or pandas. For my use cases it would almost certainly be better to use plain Python dicts, because those are darn fast for lookups and don't have the overhead of moving into C or Rust. But having convenience functions to get a pandas or polars dataframe would be nice for people more comfortable with that. So my votes would also be for options 1 and 3, leaving pandas optional.
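A rough sketch of the dependency-free approach described here, assuming the code list is available as a list of row dicts. The three rows, the `name` column, and the function names `build_lookup` and `to_pandas` are made up for illustration and do not reflect the package's actual data layout.

```python
# Hypothetical three-row excerpt; the real code list has many more
# rows and columns.
ROWS = [
    {"iso3c": "CAN", "iso2c": "CA", "name": "Canada"},
    {"iso3c": "FRA", "iso2c": "FR", "name": "France"},
    {"iso3c": "JPN", "iso2c": "JP", "name": "Japan"},
]


def build_lookup(rows, origin, destination):
    """Build a plain dict mapping one column's values to another's."""
    return {row[origin]: row[destination] for row in rows}


def to_pandas(rows):
    """Convenience conversion; pandas is imported only when this is called."""
    import pandas as pd  # optional dependency, imported lazily

    return pd.DataFrame(rows)


iso3c_to_name = build_lookup(ROWS, "iso3c", "name")
```

Lookups such as `iso3c_to_name["FRA"]` then need nothing beyond the standard library, while `to_pandas(ROWS)` remains available for users who have pandas installed.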

vincentarelbundock commented 1 year ago

@towr I agree.

I am overloaded at work right now, so it will probably take a while before I have time to make this change.

I would be extremely happy to review a PR if anybody has the energy to make Pandas and Polars optional.

(If optional, it would be nice if the behaviour felt natural: the function returns the same kind of object as the input automatically.)
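One way this "same kind of object out as in" behaviour could be sketched, assuming a plain dict `mapping` from origin codes to destination values. The function name `convert` is hypothetical and this is not the package's implementation; the pandas and polars branches run only when the caller passes one of those objects, so neither library has to be installed otherwise.

```python
def convert(values, mapping):
    """Map codes through `mapping`, returning the same kind of object as the input.

    Hypothetical sketch: unmatched codes become None (NaN for pandas).
    """
    if isinstance(values, str):
        return mapping.get(values)
    if isinstance(values, (list, tuple)):
        return type(values)(mapping.get(v) for v in values)

    # Detect third-party containers by module name so that neither
    # library is imported unless the caller already uses it.
    top_level = type(values).__module__.split(".")[0]
    if top_level == "pandas":
        return values.map(mapping)  # pandas.Series.map accepts a dict
    if top_level == "polars":
        import polars as pl  # safe: the input already is a polars object

        return pl.Series(values.name, [mapping.get(v) for v in values.to_list()])
    raise TypeError(f"Unsupported input type: {type(values).__name__}")
```

For example, `convert(["CA", "FR"], {"CA": "Canada", "FR": "France"})` returns a list, while passing a tuple returns a tuple.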

vincentarelbundock commented 11 months ago

Thanks again for raising this issue.

In version 0.4.0 (on PyPI now), the pandas and polars dependencies are optional.