vincentarelbundock / pycountrycode

GNU General Public License v3.0
3 stars 3 forks source link

polars and pandas optionalization #6

Closed LamAdr closed 12 months ago

LamAdr commented 12 months ago

@vincentarelbundock

I defined custom GenericDataframe objects containing a data attributes which is a polars or pandas dataframe, or a python dict. Priority is to polars when available. Please tell me if you would prefer it not to be using objects.

If you like it, here are two things I would appreciate your input on:

  1. If an interesting part of the package is to be able to manipulate the codelist data directly, maybe it would be good to let the user decide which package to use for the data attribute when both are available?
  2. how should I modify the tests that presently rely on polars for dataframe manipulation. I see two options:
    • adding methods to GenericDataframe to uniformize testing. This means adding stuff unnecessary to users
    • rewriting these tests so that they only use built-in dictionaries

Thanks a lot

vincentarelbundock commented 12 months ago

Sorry I was not clear in explaining what was needed! I'm afraid I made you lose a bit of your time and effort...

It is not really useful to operate on different versions of codelist per se. If the user wants to do that, they can just call pd.read_csv() on our CSV directly and do whatever they want. It feels like introducing a new class and different attributes for different types of data frames is overkill and will just add code complexity for little tangible benefit.

I think that what we need is to modify the countrycode function such that it:

  1. Records the kind of input that the user supplies (polars series, pandas series, or python list)
  2. Converts that input to a simple python list, on which we operate internally.
  3. Does all the conversion needed (either through regex or simple matching). Currently, this is done in Polars, but we want to get rid of that dependency and use basic Python commands. Operating on lists should be fast enough for our needs. This mostly involves changing the replace_exact() and replace_match() functions. For example, here we use map_dict(). That's a Polars method but we want to use a base Python function instead. Same here
  4. Convert the list back to the same kind of object that the user had supplied at the start. We already do something similar here.

That way, the user gets the same kind of object that they supplied --- so no surprises. Also, there is only one "conversion engine" that we use internally, which keeps the pycountrycode codebase simple.

Internally, maybe we want to read the codelist CSV file into a dictionary where keys represent the codename and values are lists of values, all of equal length. That way, we can match codes by position.

I expect this to be a pretty small PR with relatively few lines of code changed.

Is that clearer?

LamAdr commented 12 months ago

Ok yeah I get it now. Sorry about that, I will make another PR.