rsundqvist / id-translation

Turn meaningless IDs into human-readable labels.
MIT License

Translating as `pandas.Categorical` #225

Open rsundqvist opened 5 months ago

rsundqvist commented 5 months ago

The function below works but is limited.

import pandas as pd
from id_translation.offline import TranslationMap

def translate_as_categories(df: pd.DataFrame, tmap: TranslationMap) -> pd.DataFrame:
    from id_translation.dio import resolve_io

    dtypes = {
        # sort_index() to ensure ordering by ID
        column: pd.CategoricalDtype(pd.Series(tmap[column]).sort_index(), ordered=True) 
        for column in df
    }
    return resolve_io(df).insert(df, names=list(df), tmap=tmap, copy=False).astype(dtypes)

It is not very convenient, though, and it requires some knowledge of internal `id_translation` types.
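The dtype construction inside the function can be looked at in isolation. Below is a minimal sketch using plain pandas only, assuming the translations for a column are available as an ordinary ID-to-label `dict`. The `translations` dict and its `id:name` label format are stand-ins for what `tmap[column]` holds, not something read from the library.

```python
import pandas as pd

# Assumed stand-in for tmap["people"]: a plain mapping of ID -> translated label.
translations = {1999: "1999:Sofia", 1991: "1991:Richard"}

# Sort by ID (the dict keys become the index) so that category order follows ID order.
categories = pd.Series(translations).sort_index().tolist()
dtype = pd.CategoricalDtype(categories, ordered=True)

# Translate a column of raw IDs, then cast it to the ordered categorical dtype.
ids = pd.Series([1999, 1999, 1991, 1999], name="people")
translated = ids.map(translations).astype(dtype)
```

Because the categories are listed in ID order, the integer codes behind `translated` follow the same order as the original IDs.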

  1. Setup

    >>> data = {1999: "Sofia", 1991: "Richard"}
    
    >>> from id_translation import Translator
    >>> translator = Translator({"people":  data})
    >>> translator
    Translator(online=False: cache=TranslationMap('people': 2 IDs))

  2. Create data

    >>> df = pd.Series(list(data)).to_frame("people")
    >>> df = df.sample(4, replace=True).reset_index(drop=True)
    >>> df.T
    people  1999  1999  1991  1999

  3. Apply

    >>> df = translate_as_categories(df, translator.cache)

  4. Result

    >>> df.T
    people  1999:Sofia  1999:Sofia  1991:Richard  1999:Sofia
    
    >>> df["people"].dtype
    CategoricalDtype(categories=['1991:Richard', '1999:Sofia'], ordered=True, categories_dtype=object)
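To illustrate what the ordered dtype in the result makes possible, here is a small pandas-only sketch. It rebuilds the translated column by hand from the values shown above rather than calling `translate_as_categories`, so it does not depend on `id_translation` being installed.

```python
import pandas as pd

# Same dtype as in the result above, written out by hand.
dtype = pd.CategoricalDtype(["1991:Richard", "1999:Sofia"], ordered=True)
people = pd.Series(
    ["1999:Sofia", "1999:Sofia", "1991:Richard", "1999:Sofia"],
    dtype=dtype,
    name="people",
)

# Sorting follows category order (here: ID order), not raw string order.
sorted_people = people.sort_values()

# Ordered categoricals support comparisons against a category label.
is_after_1991 = people > "1991:Richard"

# Counts are reported per category, including categories with zero rows.
counts = people.value_counts(sort=False)
```

For this data, ID order and alphabetical order happen to coincide, so the benefit of `ordered=True` would only become visible with labels that do not start with the ID.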

Maybe it's enough to add this to the documentation/examples.

rsundqvist commented 5 months ago

Issues with naïve solution