nanne-aben / strictly_typed_pandas

MIT License
85 stars 7 forks source link

Suggestion: exclude ClassVar from schema #226

Closed marc-precisionlife closed 2 days ago

marc-precisionlife commented 4 days ago

Hi there,

I'm using strictly typed pandas and one thing I've encountered is that it's not possible to declare class variables, for example the below raises an error:

from strictly_typed_pandas import DataSet
from typing import ClassVar

class A:
    a: int
    b: ClassVar[str] = 'abc'

ds = DataSet[A]({'a': [1,2,3]})

Would it be possible to exclude classvars from the schema? Looking at the code it seems pretty doable, either by adding to the dataset init or schema validation.

something like the below:

from typing import ClassVar, get_origin
import inspect

class A:
    a: int
    b: ClassVar[str] = 'abc'

ann = inspect.get_annotations(A, eval_str=True)

ann = {
    var_name: var_type
    for var_name, var_type in ann.items() if get_origin(var_type) is not ClassVar
}

This would mean if someone tries to define a ClassVar and member variable the result would depend on the order of declaration but... Don't.

I'll try to make a PR this evening. Otherwise very useful module.

nanne-aben commented 3 days ago

Hi @marc-precisionlife,

If I run your code:

from strictly_typed_pandas import DataSet
from typing import ClassVar

class A:
    a: int
    b: ClassVar[str] = 'abc'

ds = DataSet[A]({'a': [1,2,3]})

I get the following error:

TypeError: Schema contains the following columns not present in data: {'b'}

Arguably, that's not wrong. The schema does indeed contain a b variable that's not present in the DataSet.

I see you argue that basically any ClassVar should be ignored. Could you give a bit more context as to why you'd like to do that? I'm wondering if there's an alternative way in which we could solve this that doesn't require us outright ignoring class variables.

alwaysmpe commented 3 days ago

Hi (same user, personal account),

There are a few things, mostly convenience/simplicity/self documenting code. We're working with multiple datasets, each dataset is made up of a reasonably large (~20) number of schemas/files. Using it to add a classvar of the associated file name plus a mapping from file column name to loaded name. So I would like to be able to do something like:

from strictly_typed_pandas import DataSet
from typing import ClassVar
imort inspect
import pandas as pd

def load_typed_ds[T](
        path: Path,
        name_map: dict[str, str],
        loaded_type: type[T],
) -> Dataset[T]:
    mro = inspect.getmro(loaded_type)
    annotations = {}
    for cls in mro:
        annotations |= inspecct.get_annotations(cls, eval_str=True)
    read_types = {
        csv_name: annotations[load_name]
        for csv_name, load_name in name_map.items()
    }

    input_df = pd.read_csv(
        path,
        dtype=read_types,
    ).rename(columns=name_map)
    return DataSet[loaded_type](input_df)

class AData:
    file_name: ClassVar[str] = 'a_data.csv'
    name_map: ClassVar[dict[str, str]] = {
        'foo': 'bar',
    }
    bar: int

def load_a_for_dataset(path: Path) -> DataSet[AData]:
    data_path = path / AData.file_name
    return load_typed_ds(
        data_path,
        AData.name_map,
        AData,
    )

Class properties were removed in python 3.11, so that's not an option. It's possible to just declare that data outside of the class, but it's a bit more verbose. Doing the above you could just reduce it to a single function that takes a path and a type.

The closest parallel in the standard library is dataclasses which work as I've suggested:

from dataclasses import dataclass
from typing import ClassVar

@dataclass
class AData:
    file_name: ClassVar[str] = 'a_data.csv'
    name_map: ClassVar[dict[str, str]] = {
        'foo': 'bar',
    }
    bar: int

print(AData.file_name)
a = AData(bar=4)
print(a.bar)

produces

a_data.csv
4

While DataSets can't work in exactly the same way as dataclasses under python's current type system, they're not dissimilar.

nanne-aben commented 3 days ago

Right, if dataclasses also exclude these ClassVar cases, that does make it more compelling. Also, there's no risk of backward compatibility issues, since any column annotated with ClassVar will currently throw an error (as it doesn't map to a pandas type). Let's add this indeed! :)

nanne-aben commented 2 days ago

Thanks for your contribution! I've merged the PR to the main branch.

I can't make a new release at the moment, something is wrong with the release pipeline... https://github.com/nanne-aben/strictly_typed_pandas/actions/runs/9668117860/job/26671541650

I'll look into it later... If you need it now, you could install it directly from the main branch on github.

alwaysmpe commented 2 days ago

Great, I'm using Poetry to manage my package, I think your repository layout confuses it a bit as it's not sure whether the repository contains typeguard or strictkly-typed-pandas so I'll probably wait till the release. I did have a look at the error but I've got no idea so probably best to leave to you 😆

nanne-aben commented 2 days ago

Turns out just running the pipeline again solved the issue 🙈 I've just made a new release, if you update your version of strictly_typed_pandas, it should contain your contribution now!