Closed marc-precisionlife closed 5 months ago
Hi @marc-precisionlife,
If I run your code:
from strictly_typed_pandas import DataSet
from typing import ClassVar
class A:
a: int
b: ClassVar[str] = 'abc'
ds = DataSet[A]({'a': [1,2,3]})
I get the following error:
TypeError: Schema contains the following columns not present in data: {'b'}
Arguably, that's not wrong. The schema does indeed contain a b
variable that's not present in the DataSet
.
I see you argue that basically any ClassVar
should be ignored. Could you give a bit more context as to why you'd like to do that? I'm wondering if there's an alternative way in which we could solve this that doesn't require us outright ignoring class variables.
Hi (same user, personal account),
There are a few things, mostly convenience/simplicity/self documenting code. We're working with multiple datasets, each dataset is made up of a reasonably large (~20) number of schemas/files. Using it to add a classvar of the associated file name plus a mapping from file column name to loaded name. So I would like to be able to do something like:
from strictly_typed_pandas import DataSet
from typing import ClassVar
imort inspect
import pandas as pd
def load_typed_ds[T](
path: Path,
name_map: dict[str, str],
loaded_type: type[T],
) -> Dataset[T]:
mro = inspect.getmro(loaded_type)
annotations = {}
for cls in mro:
annotations |= inspecct.get_annotations(cls, eval_str=True)
read_types = {
csv_name: annotations[load_name]
for csv_name, load_name in name_map.items()
}
input_df = pd.read_csv(
path,
dtype=read_types,
).rename(columns=name_map)
return DataSet[loaded_type](input_df)
class AData:
file_name: ClassVar[str] = 'a_data.csv'
name_map: ClassVar[dict[str, str]] = {
'foo': 'bar',
}
bar: int
def load_a_for_dataset(path: Path) -> DataSet[AData]:
data_path = path / AData.file_name
return load_typed_ds(
data_path,
AData.name_map,
AData,
)
Class properties were removed in python 3.11, so that's not an option. It's possible to just declare that data outside of the class, but it's a bit more verbose. Doing the above you could just reduce it to a single function that takes a path and a type.
The closest parallel in the standard library is dataclasses which work as I've suggested:
from dataclasses import dataclass
from typing import ClassVar
@dataclass
class AData:
file_name: ClassVar[str] = 'a_data.csv'
name_map: ClassVar[dict[str, str]] = {
'foo': 'bar',
}
bar: int
print(AData.file_name)
a = AData(bar=4)
print(a.bar)
produces
a_data.csv
4
While DataSets can't work in exactly the same way as dataclasses under python's current type system, they're not dissimilar.
Right, if dataclasses also exclude these ClassVar
cases, that does make it more compelling. Also, there's no risk of backward compatibility issues, since any column annotated with ClassVar
will currently throw an error (as it doesn't map to a pandas type). Let's add this indeed! :)
Thanks for your contribution! I've merged the PR to the main branch.
I can't make a new release at the moment, something is wrong with the release pipeline... https://github.com/nanne-aben/strictly_typed_pandas/actions/runs/9668117860/job/26671541650
I'll look into it later... If you need it now, you could install it directly from the main branch on github.
Great, I'm using Poetry to manage my package, I think your repository layout confuses it a bit as it's not sure whether the repository contains typeguard
or strictkly-typed-pandas
so I'll probably wait till the release. I did have a look at the error but I've got no idea so probably best to leave to you 😆
Turns out just running the pipeline again solved the issue 🙈 I've just made a new release, if you update your version of strictly_typed_pandas, it should contain your contribution now!
Hi there,
I'm using strictly typed pandas and one thing I've encountered is that it's not possible to declare class variables, for example the below raises an error:
Would it be possible to exclude classvars from the schema? Looking at the code it seems pretty doable, either by adding to the dataset init or schema validation.
something like the below:
This would mean if someone tries to define a
ClassVar
and member variable the result would depend on the order of declaration but... Don't.I'll try to make a PR this evening. Otherwise very useful module.