nanne-aben / strictly_typed_pandas

MIT License

Automatically reduce DataFrames #64

Open tabassco opened 10 months ago

tabassco commented 10 months ago

Proposal

When passing a DataFrame into a DataSet, where the DataFrame's columns form a superset of those mandated by the DataSet, it would be nice for the extra columns to be dropped automatically for the new object.

Example:

import pandas as pd
from strictly_typed_pandas import DataSet

class Subset(DataSet):
    id: int
    name: str

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["Hans", "Peter"],
    "age": [10, 10]
})

# currently this fails because of the extra "age" column;
# the proposal is to drop "age" automatically instead
ds = DataSet[Subset](df)

Risk

One might still want to know that a passed DataFrame contained more columns than expected, so dropping them silently could hide bugs.

nanne-aben commented 10 months ago

Thanks for the suggestion! I agree with the risk; I probably wouldn't want to make this the default behavior...

I do realize it's a pretty common pattern though: you do a bunch of transformations, and then you want to subset to the columns of a schema and cast the type to DataSet[Schema].

In my other package (typedspark: typing for pyspark), we've addressed this with a transform_to_schema(df, schema) function, which does exactly that.

It can also take an optional third argument: a dictionary of transformations (i.e. assign() transformations, which we found we needed pretty often as well).

This is the documentation https://typedspark.readthedocs.io/en/1.0.19/transforming_datasets.html

Do you think something like this would be interesting to include in strictly_typed_pandas as well? Does this address your question?

tabassco commented 1 month ago

Sorry for leaving this dormant for so long. Has this been picked up in the meantime?

I think your proposal is the right way to go about this, and I'd like to contribute the PR.

nanne-aben commented 1 month ago

No worries! Sorry for my late reply as well, I was on vacation :)

I haven't worked on this, but I'd welcome your contribution! Feel free to make a PR. Let me know if you need any help!