Open tabassco opened 1 year ago
Thanks for the suggestion! I agree with the risk, I probably wouldn't want to make this default behavior...
I do realize it's a pretty common pattern though: you do a bunch of transformations, and then you want to subset to the columns of a schema and cast the type to DataSet[Schema].
In my other package (typedspark: typing for pyspark), we've addressed this with a transform_to_schema(df, schema) function, which does exactly that.
It also can take a third optional argument, which is a dictionary of transformations (i.e. assign() transformations, which we found we needed to do pretty often as well).
This is the documentation https://typedspark.readthedocs.io/en/1.0.19/transforming_datasets.html
Do you think something like this would be interesting to include in strictly_typed_pandas as well? Does this address your question?
Sorry for leaving this dormant for so long. Has this been picked up in the meantime?
I'd think your proposal is the right way to go about this and would like to contribute the PR.
No worries! Sorry for my late reply as well, I was on vacation :)
I haven't worked on this, but I'd welcome your contribution! Feel free to make a PR. Let me know if you need any help!
Proposal

When passing a DataFrame into a DataSet where the DataFrame forms a superset of the columns mandated in the DataSet, it would be nice to be able to automatically drop the extra columns for the new object.

Example:
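The original example snippet did not survive here, so the following is only an illustrative sketch of the proposed behavior in plain pandas. The schema class and the column-dropping step are illustrative; today, constructing a `DataSet` from a superset DataFrame raises instead of dropping:

```python
import pandas as pd

# Hypothetical schema declared via class annotations
class Schema:
    id: int
    name: str

# A DataFrame that is a superset of the schema's columns
df = pd.DataFrame({"id": [1], "name": ["a"], "debug_flag": [True]})

# What the proposed behavior would amount to internally:
# drop the columns the schema does not mandate before validating.
subset = df[list(Schema.__annotations__)]
# subset now contains exactly the columns ["id", "name"]
```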
Risk

One might still want to know if a passed DataFrame contained more than the expected columns.
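One way to mitigate that risk, if dropping ever became opt-in behavior, would be to surface the extra columns rather than discard them silently. A sketch, using a hypothetical helper and schema (neither exists in strictly_typed_pandas):

```python
import warnings
import pandas as pd

# Hypothetical schema declared via class annotations
class Schema:
    id: int
    name: str

def warn_on_extra_columns(df: pd.DataFrame, schema: type) -> None:
    # Report any columns beyond what the schema mandates,
    # so the superset is visible even if it is later dropped.
    extra = set(df.columns) - set(schema.__annotations__)
    if extra:
        warnings.warn(f"DataFrame contains unexpected columns: {sorted(extra)}")

df = pd.DataFrame({"id": [1], "name": ["a"], "extra": [0.5]})
warn_on_extra_columns(df, Schema)
```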