Closed mitar closed 7 years ago
This is not in scope for pandas 1.x; it could be done in a subclass.
Virtually all operations return new objects. Simply don't use inplace flags or in-place indexing, and you have de facto immutability.
You can also use pandas.util.hash_pandas_object to make data hashes as well.
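As a rough illustration of the hashing suggestion (the frame and column names here are made up for the example), hashing before and after a write shows which rows changed. It only detects the change; it does not prevent it.

```python
import pandas as pd

# Example data; the column names are arbitrary.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# One uint64 hash per row; index=True folds the index label into each row hash.
before = pd.util.hash_pandas_object(df, index=True)

# An in-place write to a single cell.
df.loc[0, "a"] = 100
after = pd.util.hash_pandas_object(df, index=True)

# Row labels whose hash differs between the two snapshots.
changed_rows = before.index[before != after].tolist()
print(changed_rows)
```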
Oh. :-( As a subclass it is pretty tricky, because you have to make sure you shadow every pandas method that can potentially change internals. Hashing is also not enough. It can tell you that something changed, but it cannot prevent the change.
Is there a way to tell pandas to create a pandas object using a subclass?
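For what it's worth, the pandas docs on extending pandas describe a _constructor property that subclasses can override so that operations return the subclass. A minimal sketch, where TrackedFrame is a hypothetical name and nothing in it adds immutability by itself:

```python
import pandas as pd


class TrackedFrame(pd.DataFrame):
    """Hypothetical subclass; it only demonstrates type propagation."""

    @property
    def _constructor(self):
        # pandas calls this when an operation needs to build a new frame.
        return TrackedFrame


tf = TrackedFrame({"a": [1, 2, 3]})
head = tf.head(2)   # a new object, built through _constructor
copy = tf.copy()

print(type(head).__name__, type(copy).__name__)
```

This only covers objects derived from an existing subclass instance; frames created elsewhere (for example by pd.read_csv) would still be plain DataFrames.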
@mitar @jreback @TomAugspurger I know of this third-party package addressing this issue: static-frame (https://github.com/InvestmentSystems/static-frame). Do you know of other packages as well?
Nope.
Is there also no tooling support (e.g. a pylint extension) to ensure people don't mistakenly introduce bugs into functions which take a dataframe as input, manipulate it, and output the manipulated dataframe?
For functions, a workaround could be to implement a decorator for dynamic analysis. It would have to check which args and kwargs are dataframes or series. For a single dataframe df_in, for example, it would have to calculate the hash of the input before and after the wrapped function is called (df_in_hash_before = pd.util.hash_pandas_object(df_in, index=True), df_in_hash_after = pd.util.hash_pandas_object(df_in, index=True)) and fail an assertion if the hashes differ (pd.testing.assert_series_equal(df_in_hash_before, df_in_hash_after)).
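A minimal sketch of such a decorator, under the assumptions stated in this comment. The name assert_inputs_unmodified and the two example functions are hypothetical, and the check only sees DataFrame/Series objects passed directly as arguments, not ones nested inside lists or dicts.

```python
import functools

import pandas as pd


def assert_inputs_unmodified(func):
    """Hypothetical decorator: fail if func mutates a DataFrame/Series argument."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Track every positional and keyword argument that is a pandas object.
        tracked = [
            obj
            for obj in list(args) + list(kwargs.values())
            if isinstance(obj, (pd.DataFrame, pd.Series))
        ]
        before = [pd.util.hash_pandas_object(obj, index=True) for obj in tracked]

        result = func(*args, **kwargs)

        for obj, old_hash in zip(tracked, before):
            new_hash = pd.util.hash_pandas_object(obj, index=True)
            # Raises AssertionError if any row hash or the number of rows changed.
            pd.testing.assert_series_equal(old_hash, new_hash)
        return result

    return wrapper


@assert_inputs_unmodified
def add_double(df):
    # Returns a new frame; the input is left untouched.
    return df.assign(double=df["a"] * 2)


@assert_inputs_unmodified
def overwrite_first(df):
    # Mutates the caller's frame in place, which the decorator should flag.
    df.loc[0, "a"] = 99
    return df
```

Calling add_double(df) passes silently, while overwrite_first(df) raises an AssertionError after the wrapped function has run. The mutation itself still happens, so this fits tests rather than production guards. As far as I can tell, the row hashes cover values and index but not column labels, so a pure column rename would likely go unnoticed.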
Yes, I gave up on this. I find it really sad because pandas is almost there. Many methods have an inplace argument, and it would be great if you could just enforce this to be False and prevent any other modifications, so that you get a copy every time by design (when enabled).
Getting this into pandas would be too optimistic, I guess. I'm thinking about implementing the decorator and publishing it in a package, pytest-pandas. This would allow adding dynamic analysis of "mutability conformity" during tests.
It seems there is currently no option similar to numpy's setflags(write=False) to make a pandas dataframe completely immutable. We are considering a design where we use immutability to know that we can cache objects. While we can still design a system like that, it would be great if we could enforce immutability to catch any errors.
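For reference, this is the numpy behaviour being referred to, shown on a plain array. It is only a sketch of the numpy side; it makes no claim about how pandas treats read-only arrays internally.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# Mark the array read-only; arr.flags.writeable = False has the same effect.
arr.setflags(write=False)

try:
    arr[0] = 10.0
    write_blocked = False
except ValueError:
    # numpy refuses the assignment because the destination is read-only.
    write_blocked = True

print(write_blocked, arr.flags.writeable)
```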