pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.54k stars 1.88k forks source link

Make numpy arrays immutable after passing it as zero-copy to Polars #17593

Open coastalwhite opened 2 months ago

coastalwhite commented 2 months ago

Description

Following discussion in the Issue Triage. We would make numpy arrays immutable if they are passed as zero-copy to Polars.

An example can be seen here: https://stackoverflow.com/a/5541452

stinodego commented 2 months ago

I wasn't at the issue triage, but this is not the right approach.

pl.from_numpy(array) should not change the properties of array. Just like df.to_numpy() should not change the properties of df.

We should add the parameter combo allow_copy / writable or some other API to allow users control whether or not a copy is made. But changing the array properties is a no-go in my opinion.

ritchie46 commented 2 months ago

I strongly disagree here. If we set the flag and it is a problem for you, you got an error. Which is good, because you have a bug in your query. Data mutating on a distance. It causes all kinds of problems. Our statistics are then wrong and you get nonsense results.

The solution: copy the data on input. Then we don't set a flag. There is no valid reason you want to keep a write flag to data you send zero copy into another program.

So setting the flag will only cause problems if there are problems. Catching problems is good.

BTW: we can restore the flag once Polars doesn't memory map the data anymore.

stinodego commented 2 months ago

If that type of memory safety is an issue, we can detect writable inputs and either raise ("set to non-writable first") or make a copy, with some parameter(s) to control this behavior.

Setting flags on user inputs and then resetting them later at a later point in the program are side effects that are going to lead to hard-to-detect bugs. That's spooky action at a distance that we really shouldn't want.

ritchie46 commented 2 months ago

That means we always have to copy or always raise on input on a numpy array as numpy arrays are typically always writeable. I want first class numpy support and zero-copy without users requiring to set flags.

Setting a flag from writeable to readably goes from a less constraint condition to a more constraint condition. The only side effect is is that you cannot write for the duration that Polars holds the lock. It is a logical mutex. If you do you get an error raised, not invalid data, not invalid results.

In a general case, I agree with you, but you cannot write to data you borrow and I think this a case where we have to be practical.