pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Lock files while scanning #11018

Open stinodego opened 11 months ago

stinodego commented 11 months ago

Problem description

As mentioned in #11006

Altering a file while scanning it may result in unexpected behavior. We should not allow writing to a file that is being scanned.
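For reference, the proposed behavior can be sketched with POSIX advisory locks: a scan takes a shared lock, and a writer's non-blocking exclusive lock attempt then fails instead of silently clobbering the file mid-read. This is a hypothetical pure-Python illustration (Unix-only, via the stdlib `fcntl` module), not Polars' actual Rust implementation; `acquire_read_lock` and `try_exclusive_write_lock` are made-up helper names.

```python
import fcntl


def acquire_read_lock(path):
    """Open `path` and take a shared advisory lock on it.

    Shared locks coexist, so multiple concurrent scans of the
    same file are still allowed."""
    f = open(path, "rb")
    fcntl.flock(f, fcntl.LOCK_SH)
    return f


def try_exclusive_write_lock(f) -> bool:
    """Attempt a non-blocking exclusive lock for writing.

    Returns False if any reader currently holds a shared lock,
    which is exactly the "don't overwrite a file being scanned"
    guarantee discussed in this issue."""
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
```

Note these are advisory locks: they only protect against writers that also check the lock, which is why they would have to live inside the engine itself.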

avimallu commented 11 months ago

When will the file lock be released by Polars in either eager or lazy mode?

I see plenty of use cases where a script saves a file to disk and then loads it back via Polars later in the same script; with a lock, that pattern would start throwing errors while debugging. Does that mean Polars will not be able to save a file under the same name in the same session because it was read earlier?

cjackal commented 11 months ago

Another thought: a file lock does not make sense for remote objects (which the scan_* functions accept as input), creating another functional discrepancy between scan operations on local and remote files (i.e. whether pl.scan_parquet(path) raises would depend on the location of path).
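The discrepancy above boils down to the fact that OS file locks only exist for local paths, so any lock-while-scanning feature would have to branch on the path's scheme. A hypothetical stdlib sketch of that check (`is_lockable_path` is an illustrative helper, not a Polars API):

```python
from urllib.parse import urlparse


def is_lockable_path(path: str) -> bool:
    """Return True if `path` refers to the local filesystem, where an
    advisory file lock could be taken.

    Remote object stores (s3://, gs://, https://, ...) have no such
    locking primitive, so a scan of those could never raise on a
    concurrent write."""
    scheme = urlparse(path).scheme
    # An empty scheme is a plain relative/absolute path; a one-letter
    # scheme is most likely a Windows drive letter like C:\data.
    return scheme in ("", "file") or len(scheme) == 1
```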

ritchie46 commented 11 months ago

It lives as long as the query runs. You should not write to the file you read.

In any case we must do something smart here. We cannot lock every file in a query plan, or the OS might complain that we hold too many locks. We had locks in the past and I removed them for that reason. Maybe we should lock only the first/last file in a union created by a glob match.
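The first/last idea could look roughly like this, sketched with stdlib advisory locks rather than the actual Rust implementation (assumptions: `lock_bounding_files` and `release` are illustrative names, and the glob matches are sorted the same way the union orders them):

```python
import fcntl
import glob


def lock_bounding_files(pattern):
    """Take shared (read) locks on only the first and last files
    matched by `pattern`, bounding the number of held locks at two
    regardless of how many files the glob expands to."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        return []
    # A set deduplicates when the glob matched a single file.
    targets = {paths[0], paths[-1]}
    handles = []
    for path in sorted(targets):
        f = open(path, "rb")
        fcntl.flock(f, fcntl.LOCK_SH)  # shared: readers coexist
        handles.append(f)
    return handles


def release(handles):
    """Unlock and close the handles once the query has finished."""
    for f in handles:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
```

The trade-off is visible in the sketch: files in the middle of the union stay unlocked, so the scheme deters the common "rewrite the whole glob" mistake without exhausting OS lock limits.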

avimallu commented 11 months ago

It lives as long as the query runs.

Does that mean that the case in the linked issue:

import polars as pl

pl.DataFrame({'a': ['1'], 'b': ['1']}).write_parquet('tmp.parquet')
df = pl.scan_parquet('tmp.parquet')  # lazy: nothing is read yet
pl.DataFrame({'b': ['2'], 'a': ['2']}).write_parquet('tmp.parquet')  # overwrite before collecting
df.select('a').collect()  # materializes from the rewritten file

will not throw an error? Is the query considered running from the moment scan_parquet is called? And won't the lock prevent another scan_parquet call from reading the same file into a different DataFrame?
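One way to sidestep the hazard in the snippet above without any locking at all: write the new data to a temporary file and atomically rename it over the old one, so a reader that already opened the old file never observes a half-written one. A pure-stdlib sketch under the assumption that the writer controls its output path (`atomic_write` is an illustrative helper; the same write-to-temp-then-rename trick applies to write_parquet output):

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` in a single rename.

    Readers holding a descriptor to the old file keep reading the
    old bytes; new opens see the new bytes. No reader ever sees a
    partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing the temp file into the destination's directory (not the system temp dir) matters: os.replace is only atomic within a single filesystem.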