pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/
Other
405 stars 35 forks source link

scan_parquet very slow with large datasets from VS Code #135

Closed grantmcdermott closed 1 year ago

grantmcdermott commented 1 year ago

Has anyone else experienced this? I've confirmed that the same thing is happening for me on both Mac and Linux (Arch).

I first noticed it when scanning a large number of parquet files. However, I don't think the number of files is a problem per se. It's rather the total size. For the below example, I'm scanning 1 month of NYC taxi data that I downloaded from AWS S3 like so:

mkdir -p taxi-data/year=2012
aws s3 cp s3://ursa-labs-taxi-data/2012/01 taxi-data/year=2012/month=01 --recursive

Now, let's compare timings for a simple scan operation.

First, running through a regular R terminal:

library(polars)

tic = proc.time()
nyc = scan_parquet("/Users/gmcd/taxi-data/*/*/*.parquet")
(toc = proc.time() - tic)
#>  user  system elapsed 
#> 0.009   0.003   0.012

Second, running through VS Code:

library(polars)

tic = proc.time()
nyc = scan_parquet("/Users/gmcd/taxi-data/*/*/*.parquet")
(toc = proc.time() - tic)
#>   user  system elapsed 
#> 29.834  11.322   7.774 

(So, a ~650x slowdown).

What's really weird is that this slowdown disappears if I collect immediately. (Which you'd obviously think would be more expensive.) Here's the same chunk again from VS Code, except I chain $collect() at the end of the call.

library(polars)

tic = proc.time()
nyc = scan_parquet("/Users/gmcd/taxi-data/*/*/*.parquet")
(toc = proc.time() - tic)
#>  user  system elapsed 
#> 5.593   1.953   0.893 

(Huh? Slowdown disappears.)

Some additional notes:

I'm stumped. Any ideas? I'm also happy to move this to the VS Code R Extension if that's where it belongs. But would be good to have this audience confirm first.

sorhawell commented 1 year ago

Very interesting, this does not ring a bell to me

If it was the $collect()-method behaving strangely, I would immediately start debugging the concurrency bridge between single-threaded R and multi-threaded polars.

Since it already happens at scan_parquet() which is a quite slim wrapper rust-polars scan_parquet it is most likely between rust-polars and the vs code R IDE.

I suppose the file globbing makes rust-polars make a lot of consecutive file reads. Maybe the IDE is interacting with the files leading to some contention for the OS filesystem. Maybe you could try place the files somewhere outside the IDE workspace to not allow this.

vincentarelbundock commented 1 year ago

@grantmcdermott I can sometimes but not always replicate the problem (elapsed ~2.5s vs. 0.003s). At first, I got slow results when I started the terminal "interactively" using the R extension, by executing a first R command from a script. Then, I tried to open a fresh terminal window manually and pasted the code in. That was very fast. But now when I source("myscript.R"), I get fast results, even when I open the terminal with R extension within VS code.

Not sure what to think of all this...

grantmcdermott commented 1 year ago

Thanks both.

@vincentarelbundock Yeah, that's an additional layer of weirdness. FWIW the source("myscript.R) trick doesn't work for me. It's still just as slow.

Tagging @renkun-ken in case he has any ideas.

EDIT: @sorhawell, sorry I forgot to mention two things in reply to your comment: 1) Calling files from outside of the project workspace still runs into the same problem. 2) I haven't tried yet with native rust-polars. But I don't hit a slowdown when executing equivalent commands with py-polars in VS Code's interactive Python environment.

All things considered, I think your contention race idea is the most likely culprit. But how or why it would get triggered through the R Extension I have no idea.

sorhawell commented 1 year ago

@grantmcdermott It could also be some of the previous LazyFrame S3 methods that were implicitly collecting, which may have been triggered by the IDE. That would explain why immediately collect was fast enough.

renkun-ken commented 1 year ago

I suspect that the internal working of vscode-R session watcher does not handle the "uncollected" object well as it uses task callbacks to write the structure of the object on each top-level user input.

@grantmcdermott Would you like to take a look at file.size(.vsc$workspace_file) and see if the file is quite large once nyc is obtained?

Would you like to use system.time() as in

system.time(nyc <- scan_parquet("/Users/gmcd/taxi-data/*/*/*.parquet"))

to measure the time so that we could know if it is the task callback after this expression that takes much time.

To narrow down the cause, would you like to disable session watcher and see if the slowdown still exists?

renkun-ken commented 1 year ago

Let me take a closer look at the uncollected object and see if I could reproduce it somehow.

renkun-ken commented 1 year ago

BTW, how many rows and columns are there in nyc, @grantmcdermott?

renkun-ken commented 1 year ago

I could reproduce the problem in vscode with the following code:

Prepare some data in parquet files:

n <- 1e6
df <- data.frame(id = seq_len(n))
df$group <- df$id %% 100
for (i in 1:100) {
  df[[paste0("x", i)]] <- runif(n)
}

arrow::write_dataset(df, "data", partitioning = "group")

Use arrow::open_dataset():

t0 <- proc.time()
system.time(df1 <- arrow::open_dataset("data"))
(toc = proc.time() - t0)
   user  system elapsed 
  0.160   0.011   0.185 
   user  system elapsed 
  0.520   0.054   0.621

Use polars::scan_parquet():

t0 <- proc.time()
system.time(df2 <- polars::scan_parquet("./data/*/*.parquet"))
(toc = proc.time() - t0)
   user  system elapsed 
  0.337   0.002   0.368 
   user  system elapsed 
 19.693   2.917   2.616

The problem seems to be that calling names() on polars's LazyFrame object is very expensive:

> system.time(names(df1))
   user  system elapsed 
  0.000   0.000   0.001 

> system.time(names(df2))
   user  system elapsed 
  4.351   0.343   0.421 

And the timing persists each time. This mostly explains why a significant delay can be observed after each user input, since each LazyFrame in the global environment will take a significant amount of time to respond to each call of names(), and vscode-R session watcher just calls it on every global object on each top-level user input.

grantmcdermott commented 1 year ago

Super, thanks for digging deeper @renkun-ken. I'm at the airport right now and might not be able to test more until the weekend is over. But happy to troubleshoot / test further once I'm back.

vincentarelbundock commented 1 year ago

FYI, the names.LazyFrame() method used to collect the whole data automatically to check column names. This happened on every names() call, so that may explain the difference in timing and general slugishness.

However, a commit from yesterday removed names.LazyFrame() altogether. Calling names(df2) will now return NULL instantly if the data is uncollected. (Thanks to @eitsupi for anticipating this potential problem.)

vscode-R session watcher won't be able to decipher the structure of these objects, but it should no longer spend much time trying...

grantmcdermott commented 1 year ago

Fix confirmed with the latest series of commits. Execution time is equivalent (equally fast) whether it is run from VS Code or directly from the terminal. Thanks all!