Closed grantmcdermott closed 1 year ago
Very interesting, this does not ring a bell to me
If it was the $collect()-method behaving strangely, I would immediately start debugging the concurrency bridge between single-threaded R and multi-threaded polars.
Since it already happens at scan_parquet() which is a quite slim wrapper rust-polars scan_parquet it is most likely between rust-polars and the vs code R IDE.
I suppose the file globbing makes rust-polars make a lot of consecutive file reads. Maybe the IDE is interacting with the files leading to some contention for the OS filesystem. Maybe you could try place the files somewhere outside the IDE workspace to not allow this.
@grantmcdermott I can sometimes but not always replicate the problem (elapsed ~2.5s vs. 0.003s). At first, I got slow results when I started the terminal "interactively" using the R extension, by executing a first R command from a script. Then, I tried to open a fresh terminal window manually and pasted the code in. That was very fast. But now when I source("myscript.R")
, I get fast results, even when I open the terminal with R extension within VS code.
Not sure what to think of all this...
Thanks both.
@vincentarelbundock Yeah, that's an additional layer of weirdness. FWIW the source("myscript.R)
trick doesn't work for me. It's still just as slow.
Tagging @renkun-ken in case he has any ideas.
EDIT: @sorhawell, sorry I forgot to mention two things in reply to your comment: 1) Calling files from outside of the project workspace still runs into the same problem. 2) I haven't tried yet with native rust-polars. But I don't hit a slowdown when executing equivalent commands with py-polars in VS Code's interactive Python environment.
All things considered, I think your contention race idea is the most likely culprit. But how or why it would get triggered through the R Extension I have no idea.
@grantmcdermott It could also be some of the previous LazyFrame S3 methods that were implicitly collecting, which may have been triggered by the IDE. That would explain why immediately collect was fast enough.
I suspect that the internal working of vscode-R session watcher does not handle the "uncollected" object well as it uses task callbacks to write the structure of the object on each top-level user input.
@grantmcdermott Would you like to take a look at file.size(.vsc$workspace_file)
and see if the file is quite large once nyc
is obtained?
Would you like to use system.time()
as in
system.time(nyc <- scan_parquet("/Users/gmcd/taxi-data/*/*/*.parquet"))
to measure the time so that we could know if it is the task callback after this expression that takes much time.
To narrow down the cause, would you like to disable session watcher and see if the slowdown still exists?
Let me take a closer look at the uncollected object and see if I could reproduce it somehow.
BTW, how many rows and columns are there in nyc
, @grantmcdermott?
I could reproduce the problem in vscode with the following code:
Prepare some data in parquet files:
n <- 1e6
df <- data.frame(id = seq_len(n))
df$group <- df$id %% 100
for (i in 1:100) {
df[[paste0("x", i)]] <- runif(n)
}
arrow::write_dataset(df, "data", partitioning = "group")
Use arrow::open_dataset()
:
t0 <- proc.time()
system.time(df1 <- arrow::open_dataset("data"))
(toc = proc.time() - t0)
user system elapsed
0.160 0.011 0.185
user system elapsed
0.520 0.054 0.621
Use polars::scan_parquet()
:
t0 <- proc.time()
system.time(df2 <- polars::scan_parquet("./data/*/*.parquet"))
(toc = proc.time() - t0)
user system elapsed
0.337 0.002 0.368
user system elapsed
19.693 2.917 2.616
The problem seems to be that calling names()
on polars's LazyFrame
object is very expensive:
> system.time(names(df1))
user system elapsed
0.000 0.000 0.001
> system.time(names(df2))
user system elapsed
4.351 0.343 0.421
And the timing persists each time. This mostly explains why a significant delay can be observed after each user input, since each LazyFrame
in the global environment will take a significant amount of time to respond to each call of names()
, and vscode-R session watcher just calls it on every global object on each top-level user input.
Super, thanks for digging deeper @renkun-ken. I'm at the airport right now and might not be able to test more until the weekend is over. But happy to troubleshoot / test further once I'm back.
FYI, the names.LazyFrame()
method used to collect the whole data automatically to check column names. This happened on every names()
call, so that may explain the difference in timing and general slugishness.
However, a commit from yesterday removed names.LazyFrame()
altogether. Calling names(df2)
will now return NULL
instantly if the data is uncollected. (Thanks to @eitsupi for anticipating this potential problem.)
vscode-R session watcher won't be able to decipher the structure of these objects, but it should no longer spend much time trying...
Fix confirmed with the latest series of commits. Execution time is equivalent (equally fast) whether it is run from VS Code or directly from the terminal. Thanks all!
Has anyone else experienced this? I've confirmed that the same thing is happening for me on both Mac and Linux (Arch).
I first noticed it when scanning a large number of parquet files. However, I don't think the number of files is a problem per se. It's rather the total size. For the below example, I'm scanning 1 month of NYC taxi data that I downloaded from AWS S3 like so:
Now, let's compare timings for a simple scan operation.
First, running through a regular R terminal:
Second, running through VS Code:
(So, a ~650x slowdown).
What's really weird is that this slowdown disappears if I collect immediately. (Which you'd obviously think would be more expensive.) Here's the same chunk again from VS Code, except I chain
$collect()
at the end of the call.(Huh? Slowdown disappears.)
Some additional notes:
It's not just scanning itself: Computation operations are also affected if they are applied to scanned-but-not-yet-collected objects. E.g. running
nyc$groupby("passenger_count")$agg(pl$col("fare_amount")$mean())$collect()
adds another 10 seconds to the VS code chunk, but a fraction of a second for the R terminal on.The problem appears to get increasingly worse with the size of datasets. E.g. Downloading another month's worth of taxi data and scanning from VS code more than doubles the wait time. In my RL application, operations that should normally take <= 1 sec are taking > 1 min.
Looking at
htop
while this is all going on shows that my memory swap partition is being crushed (completely used up) and all cores are active throughout.Another odd thing is that the slowdown only occurs if I'm sending code from a script (i.e., using
cmd return
to execute the highlight lines). If I manually open up an integrated terminal inside VS Code, start R and run the same commands, there's no problem and it runs at normal speed.I've confirmed that the problem is not dependent on the type of R terminal (e.g., default or radian).
I'm stumped. Any ideas? I'm also happy to move this to the VS Code R Extension if that's where it belongs. But would be good to have this audience confirm first.