posit-dev / positron

Positron, a next-generation data science IDE
https://positron.posit.co

Positron lags when running a model on big data #4636

Open alkat19 opened 2 months ago

alkat19 commented 2 months ago

System details:

Positron and OS details:

Positron Version: 2024.09.0 (Universal) build 1 Code - OSS Version: 1.92.0 Commit: f37f4f5044a2a619e73d5db61a31e37fbd3faf18 Date: 2024-09-03T02:37:20.474Z (1 wk ago) Electron: 30.1.2 Chromium: 124.0.6367.243 Node.js: 20.14.0 V8: 12.4.254.20-electron.0 OS: Darwin arm64 23.6.0

Interpreter details:

R 4.4.1

Describe the issue:

Positron lags a lot when fitting a model (glm) on a data frame of about 2 million rows.

Steps to reproduce the issue:

# Replicate the openintro::nycflights dataset to reach 2mil rows approx
library(openintro)
#> Loading required package: airports
#> Loading required package: cherryblossom
#> Loading required package: usdata
library(tidyverse)

dim(nycflights)
#> [1] 32735    16

dat <- rbind(nycflights,nycflights)
dat <- rbind(dat,dat)
dat <- rbind(dat,dat)
dat <- rbind(dat,dat)
dat <- rbind(dat,dat)
dat <- rbind(dat,dat)

# Create a meaningless binary outcome based on hour:
# data <- data %>% mutate(outcome = if_else(hour > 15, 0, 1))
dat <- dat |>
  mutate(outcome = if_else(hour > 15, 0, 1))

# Positron lags a lot when running a (glm) model in a data of 2mil rows
dim(dat)

# Run a random glm:
model <- glm(
  outcome ~ day + month + dep_time + air_time + distance,
  family = "binomial", 
  data = dat
)

Expected or desired behavior:

There should be no lag apart from the time it takes the model to run, as in RStudio or VS Code.

Were there any error messages in the UI, Output panel, or Developer Tools console?

No

jennybc commented 2 months ago

I've restated your example in runnable code. Can you confirm this is accurate? (Replicating openintro::nycflights creates a data frame with 200,000 rows, not 2 million.)

Here's what I see, which doesn't seem very surprising to me: no big lag while running the model.

https://github.com/user-attachments/assets/8fab7218-6afd-4f4b-bcd1-c05d3591d094

If you're having a really different experience, can you capture a screen recording?
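
For reference, the two row counts discussed here follow from simple arithmetic: stacking six copies gives 6 × 32,735 rows, while doubling the data six times gives 2^6 × 32,735 rows. A minimal base R check, assuming only the 32,735 figure from the `dim()` output (the variable names and the toy data frame are illustrative, not from the original post):

```r
# Row count of openintro::nycflights, taken from the dim() output in the thread
n_rows <- 32735

# Stacking six copies side by side in one rbind() call
six_copies <- 6 * n_rows

# Doubling six times with dat <- rbind(dat, dat)
six_doublings <- 2^6 * n_rows

six_copies     # 196410
six_doublings  # 2095040

# The same doubling pattern on a toy data frame (hypothetical stand-in data)
toy <- data.frame(id = 1:5)
for (i in 1:6) {
  toy <- rbind(toy, toy)
}
nrow(toy)      # 320, i.e. 5 * 2^6
```

So only the repeated-doubling version reaches roughly 2 million rows.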

jennybc commented 2 months ago

Here's the reprex:

# Replicate the openintro::nycflights dataset 6 times
library(openintro)
#> Loading required package: airports
#> Loading required package: cherryblossom
#> Loading required package: usdata
library(tidyverse)

dim(nycflights)
#> [1] 32735    16

dat <- rbind(
  nycflights, nycflights, nycflights,
  nycflights, nycflights, nycflights
)

# Create a meaningless binary outcome based on hour:
# data <- data %>% mutate(outcome = if_else(hour > 15, 0, 1))
dat <- dat |>
  mutate(outcome = if_else(hour > 15, 0, 1))

# Positron lags a lot when running a (glm) model in a data of 2mil rows and 18 columns.
dim(dat)
#> [1] 196410     17

# Run a random glm:
model <- glm(
  outcome ~ day + month + dep_time + air_time + distance,
  family = "binomial", 
  data = dat
)
#> Warning: glm.fit: algorithm did not converge
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# See the platform lagging (cannot scroll through code for a couple of seconds)

Created on 2024-09-11 with reprex v2.1.1.9000

alkat19 commented 2 months ago

I have updated my initial post with code to reproduce it.

jennybc commented 2 months ago

Here's a new reprex. The system.time() result seen here (about 11 seconds) matches what I'm seeing in both Positron and RStudio. And I don't see any lags in either IDE other than the time spent running the model. So far, I'm not able to reproduce.

# Replicate the openintro::nycflights dataset 6 times
library(openintro)
#> Loading required package: airports
#> Loading required package: cherryblossom
#> Loading required package: usdata
library(tidyverse)

dim(nycflights)
#> [1] 32735    16

dat <- do.call(rbind, replicate(n = 64, nycflights, simplify = FALSE))

# Create a meaningless binary outcome based on hour:
# data <- data %>% mutate(outcome = if_else(hour > 15, 0, 1))
dat <- dat |>
  mutate(outcome = if_else(hour > 15, 0, 1))

# Positron lags a lot when running a (glm) model in a data of 2mil rows and 18 columns.
dim(dat)
#> [1] 2095040      17

# Run a random glm:
system.time(
  model <- glm(
    outcome ~ day + month + dep_time + air_time + distance,
    family = "binomial",
    data = dat
  )
)
#> Warning: glm.fit: algorithm did not converge
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#>    user  system elapsed 
#>   9.387   1.384  10.837
# See the platform lagging (cannot scroll through code for a couple of seconds)

Created on 2024-09-11 with reprex v2.1.1.9000

alkat19 commented 2 months ago

https://github.com/user-attachments/assets/25dd3135-6fcd-49eb-81dc-dacf35bb9ca1

I cannot attach files larger than 10 MB, unfortunately, so I had to make a short recording. You can see that I cannot scroll after the red button disappears. Other operations after that point lag as well. For example, running the following after the model fit also lags, and I am again unable to scroll properly:

dat <- dat |>
  mutate(outcome2 = if_else(hour > 13, 0, 1))

Can you also provide a screen recording where I can see the ability to scroll through code seamlessly after the model is run?

jennybc commented 2 months ago

"the ability to scroll through code seamlessly"

OK this is interesting -- to look at scrolling specifically. I'm talking about scrolling in the R Console.

I do see peculiar scrolling behaviour once I've fitted the model (feeling stuck, feeling slow). It feels like that begins when the model object first populates into the SESSION pane (but then of course that also coincides with the model object existing in the first place). But the environment viewer (in, e.g., RStudio) has a lot of potential for doing unsavory things in the presence of a large object.
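
To get a feel for how large a fitted glm object can be relative to its input, here is a rough sketch on simulated data. The data, sizes, and variable names are assumptions for illustration, not the nycflights example from this thread, and it says nothing about what Positron actually does with the object. By default `glm()` keeps the model frame, the response, and the `data` argument in the returned object; `model = FALSE` and `y = FALSE` are documented arguments that drop the first two. Note that `object.size()` is an estimate and does not account for memory shared between objects.

```r
# Simulated stand-in data (hypothetical; much smaller than the 2 million rows in the thread)
set.seed(42)
n <- 20000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.3 * dat$x2))

# Default fit: stores the model frame, the response, and the data argument
full <- glm(outcome ~ x1 + x2, family = "binomial", data = dat)

# Slimmer fit: skip storing the model frame and the response vector
slim <- glm(outcome ~ x1 + x2, family = "binomial", data = dat,
            model = FALSE, y = FALSE)

size_dat  <- as.numeric(object.size(dat))
size_full <- as.numeric(object.size(full))
size_slim <- as.numeric(object.size(slim))

# Reported sizes in bytes (estimates that ignore sharing between objects)
c(data = size_dat, full_fit = size_full, slim_fit = size_slim)

# The five largest components of the default fit
head(sort(sapply(full, object.size), decreasing = TRUE), 5)
```

The default fit reports a larger size than the data frame it was fitted on, since the data frame is one of its components, so anything that walks the whole object has a lot of material to traverse.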

jennybc commented 2 months ago

Sounds similar to #4573

If I'm right about the variables pane, then maybe related to #2223

Maybe resembles #4008? Maybe connected to #2797?

alkat19 commented 2 months ago

The problem is that it becomes an issue when there are hundreds of code lines after the model fit since everything feels stuck or slow after that point. The behavior is not exhibited in RStudio or VSCode (which I currently use with Radian).

Thanks for looking into this.
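
One possible way to test the variables-pane hypothesis raised earlier in the thread is to remove the large model object after fitting and check whether scrolling recovers. This is only a diagnostic sketch on simulated data (the data and the `coefs` name are assumptions for illustration); whether it changes Positron's behavior is exactly what the experiment would show.

```r
# Simulated stand-in data (hypothetical; substitute the real `dat` when testing)
set.seed(1)
n <- 50000
dat <- data.frame(x = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(0.5 * dat$x))

# Fit the model; in the thread, the lag reportedly starts around this point
model <- glm(outcome ~ x, family = "binomial", data = dat)

# Keep only a small summary of the fit
coefs <- coef(model)

# Remove the large object from the global environment and release its memory
rm(model)
invisible(gc())

# Now try scrolling in the editor and console again
coefs
```

If scrolling becomes smooth again once `model` is gone, that would point toward the handling of large objects in the session rather than the model fit itself.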