tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

read_html_live sessions do not close and accumulate causing memory to crash #422

Closed alireza5969 closed 2 weeks ago

alireza5969 commented 2 months ago

Dear {{tidyverse}} / {{rvest}} community,

I'm not sure if this is a bug or a problem that I cannot find a solution for.

I am trying to read about 1000 pages with read_html_live() in a for loop. Naturally, I expect each page / session (I'm sorry if I'm not using the correct technical term) to be closed when a new one is opened. However, after a while, once the machine has read 50-100 pages, it runs out of memory and crashes.

When I look at Task Manager, I see that Chrome is consuming a large share of the memory (see image below).

image

FYI, this is the code that I'm using:

library(rvest)

# `list` (a table with a "url" column) and `strt_n` are defined elsewhere.
for (i in strt_n:nrow(list)) {
    print(i)

    page <- NA
    attempts <- 0

    while (!is.environment(page)) {

        attempts <- attempts + 1
        page <- tryCatch({
            read_html_live(list[[i, "url"]])
        },
        error = function(e) {
            print("error")
            if (attempts %% 10 == 0) {beepr::beep()}
            Sys.sleep(3)
            return(NA)
        })
    }

    Sys.sleep(2)

    content <- page %>% 
        html_elements(".oneLineText__oneLineText____Igu4") %>% 
        html_text2()
}

Currently, my workaround is the following code, which runs at the end of every 100th iteration, but it makes the script very slow.


if (i %% 100 == 0) {
  system("taskkill /IM chrome.exe /F")
}
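
The retry loop in the script above keeps trying forever if a URL never loads. A minimal sketch of a bounded alternative follows. The helper name fetch_with_retry and its parameters are hypothetical, not part of rvest, and the page-reading function is passed in as an argument so the helper can be tried without a browser.

```r
# Hypothetical helper (not part of rvest): call `fetch(url)` up to
# `max_attempts` times, pausing `wait` seconds after each failure.
# Returns NULL if every attempt raises an error. A fetch function that
# itself returns NULL is treated as a failed attempt.
fetch_with_retry <- function(url, fetch, max_attempts = 5, wait = 3) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fetch(url), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
    # Only pause if another attempt is coming.
    if (attempt < max_attempts) {
      Sys.sleep(wait)
    }
  }
  NULL
}
```

With rvest this might be called as fetch_with_retry(list[[i, "url"]], read_html_live), skipping the URL when the result is NULL. If the returned object exposes its underlying chromote session (an assumption here, for example as page$session), closing that session explicitly once the content has been extracted may be a gentler alternative to killing every chrome.exe process.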

hadley commented 3 weeks ago

Do you see the problem with this simpler reprex?

library(rvest)

for (i in 1:100) {  
  page <- read_html_live("https://hadley.nz")
}

Does adding an explicit gc() at the end of each iteration make any difference?
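
Some background on why gc() is worth trying: in R, cleanup code can be attached to an object as a finalizer, which runs only once the object has been garbage-collected. If a live page's browser session were closed that way (an assumption here, not something stated in this thread), tabs would linger until the next collection. A minimal base R sketch of the mechanism, using hypothetical fake sessions instead of real browser tabs:

```r
# Counter kept in its own environment so the finalizer can update it.
state <- new.env()
state$closed <- 0L

# Stand-in for "close the browser tab"; here it only bumps the counter.
close_fake_session <- function(session) {
  state$closed <- state$closed + 1L
}

# Hypothetical stand-in for a live page: an environment whose cleanup is
# registered as a finalizer instead of being called explicitly.
open_fake_session <- function() {
  session <- new.env()
  reg.finalizer(session, close_fake_session)
  session
}

for (i in 1:5) {
  # Each assignment makes the previous fake session unreachable,
  # but its finalizer has not necessarily run yet.
  page <- open_fake_session()
}

rm(page)
invisible(gc())  # collecting unreachable objects runs their finalizers

print(state$closed)  # number of fake sessions cleaned up
```

After the final gc() call, all five fake sessions should have been cleaned up. With real pages, the analogous experiment is calling gc() at the end of each iteration of the loop above.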

alireza5969 commented 3 weeks ago

Thank you for your response. Yes, the issue is reproducible with the code you provided. No, gc() did not help with the issue. Here is a screenshot of the Windows Task Manager after 100 iterations of the for loop:

image

sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Tehran
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_1.0.4

loaded via a namespace (and not attached):
 [1] later_1.3.2       R6_2.5.1          fastmap_1.2.0     websocket_1.4.1   magrittr_2.0.3    glue_1.7.0        lifecycle_1.0.4   ps_1.7.6          xml2_1.3.6       
[10] promises_1.3.0    cli_3.6.3         processx_3.8.4    chromote_0.3.1    compiler_4.3.3    httr_1.4.7        rstudioapi_0.16.0 tools_4.3.3       Rcpp_1.0.12      
[19] rlang_1.1.4       jsonlite_1.8.8   

hadley commented 3 weeks ago

I bet this is going to be a Windows-specific problem 😞

wch commented 3 weeks ago

@alireza5969 In the latest screenshot, I see it says "Google Chrome (86)". I don't have a Windows machine handy -- does that mean there are 86 tabs/windows open? It may be counting your regular (visible) tabs in that number.

Also, I think 1.7GB is not actually a lot of memory for Chrome to consume when you have multiple tabs open.

If that 86 does represent the number of open tabs and you do not have that many visible open tabs, then it may be the case that rvest is opening many tabs and not closing them right away.

Can you check if that number is the number of open tabs -- does it increase when you open a new tab? And also check how many visible tabs you have, as opposed to the invisible headless ones created by rvest/chromote.

hadley commented 3 weeks ago

@alireza5969 could you please try installing pak::pak("r-lib/rvest#429"), restarting R, and then seeing if the problem goes away?

alireza5969 commented 3 weeks ago

@wch

> In the latest screenshot, it shows "Google Chrome (86)." I don’t have access to a Windows machine right now -- does this mean there are 86 tabs/windows open? It might be counting your regular (visible) tabs in that figure.

To be honest, I’m not entirely sure what this indicates! After a fresh session following a restart, when I visit this page with Chrome, I see varying counts (like 14 or 22). When I open a new tab (for instance, google.com), the number jumps to between 19 and 27. So, I suspect it’s not accurately reflecting active or visible tabs.

> Also, I think that 1.7GB isn’t a lot of memory usage for Chrome with multiple tabs open.

I agree, it’s not. However, it’s currently using 73% of my memory (and I actually have decent RAM!). But for my tasks, I sometimes need to scrape over 5K webpages! That’s when it really becomes a concern.

> rvest is opening multiple tabs and not closing them immediately.

Yes, I believe that’s the case.

> Does the memory usage increase when you open a new tab?

Yes, it goes up with the number of open tabs (or potentially with the workload).

All the examples above were with Chrome, without using rvest.


I'm sorry, @hadley! I can’t install pak::pak("r-lib/rvest#429") because I’m encountering the following error:

Error:                                     
! error in pak subprocess
Caused by error: 
! Could not solve package dependencies:
* r-lib/rvest#429: ! pkgdepends resolution error for
r-lib/rvest#429.

However, I was able to install it with this one instead: pak::pak("tidyverse/rvest#429")

This is what it looks like when I run:

for (i in 1:100) {  
  print(i)
  page <- read_html_live("https://hadley.nz")
}

image

I think you did it @hadley 👏🏻😌 Thanks a lot!

hadley commented 2 weeks ago

Oops, sorry for the wrong org name, and thanks for verifying that the fix works!