Closed alireza5969 closed 2 weeks ago
Do you see the problem with this simpler reprex?
library(rvest)
for (i in 1:100) {
  page <- read_html_live("https://hadley.nz")
}
Does adding an explicit gc() at the end of each iteration make any difference?
Thank you for your response.
Yes, the issue is reproducible with the code you provided.
No, gc() did not help with the issue.
Here is a screenshot of the Windows Task Manager after 100 iterations of the for loop:
sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Asia/Tehran
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_1.0.4
loaded via a namespace (and not attached):
[1] later_1.3.2 R6_2.5.1 fastmap_1.2.0 websocket_1.4.1 magrittr_2.0.3 glue_1.7.0 lifecycle_1.0.4 ps_1.7.6 xml2_1.3.6
[10] promises_1.3.0 cli_3.6.3 processx_3.8.4 chromote_0.3.1 compiler_4.3.3 httr_1.4.7 rstudioapi_0.16.0 tools_4.3.3 Rcpp_1.0.12
[19] rlang_1.1.4 jsonlite_1.8.8
I bet this is going to be a Windows-specific problem 😞
@alireza5969 In the latest screenshot, I see it says "Google Chrome (86)". I don't have a Windows machine handy -- does that mean there are 86 tabs/windows open? It may be counting your regular (visible) tabs in that number.
Also, I think 1.7GB is not actually a lot of memory for Chrome to consume when you have multiple tabs open.
If that 86 does represent the number of open tabs and you do not have that many visible open tabs, then it may be the case that rvest is opening many tabs and not closing them right away.
Can you check if that number is the number of open tabs -- does it increase when you open a new tab? And also check how many visible tabs you have, as opposed to the invisible headless ones created by rvest/chromote.
@alireza5969 could you please try installing pak::pak("r-lib/rvest#429"), restarting R, and then seeing if the problem goes away?
@wch
> In the latest screenshot, it shows "Google Chrome (86)." I don’t have access to a Windows machine right now. Does this mean there are 86 tabs/windows open? It might be counting your regular (visible) tabs in that figure.
To be honest, I’m not entirely sure what this indicates! After a fresh session following a restart, when I visit this page with Chrome, I see varying counts (like 14 or 22). When I open a new tab (for instance, google.com), the number jumps to between 19 and 27. So, I suspect it’s not accurately reflecting active or visible tabs.
> Also, I think that 1.7GB isn’t a lot of memory usage for Chrome with multiple tabs open.
I agree, it’s not. However, it’s currently using 73% of my memory (and I actually have decent RAM!). But for my tasks, I sometimes need to scrape over 5K webpages! That’s when it really becomes a concern.
> rvest is opening multiple tabs and not closing them immediately.
Yes, I believe that’s the case.
> Does the memory usage increase when you open a new tab?
Yes, it goes up with the number of open tabs (or potentially with the workload).
All the examples above were with Chrome alone, without using rvest.
I'm sorry, @hadley! I can’t install pak::pak("r-lib/rvest#429") because I’m encountering the following error:
Error:
! error in pak subprocess
Caused by error:
! Could not solve package dependencies:
* r-lib/rvest#429: ! pkgdepends resolution error for
r-lib/rvest#429.
However, I was able to install it with this instead: pak::pak("tidyverse/rvest#429")
This is what it looks like when I run:
for (i in 1:100) {
  print(i)
  page <- read_html_live("https://hadley.nz")
}
I think you did it @hadley 👏🏻😌 Thanks a lot!
Oops, sorry for the wrong org name, and thanks for verifying that the fix works!
Dear {{tidyverse}}/{{rvest}} community, I'm not sure if this is a bug or a problem that I cannot find the solution for.
I am trying to read about 1000 pages with read_html_live() in a for loop. Naturally, I expect each page / session (I'm sorry if I'm not using the correct technical term) to be closed when a new one is opened. However, after a while, when the machine has read 50-100 pages, it runs out of memory. When I look at Task Manager, I see that Chrome is using up almost all of the memory (see image below).
FYI, this is the code that I'm using:
Currently, my workaround is this code, which I run after every 100 iterations. But it makes the script very slow.
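For readers on a version of rvest without the fix, one possible alternative is to close each page's browser session explicitly as soon as the data has been extracted. This is only a minimal sketch, not the workaround referred to above: read_and_close() is a hypothetical helper (not part of rvest), and it assumes the object returned by read_html_live() exposes its underlying chromote session as page$session with a close() method.

```r
# Hypothetical helper (not part of rvest): read one page, extract what is
# needed, and close the underlying browser session before returning.
read_and_close <- function(url, extract, read_page = rvest::read_html_live) {
  page <- read_page(url)
  # Assumption: `page$session` is the chromote session behind the live page
  # and `$close()` shuts it down. on.exit() runs even if extract() errors.
  on.exit(page$session$close(), add = TRUE)
  extract(page)
}

# Example usage; needs rvest, chromote, and a local Chrome installation,
# so it is only run in an interactive session.
if (interactive()) {
  urls <- rep("https://hadley.nz", 3)
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- read_and_close(urls[i], function(page) {
      rvest::html_text2(rvest::html_elements(page, "title"))
    })
  }
}
```

Because the session is closed before the function returns, extract() has to return plain R values (text, tables, and so on) rather than the live page itself.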