tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

read_html_live() memory "leak" #408

Open rdelrossi opened 3 months ago

rdelrossi commented 3 months ago

I'm experimenting with read_html_live() using Vivaldi as the headless browser (I try to avoid installing Google Chrome).

It works great. But I've noticed that it's leaving some chromote cruft behind. Each time I run my R script, the macOS Activity Monitor shows a new instance of "Vivaldi Helper (Renderer)" that costs about 500 MB of memory and grows from there. Run the script too many times and, naturally, the whole computer grinds to a halt. When I force-kill the processes in Activity Monitor, the R console reports "[error] handle_read_frame error: websocketpp.transport:7 (End of File)"

I'm not sure if I'm missing some kind of clean-up step, or if this is a Vivaldi problem, or if this is an rvest bug, but wanted to let you know, @hadley.

Later: I've noticed that pairing payload <- rvest::read_html_live(url) with payload$session$close() addresses the problem I'm describing (i.e., the "Vivaldi Helper (Renderer)" disappears from the Activity Monitor). Apologies if I missed the need for doing this in the docs.
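For anyone landing here, this is roughly how that pairing could look inside a function. It is only a sketch: scrape_headings and its css argument are hypothetical names, not part of rvest, and it assumes payload$session$close() behaves as described above. Registering the close with on.exit() means the browser session is shut down even if the extraction step errors.

```r
# Hypothetical helper (not part of rvest): scrape one page and always close
# the browser session, even if the extraction step throws an error.
scrape_headings <- function(url, css = "h1") {
  payload <- rvest::read_html_live(url)
  # Register the clean-up right after the session is created.
  on.exit(payload$session$close(), add = TRUE)

  headings <- rvest::html_elements(payload, css)
  rvest::html_text2(headings)
}

# Usage (needs a Chromium-based browser that chromote can find):
# scrape_headings("https://rvest.tidyverse.org")
```

Calling the helper in a loop should then leave at most one renderer process alive at a time, assuming the close call succeeds.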

hadley commented 3 months ago

If you rm(payload), then the garbage collector should close down the process a bit later.

rdelrossi commented 3 months ago

I'll do that, thanks.

rcepka commented 2 months ago

Just want to add my own experience. I am on Windows 10, and using read_html_live() caused memory consumption to grow gradually until the computer eventually crashed. With each new page loaded with read_html_live(), I could watch a new Chrome task appear within the RStudio group in the Task Manager app, with memory usage rising until all of the computer's memory was consumed. Deleting the process did not help; I tried both methods mentioned in the posts above: `rm(page)` and `page$session$close()`.

Finally I ended up using the Selenider package to do the job. I found this to be an excellent tool. It even closes the session automatically after returning from the function I created to load the web page and extract the data from it.

hadley commented 2 months ago

@rcepka can you provide any more details?