ropensci / RSelenium

An R client for Selenium Remote WebDriver
https://docs.ropensci.org/RSelenium

Chromedriver stalls after a few hundred pageloads #250

Open chriscarrollsmith opened 3 years ago

chriscarrollsmith commented 3 years ago

Windows 10 Home 64 bit

RSelenium package version 1.7.7 installed through CRAN

My Chrome version is 91.0.4472.106, but I am using version 91.0.4472.101 of chromedriver, installed through wdman

If I run a prolonged scrape (a few hundred page loads), my script simply stalls. No error message, no timeout, it just hangs until I manually stop the process. It seems like RSelenium is somehow losing its connection to the driver, because I am thereafter unable to do anything with the driver, including close it, short of closing RStudio and starting a new session. The problem is apparently with chromedriver rather than RSelenium, because I have switched to geckodriver/Firefox and that combination is working fine.

This problem arose in just the last couple of weeks. It began immediately after I had to change the "chromever" parameter in my rsDriver() call, presumably because of an auto-update of either chromedriver or Chrome. I'm not personally looking for an immediate fix, since geckodriver is working fine for me, but I thought I'd put this here as a starting point for troubleshooting should others experience the same issue.

chriscarrollsmith commented 3 years ago

Additional information on this error:

It seems to happen after a more or less random number of page loads. The error can occur relatively quickly, after just a few.

The error occurs on version 91.0.4472.114 of chromedriver as well as on 91.0.4472.101.

Attempting to close the driver returns this "chrome not reachable" error:

Selenium message:chrome not reachable (Session info: chrome=91.0.4472.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-GIPJE7Q', ip: '10.193.89.149', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_291'
Driver info: driver.version: unknown

chriscarrollsmith commented 2 years ago

Now experiencing a similar issue with geckodriver. Problem seems to become more frequent with continued use. Makes me wonder if there's some kind of accumulating problem on the backend, like a cache filling up a little bit every time I use the driver.

mlane3 commented 2 years ago

@chriscarrollsmith I think you are onto something. Not sure if this would help you today, but I did want to at least mention something for people who have a similar issue. The short answer...

Problem seems to become more frequent with continued use. Makes me wonder if there's some kind of accumulating problem on the backend, like a cache filling up a little bit every time I use the driver.

The problem is exactly what you think it is. You are overfilling them(*). Chrome has some faults in its cache and memory limits, and you can only run so many Selenium scripts at the same time with the task scheduler before you have to start manually killing ports. The Fulton County SPMO team's best guess is that around 8-16 scripts can be run simultaneously before errors start happening. We have a monstrosity of a VM that is too big to fit in a docker container (it's 1 TB of disk space total, and yes, we clean 25 to 100 GB daily). I shall break this up into several posts to go over the three situations.

(*) By "them" I mean: the duration the script runs while waiting for the page to load, the number of available ports for java.exe processes, chromedriver memory, chromedriver cache, and the disk space where your driver is stored.

P.S. If you're wondering why the heck an organization would want a 1 TB docker container, no one talks about the price of free open data.

mlane3 commented 2 years ago

Fixing waiting for the page to load and fixing the duration the script runs

You need to add wait time after each command in Selenium using Sys.sleep()

RSelenium still uses, I believe, Selenium 2 (or 3), so it does not automatically detect page loads; that is a newer feature. Typically we add Sys.sleep(3) after every line of a script, and about Sys.sleep(10) after the first page load. Given the random nature of your stalls, I suspect this is the primary reason your script is erroring out.
Generally, adding sleep commands is a best practice in most web scraping.
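A minimal sketch of that pattern (the URL, CSS selector, and port here are placeholders, not anything from the original script):

library(RSelenium)

# start a driver; browser and port are just example values
driver <- rsDriver(browser = "firefox", port = 4567L, verbose = FALSE)
remDr  <- driver$client

remDr$navigate("https://example.com")          # hypothetical first page
Sys.sleep(10)                                  # longer wait for the first page load

for (i in 1:20) {
  nextLink <- remDr$findElement("css selector", "a.next")  # hypothetical selector
  nextLink$clickElement()
  Sys.sleep(3)                                 # wait after every command
}

remDr$close()
driver$server$stop()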

Why do you need sleep commands? If no one has ever told you... the truth is that most fraud and illegal activity of a corporation, nonprofit, or government cannot be found in an API or open dataset. You have to web scrape and collect it from webpages or documents. The problem is most organizations go to great lengths to protect themselves from web scraping and automation. Even people like me who have legal backing to web scrape their websites have to make sure the bots are not detected. So you have to slow down your bot. There is a good lecture by a professor at my old university about the topic that you can find on his course page: https://poloclub.github.io/cse6242-2022fall-campus/

The total test run time cannot exceed 5 hours and 45 minutes. If you're going to have a longer script, break the code up into two scripts.

So you want to know the downside of big data? Having to wait 6 hours to download 100 GB of constantly changing data, then remove all the sensitive information and provide it to the general public as a small 20 MB file. Needless to say, my organization has found that while you can make each line of your Selenium script take as long as you want, you generally should not have a run that lasts longer than 5 hours.

mlane3 commented 2 years ago

Fixing errors with "the number of available ports for java.exe/chrome.exe processes"

Selenium message:chrome not reachable

This is the error message of a runaway(*) Selenium webdriver that you would have to close using Task Manager or PowerShell in Windows. Any time there is a network outage or someone decides to push an IT upgrade or whatnot, you have to make sure it closes. In my organization, it is a cybersecurity risk if any automated process does not automatically close after erroring out.
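For a one-off manual cleanup, a rough equivalent of the Task Manager route from inside R on Windows (my sketch, assuming the stock taskkill utility and the default process names; it is heavy-handed because it kills every matching process):

# force-kill orphaned driver/browser processes left behind by a stalled session
system("taskkill /F /T /IM chromedriver.exe")
system("taskkill /F /T /IM geckodriver.exe")
system("taskkill /F /T /IM java.exe")   # the selenium server, assuming nothing else on the box uses java.exe

The clean_up() function below does the same job programmatically and more surgically, by walking the driver's own process tree: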

library(ps)   # for ps_children(), ps_kill(), ps_is_running()

clean_up <- function(driver) {
  # driver = the Selenium/Chrome driver that has been saved as a variable in your
  # script's Global Environment, usually created with docker or rsDriver()
  driver$client$close()     # close the browser session (chrome.exe and chromedriver.exe)
  child_proc <- ps_children(driver$server$process$as_ps_handle())
  for (p in child_proc) {   # ensure Windows processes spawned by the selenium server are terminated (java.exe and conhost.exe)
    ps_kill(p)
    ps_is_running(p)        # returns FALSE once the child has actually died
  }
  driver$server$stop()      # stop the Selenium server (terminates cmd.exe)
  driver$server$process$kill_tree()
  rm(driver)
  gc()
}

get_free_port <- function(test_ports = seq(1000L, 2000L)) {
  # parse the local-address column of netstat to find ports already in use
  used_ports_df <- data.table::fread(cmd = "netstat -aon -p tcp", skip = 1, header = FALSE)
  used_ports    <- as.numeric(sub(".*:", "", used_ports_df$V2))
  free_ports    <- setdiff(test_ports, used_ports)
  return(as.integer(free_ports[1]))
}
  1. Every call to rsDriver() uses get_free_port(), so there are no port conflicts between two RSelenium scripts.
  2. We run every Selenium line inside a tryCatch() whose error handler calls clean_up() (a sketch of this pattern follows below). That way it prints the error and then automatically cleans up the driver.
  3. Every script ends with clean_up(), so that every process started by RSelenium closes. This way we can call other proprietary government software and more as needed for the Selenium web scraping.
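A minimal sketch of that tryCatch() pattern, assuming the clean_up() and get_free_port() helpers above and a made-up target URL:

port   <- get_free_port()
driver <- RSelenium::rsDriver(browser = "chrome", port = port, verbose = FALSE)

tryCatch({
  driver$client$navigate("https://example.com")      # hypothetical page
  Sys.sleep(3)
  page_source <- driver$client$getPageSource()[[1]]
  # ... scrape and save results here ...
}, error = function(e) {
  message("Selenium error: ", conditionMessage(e))   # print the error before cleanup
}, finally = {
  clean_up(driver)                                   # always close browser, server, and child processes
})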

(*) Like a runaway chemical reaction.

mlane3 commented 2 years ago

Fixing chromedriver memory, chromedriver cache, and the disk space where your driver is stored

Did you know that Chrome has a memory and disk space limit?

https://www.guidingtech.com/fix-google-chrome-out-of-memory-error-windows-10/#:~:text=Google%20has%20set%20a%20memory,4GB%20for%2064%2Dbit%20systems.

This is probably the least likely reason (seeing as Firefox gave you the same issue), but I have hit the limit before, so just in case. While you can scale it up a little bit, this is one disadvantage of using Chrome.

The fix is simply to split up your R script or webdriver calls into more manageable chunks. So instead of doing all 100 million pages in one run, maybe only do 100. Then save the work, close the Selenium driver, pause for a minute, and call the Selenium driver again, this time starting back up at the 101st page.
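A rough sketch of that chunking pattern (the page URLs, chunk size, and the clean_up()/get_free_port() helpers from my earlier comment are assumptions, not an exact recipe):

pages      <- sprintf("https://example.com/page/%d", 1:1000)   # hypothetical list of pages
chunk_size <- 100

for (start in seq(1, length(pages), by = chunk_size)) {
  chunk  <- pages[start:min(start + chunk_size - 1, length(pages))]
  driver <- RSelenium::rsDriver(browser = "chrome", port = get_free_port(), verbose = FALSE)

  results <- lapply(chunk, function(url) {
    driver$client$navigate(url)
    Sys.sleep(3)                                     # wait for the page to load
    driver$client$getPageSource()[[1]]
  })

  saveRDS(results, sprintf("scrape_%05d.rds", start))  # save the work for this chunk
  clean_up(driver)                                   # close the driver and every child process
  Sys.sleep(60)                                      # pause for a minute before restarting
}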

P.S. If this is Twitter data, then unfortunately Twitter and Facebook are jerks about scraping.