ucsdscheduleplanner / UCSD-Schedule-Planner

A project to help UCSD students plan their schedules quickly and easily.
https://sdschedule.com/
MIT License

#33: Fix timeout exceptions when scraping #35

Closed · dmhacker closed this 5 years ago

dmhacker commented 5 years ago

See #33 for a description of the error(s).

There are two main fixes encapsulated in these commits:

  1. Better error handling across all threads. When one thread encounters a fatal exception (e.g. an error that wasn't caught during parsing), it immediately notifies all other threads that they should terminate, and the scraper exits with an error. It also prints the stack trace of the error to stderr rather than letting the thread fail unceremoniously (see the first sketch after this list).
  2. Retrying on page timeouts. If Selenium times out while trying to GET a page, we retry up to some maximum number of times rather than quitting the program immediately. Each thread makes at most 10 consecutive attempts to download a page; if all of those retries fail, the thread exits gracefully and signals to the scraper that it has crashed, and the other threads are killed as well (see the second sketch below).
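A minimal sketch of what the cross-thread error handling could look like. The two method names match the traceback below, but the `Scraper` shape, the shared `threading.Event`, and the `scrape_department` helper are illustrative assumptions, not the actual code in this PR:

```python
import sys
import threading
import traceback

class Scraper:
    def __init__(self, departments):
        self.departments = departments
        # Shared flag: set by the first thread that hits a fatal error,
        # polled by every other thread so they can stop promptly.
        self.crashed = threading.Event()

    def iter_departments_by_thread_handle_errors(self, thread_id, num_threads):
        try:
            self.iter_departments_by_thread(thread_id, num_threads)
        except Exception:
            # Print the full stack trace to stderr instead of letting
            # the thread die silently.
            print("Error encountered by thread %d. Gracefully exiting ..."
                  % thread_id, file=sys.stderr)
            traceback.print_exc(file=sys.stderr)
            # Tell all other threads to terminate.
            self.crashed.set()

    def iter_departments_by_thread(self, thread_id, num_threads):
        for department in self.departments[thread_id::num_threads]:
            if self.crashed.is_set():
                print("Thread %d is exiting gracefully ..." % thread_id)
                return
            self.scrape_department(department)  # assumed helper
```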
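Likewise, a sketch of the retry loop around page loads. `TimeoutException` is what Selenium raises when a page load exceeds the configured timeout; `MAX_RETRIES` and `fetch_page` are assumed names for illustration:

```python
from selenium.common.exceptions import TimeoutException

MAX_RETRIES = 10  # maximum consecutive attempts per page

def fetch_page(browser, url):
    """GET a page, retrying on timeouts instead of dying on the first one."""
    for attempt in range(MAX_RETRIES):
        try:
            browser.get(url)
            return
        except TimeoutException:
            print("Timed out fetching %s (attempt %d/%d), retrying ..."
                  % (url, attempt + 1, MAX_RETRIES))
    # All retries failed; propagate so the caller can mark the scraper
    # as crashed and the other threads get killed too.
    raise TimeoutException("Gave up on %s after %d attempts"
                           % (url, MAX_RETRIES))
```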
CTrando commented 5 years ago

I'm going to pull this and test whether it works. Ideally we need some kind of automated testing for this stuff to verify that it works (even if you know it works, it's good so others can verify it too).

CTrando commented 5 years ago

```
sdschedule-backend | Error encountered by thread 1. Gracefully exiting ...
sdschedule-backend | Traceback (most recent call last):
sdschedule-backend |   File "/app/scraper/scraper.py", line 118, in iter_departments_by_thread_handle_errors
sdschedule-backend |     self.iter_departments_by_thread(thread_id, num_threads)
sdschedule-backend |   File "/app/scraper/scraper.py", line 137, in iter_departments_by_thread
sdschedule-backend |     browser = webdriver.Chrome(chrome_options=options, executable_path=DRIVER_PATH)
sdschedule-backend |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
sdschedule-backend |     desired_capabilities=desired_capabilities)
sdschedule-backend |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
sdschedule-backend |     self.start_session(capabilities, browser_profile)
sdschedule-backend |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
sdschedule-backend |     response = self.execute(Command.NEW_SESSION, parameters)
sdschedule-backend |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
sdschedule-backend |     self.error_handler.check_response(response)
sdschedule-backend |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
sdschedule-backend |     raise exception_class(message, screen, stacktrace)
sdschedule-backend | selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
sdschedule-backend |   (Driver info: chromedriver=2.40.565383 (76257d1ab79276b2d53ee976b2c3e3b9f335cde7),platform=Linux 4.15.0-29-generic x86_64)
sdschedule-backend |
sdschedule-backend | [T0] Saving ECE (#11) to /cache/course_pages/ECE/11.html ...
sdschedule-backend | Thread 6 is exiting gracefully ...
sdschedule-backend | Thread 3 is exiting gracefully ...
sdschedule-backend | Thread 4 is exiting gracefully ...
sdschedule-backend | Thread 5 is exiting gracefully ...
sdschedule-backend | Thread 7 is exiting gracefully ...
sdschedule-backend | Thread 0 is exiting gracefully ...
sdschedule-backend | Thread 2 is exiting gracefully ...
sdschedule-backend | The scraper has crashed. Please retry.
```

It handled the error correctly. I increased the latency on Chrome to simulate a bad connection, but it seems the crash was unrelated to latency: the Chrome driver itself failed to start.

Regardless, this looks good, so I'll accept it. You could fix the Chrome driver failing to start with a wrapper method if you want, but I don't think it's a huge concern.
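For what it's worth, such a wrapper might look like the following. This is a hypothetical sketch rather than code from this PR; it reuses the same `webdriver.Chrome(...)` call that appears in the traceback above:

```python
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def start_browser(options, driver_path, max_attempts=3):
    """Start Chrome with a few retries, since the driver can fail to
    launch for transient reasons unrelated to network latency."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return webdriver.Chrome(chrome_options=options,
                                    executable_path=driver_path)
        except WebDriverException as error:
            last_error = error
    raise last_error
```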