spinlud / py-linkedin-jobs-scraper


Selenium exception: session deleted because of page crash #48

Open aalloul opened 1 year ago

aalloul commented 1 year ago

Hi there,

I've been trying to use your scraper in a Docker image to scrape jobs, but I'm running into what looks like a memory leak. Here's the code I'm using:

from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters

# chrome_driver_path and jobs_to_scrape are defined elsewhere in my code
scraper = LinkedinScraper(
    chrome_executable_path=chrome_driver_path,
    chrome_options=None,   # Custom Chrome options here
    headless=True,         # Overrides headless mode only if chrome_options is None
    max_workers=1,         # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
    slow_mo=1,             # Slow down the scraper to avoid 'Too many requests 429' errors (in seconds)
    page_load_timeout=20,  # Page load timeout (in seconds)
)

query = Query(
    query="software engineer",
    options=QueryOptions(
        locations=['Netherlands'],
        apply_link=False,  # Try to extract the apply link (easy applies are skipped). Defaults to False.
        limit=jobs_to_scrape,
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
            type=[TypeFilters.FULL_TIME, TypeFilters.TEMPORARY],
        ),
    ),
)

# ... register functions to handle the Events (see sketch below)
scraper.run(query)
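
The event handling elided above looks roughly like this (the handler names are just placeholders of mine; the on(...) registrations follow the library's Events/EventData API and are wired up before scraper.run):

from linkedin_jobs_scraper.events import Events, EventData

def on_data(data: EventData):
    # Called once per scraped job
    print('[DATA]', data.title, data.company, data.date, data.link)

def on_error(error):
    print('[ERROR]', error)

def on_end():
    print('[END]')

scraper.on(Events.DATA, on_data)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)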

When I run this for 1 job, everything goes well. If I run it for 15 jobs (for example), the memory usage in the container grows until it reaches 500 MB (the limit I set). At that point, I get the following exception:

2022-11-08 16:43:41,895 ('[software engineer][Netherlands]', InvalidSessionIdException('invalid session id', None, None))
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/linkedin_jobs_scraper/strategies/authenticated_strategy.py", line 403, in run
    load_result = AuthenticatedStrategy.__load_job_details(driver, job_id)
  File "/usr/local/lib/python3.8/site-packages/linkedin_jobs_scraper/strategies/authenticated_strategy.py", line 101, in __load_job_details
    loaded = driver.execute_script(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 634, in execute_script
    return self.execute(command, {
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=107.0.5304.87)

The RAM usage then goes back down to a "normal" 30 MB, but CPU usage remains high (probably chrome/chromedriver/selenium is still doing something in the background).

Any ideas how I could fix this without having to use more RAM?
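
In case it helps, here's a minimal sketch of custom Chrome options I'm considering instead of chrome_options=None (assuming the chrome_options parameter accepts a standard Selenium Options object, as the "Custom Chrome options here" comment suggests). Docker's small default /dev/shm (64 MB) is a common cause of "session deleted because of page crash", and --disable-dev-shm-usage makes Chrome fall back to /tmp; whether this actually resolves the crash in my setup is untested:

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')              # keep running headless inside the container
chrome_options.add_argument('--no-sandbox')            # often needed when Chrome runs as root in Docker
chrome_options.add_argument('--disable-dev-shm-usage') # write shared memory to /tmp instead of the small /dev/shm
chrome_options.add_argument('--disable-gpu')

scraper = LinkedinScraper(
    chrome_executable_path=chrome_driver_path,
    chrome_options=chrome_options,  # instead of None
    headless=True,
    max_workers=1,
    slow_mo=1,
    page_load_timeout=20,
)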

spinlud commented 1 year ago

Hi, when you say 15 jobs do you mean 15 workers? Are you trying to run 15 queries in parallel?