Hi there,

I've been trying to use your scraper in a Docker image to scrape jobs, but I'm running into what looks like a memory leak. Here's the code I'm using:
from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.query import Query, QueryOptions, QueryFilters
from linkedin_jobs_scraper.filters import RelevanceFilters, TimeFilters, TypeFilters

scraper = LinkedinScraper(
    chrome_executable_path=chrome_driver_path,
    chrome_options=None,     # Custom Chrome options here
    headless=True,           # Overrides headless mode only if chrome_options is None
    max_workers=1,           # How many threads will be spawned to run queries concurrently (one Chrome driver for each thread)
    slow_mo=1,               # Slow down the scraper to avoid 'Too many requests 429' errors (in seconds)
    page_load_timeout=20     # Page load timeout (in seconds)
)

query = Query(
    query="software engineer",
    options=QueryOptions(
        locations=['Netherlands'],
        apply_link=False,    # Try to extract apply link (easy applies are skipped). Defaults to False.
        limit=jobs_to_scrape,
        filters=QueryFilters(
            relevance=RelevanceFilters.RECENT,
            time=TimeFilters.DAY,
            type=[TypeFilters.FULL_TIME, TypeFilters.TEMPORARY],
        )
    )
)

# ... add functions to handle the Events

scraper.run(query)
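The "# ... add functions to handle the Events" comment is where I register the event handlers (before scraper.run). Simplified, they look roughly like this; the real on_data just stores the job data, and the names below are placeholders:

from linkedin_jobs_scraper.events import Events, EventData

scraped_jobs = []  # placeholder container for results

def on_data(data: EventData):
    # Collect the scraped job (title, company, link, description, ...)
    scraped_jobs.append(data)

def on_error(error):
    print('[ON_ERROR]', error)

def on_end():
    print('[ON_END] query finished')

scraper.on(Events.DATA, on_data)
scraper.on(Events.ERROR, on_error)
scraper.on(Events.END, on_end)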
When I run this for 1 job, everything goes well. If I run it for 15 jobs (for example), the memory usage in the container grows until it reaches 500MB (the limit I set), at which point I get the following exception:
2022-11-08 16:43:41,895 ('[software engineer][Netherlands]', InvalidSessionIdException('invalid session id', None, None))
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/linkedin_jobs_scraper/strategies/authenticated_strategy.py", line 403, in run
load_result = AuthenticatedStrategy.__load_job_details(driver, job_id)
File "/usr/local/lib/python3.8/site-packages/linkedin_jobs_scraper/strategies/authenticated_strategy.py", line 101, in __load_job_details
loaded = driver.execute_script(
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 634, in execute_script
return self.execute(command, {
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: headless chrome=107.0.5304.87)
The RAM usage then drops back to a "normal" ~30MB, but CPU usage remains high (probably chrome/chromedriver/selenium is still doing stuff in the background).
Any ideas how I could fix this without having to use more RAM?
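One thing I was considering (but haven't verified) is passing custom Chrome options instead of chrome_options=None, with the flags usually recommended for Chrome inside Docker. I'm assuming here that chrome_options accepts a regular Selenium ChromeOptions object and that the flags aren't already set by the scraper. Would something along these lines be expected to help?

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')                # keep headless, since chrome_options overrides the headless flag
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')   # avoid the small /dev/shm inside the container
chrome_options.add_argument('--disable-gpu')

scraper = LinkedinScraper(
    chrome_executable_path=chrome_driver_path,
    chrome_options=chrome_options,   # instead of None
    headless=True,
    max_workers=1,
    slow_mo=1,
    page_load_timeout=20
)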