umihico / docker-selenium-lambda

The simplest demo of Chrome automation by Python and Selenium in AWS Lambda
MIT License

chrome not reachable #102

Closed · Convalytics closed this 1 year ago

Convalytics commented 2 years ago

This code works great, but I'm consistently getting "chrome not reachable" errors if I run another instance of my Lambda within a few minutes of a prior run. I've tried manually creating "tmp" folders for the three Chrome data folders as well as using mkdtemp().

For background: my processes run for about 1-2 minutes and access multiple sites. Some processes require that I download and upload files. I've been having this issue for several months, from Chrome 99 through 103. My current "solution" is to retry after a long wait period, hoping that the Lambda instance has been wiped out and the next process starts on a fresh/cold instance.

I'm hoping someone can take a look at my settings and guide me in the right direction.


from selenium import webdriver
import json
import base64
import os
import shutil
from tempfile import mkdtemp

# functions for printing web page to pdf:
def send_devtools(driver, cmd, params=None):
    # Call a Chrome DevTools Protocol command through chromedriver's
    # non-standard send_command_and_get_result endpoint.
    resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
    url = driver.command_executor._url + resource
    body = json.dumps({'cmd': cmd, 'params': params or {}})
    response = driver.command_executor._request('POST', url, body)
    # if response['status']:
    #     raise Exception(response.get('value'))
    return response.get('value')

def save_as_pdf(driver, path, options=None):
    # https://timvdlippe.github.io/devtools-protocol/tot/Page#method-printToPDF
    result = send_devtools(driver, "Page.printToPDF", options or {})
    with open(path, 'wb') as file:
        file.write(base64.b64decode(result['data']))

def enable_download_in_headless_chrome(driver, download_dir):
    # Add the missing "send_command" endpoint to the Selenium webdriver,
    # then allow headless Chrome to download into download_dir.
    # (getchrome() below inlines these same two calls.)
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    driver.execute("send_command", params)

def getchrome(s3key='temp'):
    # Set options for headless chrome:
    options = webdriver.ChromeOptions()
    options.binary_location = '/opt/chrome/chrome'
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1280x1696")
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-dev-tools")
    options.add_argument("--no-zygote")
    options.add_argument(f"--user-data-dir={mkdtemp()}")
    options.add_argument(f"--data-path={mkdtemp()}")
    options.add_argument(f"--disk-cache-dir={mkdtemp()}")
    # options.add_argument(f"--user-data-dir=/tmp/{s3key}/udd")
    # options.add_argument(f"--data-path=/tmp/{s3key}/dp")
    # options.add_argument(f"--disk-cache-dir=/tmp/{s3key}/dcd")
    options.add_argument("--remote-debugging-port=9222")

    options.add_experimental_option("prefs", {
        "download.default_directory": f"/tmp/{s3key}/downloads",
        "savefile.default_directory": f"/tmp/{s3key}/downloads",
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    })
    options.add_experimental_option('excludeSwitches', ['enable-logging'])

    # Selenium 3-style constructor; with Selenium 4, pass a Service('/opt/chromedriver') instead.
    driver = webdriver.Chrome('/opt/chromedriver', options=options)

    # Allow chrome to download files. Set download directory.
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
    params = {'cmd':'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': f'/tmp/{s3key}/downloads'}}
    driver.execute("send_command", params)

    return driver
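
For context, a minimal sketch of how these helpers might be wired into a Lambda handler (the handler body and the "job-123" key are hypothetical, not from my actual code):

```python
def handler(event, context):
    s3key = "job-123"  # illustrative value
    os.makedirs(f"/tmp/{s3key}", exist_ok=True)  # ensure the download root exists
    driver = getchrome(s3key=s3key)
    try:
        driver.get("https://example.com/")
        save_as_pdf(driver, f"/tmp/{s3key}/page.pdf")
    finally:
        driver.quit()  # always release Chrome so a warm container stays clean
    return {"statusCode": 200}
```
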
umihico commented 2 years ago

@Convalytics

Hello. Thank you for sponsoring! This is also a great bug report.

I expected that randomizing the data directories to avoid collisions would be enough, but I can offer my older solution: fully cleaning /tmp.

To share it, I created a demo branch. Could you try it and tell me how it goes? https://github.com/umihico/docker-selenium-lambda/commit/85dfc26f7b8af2281c8d215996d9a2869c264d05
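
For reference, a minimal sketch of what such a /tmp-flushing helper could look like (the exact implementation is in the linked commit and may differ):

```python
import os
import shutil

def flush_tmp(tmp_dir="/tmp"):
    # Delete everything under /tmp so a reused (warm) execution
    # environment starts with empty ephemeral storage.
    for entry in os.listdir(tmp_dir):
        path = os.path.join(tmp_dir, entry)
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
        except OSError:
            pass  # a lingering Chrome process may still hold the file
```

Call it at the top of the handler, before Chrome is launched.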

umihico commented 2 years ago

Possibly your S3 file sizes plus Chrome's tmp usage exceed 512 MB? In that case Chrome may crash. If the invocation is not a cold start, files from prior runs still count toward the limit, so usage can be larger than expected.
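
One quick way to check this theory (a hypothetical diagnostic, not part of the repo) is to log /tmp usage before launching Chrome:

```python
import shutil

def log_tmp_usage():
    # shutil.disk_usage returns total, used and free space in bytes.
    total, used, free = shutil.disk_usage("/tmp")
    print(f"/tmp: {used / 2**20:.0f} MB used of {total / 2**20:.0f} MB "
          f"({free / 2**20:.0f} MB free)")
```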

humphrey commented 2 years ago

Oh man! Thank you for the repository AND this issue!

> Possibly your S3 file sizes plus Chrome's tmp usage exceed 512 MB? In that case Chrome may crash. If the invocation is not a cold start, files from prior runs still count toward the limit, so usage can be larger than expected.

It took me a lot of brute-force testing to come to this conclusion. The code in this repository works perfectly, but as soon as I swapped out https://example.com/ for almost any other website, everything started crashing. Sometimes it took until the second run, but it would eventually crash.

Increasing the size to be larger than 512 MB fixed the problem, and I added your flush_tmp() function for good measure, since I'll be calling this many times.

It could be worth mentioning in the main README.md that Chrome will randomly crash until you use more than 512 MB of storage.

umihico commented 2 years ago

@humphrey

> but as soon as I swapped out https://example.com/ for almost any other website, everything started crashing.

Thank you for sharing! I didn't know that. Chrome's major version changes rapidly, so the latest versions may consume more storage than before.

I'll think about increasing the memory limit via the Serverless Framework config.
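
For anyone reading along, a sketch of what that could look like in serverless.yml (the function name and values are illustrative, and ephemeralStorageSize assumes a Serverless Framework version that supports Lambda's ephemeral-storage setting):

```yaml
functions:
  demo:
    memorySize: 2048            # MB of RAM
    ephemeralStorageSize: 1024  # MB of /tmp; the Lambda default is 512
```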

humphrey commented 2 years ago

@umihico

> Thank you for sharing! I didn't know that. Chrome's major version changes rapidly, so the latest versions may consume more storage than before.

Thanks! Also, I suspect that websites are running more and more bloated JavaScript, which uses more memory. The sites I needed to test take a while to load all of that into memory before they display.

I reckon the README.md could say something like:

> If you experience any errors such as Chrome crashing or not being available, you might need to increase the storage available to your Lambda function.

> I'll think about increasing the memory limit via the Serverless Framework config.

Because even if the value in the Serverless config was good enough, I added a couple of my own things (such as wrapping Selenium in a GraphQL API) that take up more storage.

umihico commented 1 year ago

@humphrey I'm sorry for keeping this issue open so long.

I added the note as you suggested and hope it helps others. Thank you again.

humphrey commented 1 year ago

👌

samkit-jain commented 10 months ago

@umihico Is it safe to flush the temporary storage? Since AWS Lambda reuses storage and multiple invocations run in parallel, wouldn't this cause unexpected issues?

For anyone facing this issue, I also had to remove the following option:

options.add_argument("--remote-debugging-port=9222")

Removing it gave me far fewer "chrome not reachable" errors.

umihico commented 10 months ago

@samkit-jain I don't think temporary storage is shared by multiple invocations running at the same time. Only after an invocation's process has completely ended can a new invocation reuse that storage. That is my understanding of AWS Lambda.

Also, I don't think the current code flushes storage. It just picks directory locations randomly, so every Lambda process gets its own clean space even when it inherits a previously used environment.
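
To illustrate (a trivial sketch, not repo code): mkdtemp() creates a brand-new, uniquely named directory on every call, so concurrent processes can never collide:

```python
from tempfile import mkdtemp

a = mkdtemp()  # e.g. /tmp/tmpa1b2c3
b = mkdtemp()  # e.g. /tmp/tmpx9y8z7
assert a != b  # every call yields a fresh, unique directory
```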

samkit-jain commented 10 months ago

@umihico Thanks 🙏🏻