ultrafunkamsterdam / undetected-chromedriver

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
https://github.com/UltrafunkAmsterdam/undetected-chromedriver
GNU General Public License v3.0
9.95k stars 1.16k forks

Multiprocessing Error returned when ran from Docker #752

Open Sagaryal opened 2 years ago

Sagaryal commented 2 years ago

Continuation of #740

I am opening this issue again because my comment on the previously closed ticket received no response. I am pasting the code and the problem here again.

you are trying to load plain html to json.

@ultrafunkamsterdam Sir, please check the edited code above once more. The problem is not caused by loading plain HTML as JSON: as you can see, even a plain string return does not work.

Furthermore, the root() function executes only when the API is called, but Chrome is initialized before that, when the module is loaded, and that initialization never completes.

As the error logs show, Before Chrome Initiasation ------> is printed but After Chrome Initiasation ------> is not, which means Chrome is not being initialized.

The error logs also point to the following line in your code:

File "/usr/local/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
    p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)

Nevertheless, I have edited the code and the error output above for your reference.

Please also note that THIS WORKS outside Docker.

The code below works fine when run locally in a virtualenv, but fails once dockerized. From my debugging, the error seems to come from driver = uc.Chrome(headless=True).

Python Versions tried: 3.10, 3.8

import json
from fastapi import FastAPI
import undetected_chromedriver.v2 as uc
from selenium.webdriver.common.by import By
from utils import Item

app = FastAPI()
print('Before Chrome Initiasation ------>')
driver = uc.Chrome(headless=True)
print('After Chrome Initiasation ------>')

@app.post("/")
def root(item: Item):
    # known url using cloudflare's "under attack mode"
    # driver.get(item.url)
    # html = driver.find_element(By.TAG_NAME, 'html').text

    return "This is return data"

Error:

Attaching to scrapper
scrapper  | Before Chrome Initiasation ------>
scrapper  | Process Process-1:
scrapper  | Traceback (most recent call last):
scrapper  |   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
scrapper  |     self.run()
scrapper  |   File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
scrapper  |     self._target(*self._args, **self._kwargs)
scrapper  |   File "/usr/local/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
scrapper  |     p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
scrapper  |   File "/usr/local/lib/python3.10/subprocess.py", line 969, in __init__
scrapper  |     self._execute_child(args, executable, preexec_fn, close_fds,
scrapper  |   File "/usr/local/lib/python3.10/subprocess.py", line 1720, in _execute_child
scrapper  |     and os.path.dirname(executable)
scrapper  |   File "/usr/local/lib/python3.10/posixpath.py", line 152, in dirname
scrapper  |     p = os.fspath(p)
scrapper  | TypeError: expected str, bytes or os.PathLike object, not NoneType

Dockerfile

FROM python:3.10

ENV PYTHONUNBUFFERED True

WORKDIR /app

COPY requirements.txt ./

RUN pip install -r requirements.txt

EXPOSE 8000

COPY . ./

CMD exec uvicorn main:app --host 0.0.0.0 --port 8000

blacksam07 commented 2 years ago

@Sagaryal Hi man, I ran into the same problem today. Did you find a solution?

Sagaryal commented 2 years ago

@Sagaryal Hi man, I ran into the same problem today. Did you find a solution?

@blacksam07 Fortunately yes, after some peeking into the source code and reading this comment and its reply.

Below is my working code. I have explained the details in a separate comment right after this one, to keep this solution short and precise. Hope it helps you too.

Dockerfile:

FROM python:3.10-alpine

ENV PYTHONUNBUFFERED True

# chromium is not found inside docker, so need to install it.
RUN apk add --update make gcc g++ libc-dev chromium chromium-chromedriver

WORKDIR /app

COPY requirements.txt ./

RUN pip install -r requirements.txt

EXPOSE 8000

COPY . ./

CMD exec uvicorn main:app --host 0.0.0.0 --port 8000

main.py:

import json
from fastapi import FastAPI
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from utils import Item

app = FastAPI()

'''
Even after installing chromedriver and the browser, it returns another bizarre error.
Turns out you need to pass the driver path.

chromedriver path: /usr/bin/chromedriver
chrome browser path: /usr/bin/chromium-browser (this is found automatically by the code)

Running this from a local venv will now fail; see the next comment for why.
'''
driver = uc.Chrome(headless=True, driver_executable_path='/usr/bin/chromedriver')

@app.post("/")
def root(item: Item):
    # known url using cloudflare's "under attack mode"
    driver.get(item.url)
    html = driver.find_element(By.TAG_NAME, 'html').text

    return json.loads(html)
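
Note that json.loads(html) only succeeds when the page body is itself JSON. A minimal sketch of a more forgiving return value, assuming you want the raw text back when parsing fails (parse_body is a hypothetical helper, not part of the original code):

```python
import json


def parse_body(text):
    """Return parsed JSON if possible, otherwise wrap the raw text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The page was plain text or HTML rather than JSON.
        return {"raw": text}
```

In root(), the last line would then become return parse_body(html).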

Sagaryal commented 2 years ago

Detailed explanation, for reference to the comment above

I incorrectly assumed that the package would also download the Chromium driver and browser, as Puppeteer does.

In this line, executable is empty, i.e. None, hence the error below:

scrapper  |   File "/usr/local/lib/python3.10/posixpath.py", line 152, in dirname
scrapper  |     p = os.fspath(p)
scrapper  | TypeError: expected str, bytes or os.PathLike object, not NoneType

When running locally (venv), the executable path was /usr/bin/google-chrome. In Docker, however, no chromedriver or browser is installed, either by the package or by us, and Selenium WebDriver needs both to open the browser and perform our tasks. 😐
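
A missing browser can be caught with a clearer message by checking for it at startup. The sketch below is only an illustration: find_browser and the list of candidate names are assumptions, not how undetected_chromedriver itself locates the browser.

```python
import shutil

# Assumed binary names; adjust to whatever your image actually installs.
CANDIDATE_NAMES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def find_browser(names=CANDIDATE_NAMES, which=shutil.which):
    """Return the first browser binary found on PATH, or None."""
    for name in names:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    browser = find_browser()
    if browser is None:
        raise SystemExit("No Chrome/Chromium binary found on PATH; install one in the image.")
    print("Using browser at", browser)
```

Running this inside the container before starting uvicorn turns the NoneType traceback into an explicit "browser not installed" message.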

Trying to install chromium and chromium-driver in the python:3.10-slim Docker image always gave me a Debian connection error. 😶

So I ended up using an Alpine image with some additional packages so that the requirements.txt packages install smoothly.

FROM python:3.10-alpine
RUN apk add --update make gcc g++ libc-dev chromium chromium-chromedriver

Hoping that it would now run smoothly 😁, I was shattered by yet another cryptic error 💔

With the help of this comment and the README, I tried setting executable_path and browser_executable_path but ran into the same error.

Then I got my hands dirty again by diving into the source code and found that there is no executable_path argument. There is a driver_executable_path argument instead, which is not documented anywhere and is not found automatically the way the browser executable is. 😑

After setting that argument to the chromedriver path, driver_executable_path='/usr/bin/chromedriver', voilà, it worked. 🤩 🥳

The reason driver_executable_path is not mentioned or normally needed is that the program creates a driver itself, for example /root/.local/share/undetected_chromedriver/739aa58183d6f966_chromedriver, every time you start it. In Docker, however, it did not create that chromedriver or otherwise download or install it (I haven't looked at that part of the code). So we had to install chromedriver manually and provide its path.

As a result, running it locally (venv) no longer works, because the driver path is now hard-coded to /usr/bin/chromedriver, where no chromedriver exists on the local machine. For now you might need to copy one of the latest drivers from ~/.local/share/undetected_chromedriver/ to /usr/bin/ and give it executable permission on your local machine.
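
One way to avoid hard-coding the path for both environments is to pass driver_executable_path only when a driver actually exists at that location. This is a sketch under assumptions: CHROMEDRIVER_PATH is a hypothetical environment variable and chrome_kwargs is a hypothetical helper.

```python
import os


def chrome_kwargs(env=os.environ, exists=os.path.isfile):
    """Build keyword arguments for uc.Chrome depending on the environment."""
    kwargs = {"headless": True}
    # CHROMEDRIVER_PATH is an assumed variable name; the default matches the
    # Alpine package location mentioned above.
    path = env.get("CHROMEDRIVER_PATH", "/usr/bin/chromedriver")
    if exists(path):
        kwargs["driver_executable_path"] = path
    return kwargs


# Intended usage (requires undetected_chromedriver and a browser):
#   import undetected_chromedriver as uc
#   driver = uc.Chrome(**chrome_kwargs())
```

Locally, where /usr/bin/chromedriver is absent, the argument is simply omitted, so the package should fall back to the driver it creates itself.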

Hope it's clear now. Thanks

Sagaryal commented 2 years ago

The reason driver_executable_path is not mentioned or normally needed is that the program creates a driver itself, for example /root/.local/share/undetected_chromedriver/739aa58183d6f966_chromedriver, every time you start it. In Docker, however, it did not create that chromedriver or otherwise download or install it (I haven't looked at that part of the code). So we had to install chromedriver manually and provide its path.

@ultrafunkamsterdam Sir, is there any reason why no chromedriver is downloaded or created in /root/.local/share/undetected_chromedriver?

ultrafunkamsterdam commented 2 years ago

I suggest using the official Docker image (Docker Hub username: ultrafunk) and reading its README.

blacksam07 commented 2 years ago

@Sagaryal Oh yes, yesterday I noticed that Chrome was not installed and tried installing it, but I was not aware of the problem with the chromedriver path 😞. Thanks for your explanation, this solution works for me 👏🏽.

I use my own Docker image because I need to set up more things to deploy on AWS Lambda.

yaguangtang commented 2 years ago

@blacksam07 Have you fixed the issue? I am working on the same thing: building a custom Docker image to be used by AWS Lambda.

rickardcronholm commented 2 years ago

@ultrafunkamsterdam Would you mind providing a Dockerfile for that image?

blacksam07 commented 2 years ago

@yaguangtang Sorry man, I forgot to reply to your message. Yes, I was able to solve the problem and create my own Docker image. This is the Dockerfile:

Dockerfile

# Define global args
ARG FUNCTION_DIR="/home/app/"
ARG RUNTIME_VERSION="3.9"
ARG DISTRO_VERSION="3.16"

# Stage 1 - bundle base image + runtime
# Grab a fresh copy of the image and install GCC
FROM python:${RUNTIME_VERSION}-alpine${DISTRO_VERSION} AS python-alpine
# Install GCC (Alpine uses musl but we compile and link dependencies with GCC)
RUN apk add --no-cache \
    libstdc++

# Stage 2 - build function and dependencies
FROM python-alpine AS build-image
# Install aws-lambda-cpp build dependencies
RUN apk add --no-cache \
    build-base \
    libtool \
    autoconf \
    automake \
    libexecinfo-dev \
    make \
    cmake \
    libcurl \
    curl \
    gcc \
    g++
# Include global args in this stage of the build
ARG FUNCTION_DIR
ARG RUNTIME_VERSION
# Create function directory
RUN mkdir -p ${FUNCTION_DIR}
# Copy required files
COPY patcher.py ${FUNCTION_DIR}
COPY function_name.py ${FUNCTION_DIR}
COPY requirements.txt .
# Optional – Install the function's dependencies
RUN python${RUNTIME_VERSION} -m pip install --upgrade pip
RUN python${RUNTIME_VERSION} -m pip install -r requirements.txt --target ${FUNCTION_DIR}
# Fix undetected_chromedriver to use in lambda
RUN cd ${FUNCTION_DIR} && cp -f patcher.py ${FUNCTION_DIR}/undetected_chromedriver
# Install Lambda Runtime Interface Client for Python
RUN python${RUNTIME_VERSION} -m pip install awslambdaric --target ${FUNCTION_DIR}

# Stage 3 - final runtime image
# Grab a fresh copy of the Python image
FROM python-alpine
# Include global arg in this stage of the build
ARG FUNCTION_DIR
# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}
# Copy in the built dependencies
COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR}
RUN apk add --no-cache chromium 
RUN wget https://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_linux64.zip
RUN cp /usr/bin/chromedriver ${FUNCTION_DIR}
# (Optional) Add Lambda Runtime Interface Emulator and use a script in the ENTRYPOINT for simpler local runs
ADD https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie /usr/bin/aws-lambda-rie
COPY entry.sh /
RUN chmod 755 /usr/bin/aws-lambda-rie /entry.sh
ENTRYPOINT [ "/entry.sh" ]
CMD [ "function.handler" ]

entry.sh

#!/bin/sh
if [ -z "${AWS_LAMBDA_RUNTIME_API}" ]; then
    exec /usr/bin/aws-lambda-rie /usr/local/bin/python -m awslambdaric $1
else
    exec /usr/local/bin/python -m awslambdaric $1
fi

The patcher file is the same as in PR #643, which is not merged, but you need this change to run on AWS. Remember that you need to create an EC2 instance and deploy to AWS from that instance. If you need more help, write to me.

esamhassan1 commented 2 years ago

@blacksam07 I used your Dockerfile but I get: Error: fork/exec /entry.sh: no such file or directory. Do you know the reason for this?

blacksam07 commented 2 years ago

@esamhassan1 It's possible that the entry.sh file is not in the same folder as the Dockerfile, so it is not copied into the Docker image.

esamhassan1 commented 2 years ago

@blacksam07 I have it, otherwise the build would have raised an error. I get this error when trying to run it on AWS Lambda. Did you run it on AWS yourself?

esamhassan1 commented 2 years ago

I solved it by using the direct path in the entrypoint, ENTRYPOINT ["/usr/local/bin/python", "-m", "awslambdaric"], but now I get this error:

{
  "errorMessage": "[Errno 30] Read-only file system: '/home/sbx_user1051'",
  "errorType": "OSError",
  "requestId": "ceaeedc7-b520-457e-96fd-e2020b26c5ef",
  "stackTrace": [
    " File \"/home/app/app.py\", line 34, in lambda_handler\n driver = uc.Chrome(headless=True\n",
    " File \"/home/app/undetected_chromedriver/__init__.py\", line 235, in __init__\n patcher = Patcher(\n",
    " File \"/home/app/undetected_chromedriver/patcher.py\", line 66, in __init__\n os.makedirs(self.data_path, exist_ok=True)\n",
    " File \"/usr/local/lib/python3.9/os.py\", line 215, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/usr/local/lib/python3.9/os.py\", line 215, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/usr/local/lib/python3.9/os.py\", line 215, in makedirs\n makedirs(head, exist_ok=exist_ok)\n",
    " File \"/usr/local/lib/python3.9/os.py\", line 225, in makedirs\n mkdir(name, mode)\n"
  ]
}

blacksam07 commented 2 years ago

@esamhassan1 Yes, with this config I can run on AWS without problems.

This error occurs because you need to change patcher.py according to PR #643.

esamhassan1 commented 2 years ago

@blacksam07 I did that but still get the error. I suspect I still have a problem with the path: even with the new patcher file it tries to write to the wrong directory.
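
For context, on AWS Lambda the filesystem is read-only except for /tmp, which matches the OSError above. A minimal sketch of picking a writable directory before starting the driver follows. first_writable_dir is a hypothetical helper, and how the chosen path is wired into patcher.py depends on the changes in PR #643, which are not shown in this thread.

```python
import os
import tempfile


def first_writable_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            # Read-only filesystem, permission denied, or an invalid path.
            continue
        if os.access(path, os.W_OK):
            return path
    return None


if __name__ == "__main__":
    # Assumed candidate order: the usual home location first, then the temp dir.
    home_data = os.path.expanduser("~/.local/share/undetected_chromedriver")
    tmp_data = os.path.join(tempfile.gettempdir(), "undetected_chromedriver")
    print(first_writable_dir([home_data, tmp_data]))
```

Printing the result inside the Lambda container shows which location is actually usable there.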

grayskripko commented 1 year ago

Related to https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/747

luiz-maxia commented 1 year ago

My problem was similar. I was getting some zip-related errors, and Sagaryal's answer pointed me to the solution. Thank you!

To be more precise for future readers: I was trying to use multiprocessing (more than one worker) in my FastAPI application (which is quite complex).

But I was getting this traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/archive.py", line 39, in __extract_zip
    archive.extractall(to_directory)
  File "/usr/local/lib/python3.9/zipfile.py", line 1642, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/usr/local/lib/python3.9/zipfile.py", line 1695, in _extract_member
    with self.open(member, pwd=pwd) as source, \
  File "/usr/local/lib/python3.9/zipfile.py", line 1529, in open
    raise BadZipFile("Truncated file header")
zipfile.BadZipFile: Truncated file header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
    target(sockets=sockets)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/server.py", line 60, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/uvicorn/server.py", line 67, in serve
    config.load()
  File "/usr/local/lib/python3.9/site-packages/uvicorn/config.py", line 477, in load
    self.loaded_app = import_from_string(self.app)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/api/./main.py", line 206, in <module>
    model_instance = model.Model.from_path(MODEL_DIR)
  File "/api/./model.py", line 2286, in from_path
    ChromeDriverManager().install()
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/chrome.py", line 39, in install
    driver_path = self._get_driver_path(self.driver)
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/manager.py", line 31, in _get_driver_path
    binary_path = self.driver_cache.save_file_to_cache(driver, file)
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/driver_cache.py", line 46, in save_file_to_cache
    files = archive.unpack(path)
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/archive.py", line 30, in unpack
    return self.__extract_zip(directory)
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/archive.py", line 41, in __extract_zip
    if e.args[0] not in [26, 13] and e.args[1] not in [
IndexError: tuple index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/webdriver_manager/core/archive.py", line 39, in __extract_zip
    archive.extractall(to_directory)
  File "/usr/local/lib/python3.9/zipfile.py", line 1642, in extractall
    self._extract_member(zipinfo, path, pwd)
  File "/usr/local/lib/python3.9/zipfile.py", line 1697, in _extract_member
    shutil.copyfileobj(source, target)
  File "/usr/local/lib/python3.9/shutil.py", line 205, in copyfileobj
    buf = fsrc_read(length)
  File "/usr/local/lib/python3.9/zipfile.py", line 924, in read
    data = self._read1(n)
  File "/usr/local/lib/python3.9/zipfile.py", line 992, in _read1
    data += self._read2(n - len(data))
  File "/usr/local/lib/python3.9/zipfile.py", line 1027, in _read2
    raise EOFError
EOFError

But the problem was quite simple. Somewhere in my code I was installing ChromeDriver via ChromeDriverManager from the webdriver_manager.chrome module, and it tried to install it in each worker (spawned process). To do that, it had to download the webdriver (as a zip file) and unzip it. Since all workers tried to access the zip at the same time, they conflicted.

The solution, thanks again to Sagaryal for the pointer, turned out to be quite simple: install ChromeDriver at build time by adding the following lines to my Dockerfile:

RUN pip install webdriver-manager==3.8.5
RUN python -c "from webdriver_manager.chrome import ChromeDriverManager; from os import environ; print(ChromeDriverManager(version=environ['CHROMEDRIVE_VERSION']).install())"

I set the version of my ChromeDriver for compatibility purposes, but if you are comfortable using the latest, adding just the following suffices:

RUN pip install webdriver-manager
RUN python -c "from webdriver_manager.chrome import ChromeDriverManager; print(ChromeDriverManager().install())"

Hope it helps someone having the same problem I had :)
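
If installing at build time is not possible, an alternative that was not used in this thread is to serialize the install across workers with a file lock, so only one process downloads and unzips while the others wait. The sketch below assumes a Linux container (fcntl is POSIX-only); install_once and the installer callback are hypothetical names, and the callback stands in for whatever performs the real download.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def install_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def install_once(target_path, installer, lock_path=None):
    """Run installer(target_path) only if target_path does not exist yet."""
    lock_path = lock_path or target_path + ".lock"
    with install_lock(lock_path):
        # Workers that waited on the lock find the file already present.
        if not os.path.exists(target_path):
            installer(target_path)
    return target_path
```

Each worker would call install_once at startup; the first one to take the lock performs the install and the rest skip it.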