omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
https://www.omkar.cloud/botasaurus/
MIT License
1.18k stars 107 forks

Fail close driver on browser crashing #7

Closed: goforbroke1006 closed this issue 1 year ago

goforbroke1006 commented 1 year ago

Description

Any selenium.common.exceptions.InvalidSessionIdException error aborts the bose.launch_tasks.launch_tasks function.

Steps to Reproduce

  1. Create a task that scans a site and iterates over several pages (http://site/page-1, http://site/page-2, ..., http://site/page-6).
  2. Package the code into a Docker image.
  3. Run the container.
  4. Errors appear with messages like "Message: unknown error: session deleted because of page crash".
  5. Catch InvalidSessionIdException inside task.run(self, driver: BoseDriver, data: any).
  6. The run then fails with TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'.

Expected behavior: a task whose browser crashed still finishes normally.

Actual behavior: the broken task stops the whole process, and the remaining tasks are not executed.

Reproduces how often: on sites with bot detection, in about 99% of cases.

Additional context

I can't reproduce this on the host machine, only inside a Docker container.

Full stack-trace:

Traceback (most recent call last):
2023-07-16T19:02:56.132915200Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 210, in run_task
2023-07-16T19:02:56.132923000Z     close_driver(driver)
2023-07-16T19:02:56.132927600Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 203, in close_driver
2023-07-16T19:02:56.132939700Z     driver.close()
2023-07-16T19:02:56.133008400Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 551, in close
2023-07-16T19:02:56.133173900Z     self.execute(Command.CLOSE)
2023-07-16T19:02:56.133260300Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
2023-07-16T19:02:56.133411400Z     self.error_handler.check_response(response)
2023-07-16T19:02:56.133519800Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
2023-07-16T19:02:56.133590700Z     raise exception_class(message, screen, stacktrace)
2023-07-16T19:02:56.133747600Z selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
2023-07-16T19:02:56.133802800Z Stacktrace:
2023-07-16T19:02:56.133812700Z #0 0x55a7b015a233 <unknown>
2023-07-16T19:02:56.133817900Z #1 0x55a7afe89770 <unknown>
2023-07-16T19:02:56.133822800Z #2 0x55a7afeb9589 <unknown>
2023-07-16T19:02:56.133826700Z #3 0x55a7afee4b86 <unknown>
2023-07-16T19:02:56.133830800Z #4 0x55a7afee0dea <unknown>
2023-07-16T19:02:56.133834700Z #5 0x55a7afee0516 <unknown>
2023-07-16T19:02:56.133838800Z #6 0x55a7afe593a3 <unknown>
2023-07-16T19:02:56.133843100Z #7 0x55a7b011a114 <unknown>
2023-07-16T19:02:56.133858100Z #8 0x55a7b011df67 <unknown>
2023-07-16T19:02:56.133863200Z #9 0x55a7b01286b0 <unknown>
2023-07-16T19:02:56.133867700Z #10 0x55a7b011ebb3 <unknown>
2023-07-16T19:02:56.133871100Z #11 0x55a7b00ec95a <unknown>
2023-07-16T19:02:56.133874900Z #12 0x55a7afe57b83 <unknown>
2023-07-16T19:02:56.133878600Z #13 0x7f92a414e18a <unknown>
2023-07-16T19:02:56.133882400Z 
2023-07-16T19:02:56.133886500Z 
2023-07-16T19:02:56.133890300Z During handling of the above exception, another exception occurred:
2023-07-16T19:02:56.133893900Z 
2023-07-16T19:02:56.133898000Z Traceback (most recent call last):
2023-07-16T19:02:56.133902300Z   File "/code/main.py", line 5, in <module>
2023-07-16T19:02:56.133907800Z     launch_tasks(*tasks_to_be_run)
2023-07-16T19:02:56.133919000Z   File "/code/venv/lib/python3.11/site-packages/bose/launch_tasks.py", line 54, in launch_tasks
2023-07-16T19:02:56.134112000Z     current_output = task.begin_task(current_data, task_config)
2023-07-16T19:02:56.134164700Z                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134195200Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 237, in begin_task
2023-07-16T19:02:56.134256000Z     final = run_task(False, 0)
2023-07-16T19:02:56.134311100Z             ^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134322600Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 221, in run_task
2023-07-16T19:02:56.134434300Z     end_task(driver)
2023-07-16T19:02:56.134487500Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 149, in end_task
2023-07-16T19:02:56.134570900Z     task.end()
2023-07-16T19:02:56.134647300Z   File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 38, in end
2023-07-16T19:02:56.134716200Z     self.data["duration"] = format_time_diff(self.data["start_time"],self.data["end_time"])
2023-07-16T19:02:56.134774900Z                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134807900Z   File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 11, in format_time_diff
2023-07-16T19:02:56.134914700Z     time_diff = end_time - start_time
2023-07-16T19:02:56.134991400Z                 ~~~~~~~~~^~~~~~~~~~~~
2023-07-16T19:02:56.135022200Z TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'
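
The final TypeError says that format_time_diff subtracted a str from a datetime.datetime, so end_time was a datetime while start_time had already become a string. The trace does not show why. The following is a minimal sketch of that failure and one defensive way around it, not Bose's actual code: format_time_diff here is a simplified stand-in, to_datetime and safe_time_diff are hypothetical helpers, and the ISO-8601 string format is an assumption.

```python
from datetime import datetime


def format_time_diff(start_time, end_time):
    """Simplified stand-in for the subtraction shown in the traceback."""
    return end_time - start_time


def to_datetime(value):
    """Hypothetical helper: accept a datetime or an ISO-8601 string."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


def safe_time_diff(start_time, end_time):
    """Normalize both operands before subtracting."""
    return to_datetime(end_time) - to_datetime(start_time)


# Assumption: start_time was already serialized to a string.
start = "2023-07-16T19:02:50"
end = datetime(2023, 7, 16, 19, 2, 56)

error_message = None
try:
    format_time_diff(start, end)
except TypeError as exc:
    # Same kind of error as in the issue: datetime minus str.
    error_message = str(exc)

elapsed = safe_time_diff(start, end)
```

Normalizing the operands avoids the second exception, but it does not address the crashed browser session that triggered the error path in the first place.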
Chetan11-dev commented 1 year ago

Update Bose by executing the following command: python -m pip install bose --upgrade.

In the browser configuration, utilize the "close on crash" and "undetected" options:

class Task(BaseTask):
    task_config = TaskConfig(
        close_on_crash=True,
        use_undetected_driver=True,
    )

Also, Selenium is crashing in Docker because of low memory. Increase the container's memory and it should work. Could you also share your Dockerfile?
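
As a rough illustration of what tolerating a crashed browser at close time can look like, here is a minimal sketch. It is not how close_on_crash is implemented in Bose; close_driver_safely, CrashedDriver, and HealthyDriver are hypothetical names, and a plain RuntimeError stands in for Selenium's InvalidSessionIdException so the example runs without Selenium installed.

```python
import logging

logger = logging.getLogger(__name__)


def close_driver_safely(driver):
    """Try to close the driver; return False instead of raising if it fails."""
    try:
        driver.close()
        return True
    except Exception as exc:  # with Selenium, e.g. InvalidSessionIdException
        logger.warning("Driver could not be closed cleanly: %s", exc)
        return False


class CrashedDriver:
    """Hypothetical stand-in for a driver whose browser session is gone."""

    def close(self):
        raise RuntimeError("invalid session id")


class HealthyDriver:
    """Hypothetical stand-in for a driver that closes normally."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# The failed close is logged and the loop moves on to the next driver.
results = [close_driver_safely(d) for d in (CrashedDriver(), HealthyDriver())]
```

The point of the pattern is that a failure while cleaning up one task is reported but does not stop the tasks that follow.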

goforbroke1006 commented 1 year ago

No, I have 2.0.8, which is not that old a version. Yeah, I found this option (close_on_crash), thanks! But I think it is important to mention somewhere in the guides that the webdriver requires shared memory (shm), and that if you run it with Docker you have to specify --shm-size, because shm was the reason for the InvalidSessionIdException.
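
Docker gives a container 64 MB of /dev/shm by default unless --shm-size (or shm_size in Compose) is set. A small startup check can make an undersized shared-memory mount visible before the browser is launched. This is only a sketch: shm_is_large_enough is a hypothetical helper, and the 512 MiB threshold is an illustrative assumption, not a documented requirement of Chromium or Bose.

```python
import os
import shutil

# Assumption: illustrative threshold, not an official minimum.
MIN_SHM_BYTES = 512 * 1024 * 1024


def shm_is_large_enough(path="/dev/shm", min_bytes=MIN_SHM_BYTES):
    """Return True if the filesystem mounted at `path` has at least `min_bytes` in total."""
    total = shutil.disk_usage(path).total
    return total >= min_bytes


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    if shm_is_large_enough():
        print("/dev/shm looks large enough")
    else:
        print("/dev/shm is small; consider running the container with --shm-size")
```

Running the check at the top of main.py would turn a confusing mid-run session crash into an early, readable warning.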

Chetan11-dev commented 1 year ago

@goforbroke1006 I'd be interested to see a Selenium Dockerfile for learning purposes. Could you show yours?

goforbroke1006 commented 1 year ago

Yeah, sure! This config fits my purposes:

FROM debian:bookworm-slim

RUN apt update && apt upgrade -y

RUN apt install -y curl unzip

RUN apt-get install python3.11 python3-pip python3.11-venv -y
RUN python3 --version

# https://packages.debian.org/sid/chromium
ARG CHROME_VERSION='114.0.5735.198-1'
ARG CHROMIUM_DEB_VERSION="${CHROME_VERSION}~deb12u1"
# http://chromedriver.storage.googleapis.com/
ARG CHROMEDRIVER_VERSION='114.0.5735.90'

RUN apt install -y \
    chromium-common=$CHROMIUM_DEB_VERSION \
    chromium-sandbox=$CHROMIUM_DEB_VERSION \
    chromium=$CHROMIUM_DEB_VERSION

RUN mkdir -p /code/build/
RUN curl -O -L http://chromedriver.storage.googleapis.com/${CHROMEDRIVER_VERSION}/chromedriver_linux64.zip
RUN unzip ./chromedriver_linux64.zip -d /code/build/
RUN chmod -R 0777 /code/build/

ENV PYTHONUNBUFFERED=1
ENV PYTHONIOENCODING=utf-8
ENV PYTHONLEGACYWINDOWSSTDIO=utf-8
ENV ENV=production

WORKDIR /code/

ADD requirements.txt /code/requirements.txt
RUN python3 -m venv ./venv && . venv/bin/activate && pip3 install -r requirements.txt

COPY ./src         /code/src
COPY ./launcher.py /code/launcher.py
COPY ./main.py     /code/main.py

RUN echo '#!/bin/bash \n\
\n\
args=$* \n\
\n\
. venv/bin/activate \n\
python3 main.py ${args} \n\
' > /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
goforbroke1006 commented 1 year ago

And the Compose file looks like this:

version: "3.9"

services:

  .base-task: &base-task
    image: docker.io/goforbroke1006/my-awesome-project:latest
    volumes:
      - ./output:/code/output:rw
      - ./profiles:/code/profiles:rw
      - ./tasks:/code/tasks:rw
      - ./local_storage.json:/code/local_storage.json:rw
      - ./profiles.json:/code/profiles.json:rw
    shm_size: "512Mb"

  task1-scan-someting:
    <<: *base-task
    command: someting-one

  task2-scan-someting:
    <<: *base-task
    command: someting-two
Chetan11-dev commented 1 year ago

Thanks