Kaiden0001 opened this issue 5 months ago
Dockerfile
FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof
RUN mkdir app
WORKDIR /app
COPY . /app
CMD ["python", "run.py", "backend"]
The only solution is to upgrade to the latest version; with it, this error will not occur. Upgrade by running:
python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade
Is there no way to fix it on the old version?
You need to use the new version to resolve it.
Same problem on the new version.
requirements.txt
cchardet==2.1.7
botasaurus-requests==4.0.16
bota==4.0.62
botasaurus==4.0.34
botasaurus_api==4.0.4
botasaurus_driver==4.0.30
botasaurus-proxy-authentication==1.0.16
botasaurus_server==4.0.23
deprecated==1.2.14
After every request
root@s# ps -A -ostat,pid,ppid | grep -e '[zZ]'
Z 3388440 3388338
Z 3388441 3388338
Z 3388443 3388338
Z 3388445 3388338
Z 3388450 3388338
Z 3388451 3388338
Z 3388452 3388338
Z 3388630 3388338
And with each request, they increase
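A quick way to watch the count climb per request is the same ps filter as above, just counted (an illustrative one-liner, not from the original report):
watch -n 1 "ps -A -ostat,pid,ppid | grep -c '[zZ]'"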
code
from botasaurus.browser import browser, Driver
from botasaurus.request import request
from botasaurus_driver.user_agent import UserAgent
from botasaurus_driver.window_size import WindowSize


@request
def scrape_heading_task(requests, botasaurus_request: dict):
    # A browser-decorated scraper is defined per request so that each call
    # gets its own browser configured from the incoming task data.
    @browser(
        block_images_and_css=True,
        user_agent=botasaurus_request.get("user_agent") or UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        output=None,
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox"],
        proxy=botasaurus_request.get("proxy") or None,
    )
    def scrape(driver: Driver, data):
        driver.google_get(
            link=botasaurus_request.get("url"),
            bypass_cloudflare=bool(botasaurus_request.get("bypass_cloudflare")),
            wait=botasaurus_request.get("wait"),
        )
        return {"text": driver.page_html, "cookies": driver.get_cookies()}

    try:
        return scrape()
    except Exception as e:
        return {"error": str(e)}
Dockerfile
FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof xvfb
RUN mkdir app
WORKDIR /app
COPY . /app
CMD ["python", "run.py", "backend"]
OS: Ubuntu 22.04 LTS x86_64
running in Docker
scrapers.py
import os

from botasaurus_server.server import Server
from src.scrape_heading_task import scrape_heading_task

# Cast to int: os.getenv returns a string whenever MAX_BROWSERS is set.
Server.rate_limit["browser"] = int(os.getenv("MAX_BROWSERS", 3))
Server.add_scraper(scrape_heading_task)
scrape_heading_task.js
/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls.link('url', {isRequired: true})
    controls.text('user_agent', {isRequired: false})
    controls.listOfTexts('window_size', {isRequired: false})
    controls.text('proxy', {isRequired: false})
    controls.number('max_retry', {isRequired: false, defaultValue: 2})
    controls.number('bypass_cloudflare', {isRequired: false, defaultValue: 0})
    controls.number('wait', {isRequired: false, defaultValue: 5})
}
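For reference, these controls produce a task payload roughly like this (values are illustrative, not taken from the thread):
data = {
    "url": "https://example.com",  # required
    "user_agent": None,            # optional; the scraper falls back to UserAgent.RANDOM
    "window_size": None,           # optional; falls back to WindowSize.RANDOM
    "proxy": None,
    "max_retry": 2,
    "bypass_cloudflare": 0,
    "wait": 5,
}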
call
from botasaurus_api import Api  # client for the botasaurus server API

api = Api(server_url)
data = self.get_data(botasaurus_request)
task = api.create_async_task(
    data=data,
    scraper_name="scrape_heading_task",
)
result = self.get_task_result(
    api,
    task.get("id"),
    botasaurus_request.timeout,
    botasaurus_request.wait,
)
This issue occurs only in Docker. To resolve it, run:
python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade
With this, the zombie processes will be periodically purged and won't exceed 10 at any point.
Unfortunately, I have the same problem. Upgrading to the latest version doesn't really help.
Could someone please provide some information why this is happening and how it's possible to debug this error?
Please run python -m pip install bota botasaurus botasaurus-api botasaurus-requests botasaurus-driver botasaurus-proxy-authentication botasaurus-server --upgrade. If that does not work, please share steps to reproduce the error.
Well, I think I was able to reproduce the bug and figure out how to fix it.
To reproduce the bug, use the official botasaurus starter project: run it in a Docker container via the docker-compose.yml file, launch some scraping tasks from the web interface, and then run the top command inside your container. You'll probably see some zombie processes from Chrome.
The problem exists in all Docker/Podman containers due to the PID 1 zombie reaping problem: the container's entrypoint runs as PID 1 and inherits orphaned Chrome children, but unless it wait()s on them, their exit statuses are never collected and they linger as zombies (see the sketch below).
To solve this problem you could use images as suggested in the article above.
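A minimal, purely illustrative Python sketch of the reaping that an init process such as tini performs (a simplified assumption of mine, not botasaurus code):
import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of every terminated child without blocking,
    # so none of them lingers in the zombie (Z) state.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left to reap
        if pid == 0:
            break  # children exist, but none have exited yet

# A real init installs this handler as PID 1; ordinary apps usually don't.
signal.signal(signal.SIGCHLD, reap_children)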
Another solution (which I actually prefer) is to use the --init flag (init: true in your docker-compose.yml file); see the Docker documentation.
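If you start the container with plain docker run instead of Compose, the same fix is the --init flag (the image name here is only a placeholder):
docker run --init -p 3000:3000 -p 8000:8000 my-botasaurus-image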
How to use the --init flag?
If you're using a docker-compose.yml file, just add init: true to the service that runs botasaurus, like this:
services:
  bot-1:
    init: true
    restart: "no"
    shm_size: 800m
    build:
      dockerfile: Dockerfile
      context: .
    volumes:
      - .:/app
    ports:
      - "3000:3000"
      - "8000:8000"
exactly
thanks
Hi, with the requirements.txt above, zombie processes remain after execution, and after about 800 requests an error occurs.
How can I make them get deleted after each request?