Closed Kaiden0001 closed 1 month ago
Dockerfile
FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof
RUN mkdir app
WORKDIR /app
COPY . /app
CMD ["python", "run.py", "backend"]
The only solution is to upgrade to the latest version; with that, this error will not occur. Upgrade by running python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade
Is there no way to fix it on the old version?
You need to use the new version to resolve it.
The same problem occurs on the new version.
requirements.txt
cchardet==2.1.7
botasaurus-requests==4.0.16
bota==4.0.62
botasaurus==4.0.34
botasaurus_api==4.0.4
botasaurus_driver==4.0.30
botasaurus-proxy-authentication==1.0.16
botasaurus_server==4.0.23
deprecated==1.2.14
After every request
root@s# ps -A -ostat,pid,ppid | grep -e '[zZ]'
Z 3388440 3388338
Z 3388441 3388338
Z 3388443 3388338
Z 3388445 3388338
Z 3388450 3388338
Z 3388451 3388338
Z 3388452 3388338
Z 3388630 3388338
And with each request, their number increases.
code
from botasaurus.browser import browser, Driver
from botasaurus.request import request
from botasaurus_driver.user_agent import UserAgent
from botasaurus_driver.window_size import WindowSize


@request
def scrape_heading_task(requests, botasaurus_request: dict):
    @browser(
        block_images_and_css=True,
        user_agent=botasaurus_request.get("user_agent") or UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        output=None,
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox"],
        proxy=botasaurus_request.get("proxy") or None,
    )
    def scrape(driver: Driver, data):
        driver.google_get(
            link=botasaurus_request.get("url"),
            bypass_cloudflare=bool(botasaurus_request.get("bypass_cloudflare")),
            wait=botasaurus_request.get("wait"),
        )
        return {"text": driver.page_html, "cookies": driver.get_cookies()}

    try:
        return scrape()
    except Exception as e:
        return {"error": str(e)}
Dockerfile
FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1
COPY requirements.txt .
RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof xvfb
RUN mkdir app
WORKDIR /app
COPY . /app
CMD ["python", "run.py", "backend"]
OS: Ubuntu 22.04 LTS x86_64
running in Docker
scrapers.py
import os
from botasaurus_server.server import Server
from src.scrape_heading_task import scrape_heading_task
# os.getenv returns a string when the variable is set, so convert to int
Server.rate_limit["browser"] = int(os.getenv("MAX_BROWSERS", 3))
Server.add_scraper(scrape_heading_task)
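One thing to watch in scrapers.py: `os.getenv` returns a string when the variable is set, so `rate_limit` would receive `"3"`-style values unless converted. A small helper (the name `env_int` is mine, not from the library) keeps it an int with a safe fallback:

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an env var as an int, falling back to default when unset or invalid."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


# e.g. Server.rate_limit["browser"] = env_int("MAX_BROWSERS", 3)
```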
scrape_heading_task.js
/**
* @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
*/
/**
* @param {Controls} controls
*/
function getInput(controls) {
    controls.link('url', {isRequired: true})
    controls.text('user_agent', {isRequired: false})
    controls.listOfTexts('window_size', {isRequired: false})
    controls.text('proxy', {isRequired: false})
    controls.number('max_retry', {isRequired: false, defaultValue: 2})
    controls.number('bypass_cloudflare', {isRequired: false, defaultValue: 0})
    controls.number('wait', {isRequired: false, defaultValue: 5})
}
call
api = Api(server_url)
data = self.get_data(botasaurus_request)
task = api.create_async_task(
    data=data,
    scraper_name="scrape_heading_task",
)
result = self.get_task_result(
    api,
    task.get("id"),
    botasaurus_request.timeout,
    botasaurus_request.wait,
)
This issue occurs only in Docker. To resolve it, run:
python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade
With this, the zombie processes will be periodically purged and won't exceed 10 at any point.
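If the process owns its children and never needs their exit statuses, another general POSIX option (again, not botasaurus-specific, and it may conflict with libraries that call waitpid themselves) is to let the kernel auto-reap by ignoring SIGCHLD:

```python
import os
import signal
import time

# With SIGCHLD set to SIG_IGN, exited children are reaped by the kernel
# immediately instead of lingering as zombies.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits right away

time.sleep(0.3)  # give the kernel a moment to auto-reap it

# A zombie would still have a /proc/<pid> entry in state 'Z'; an
# auto-reaped child has no /proc entry at all (Linux-specific check).
gone = not os.path.exists(f"/proc/{pid}")
print(gone)
```

Restore the default with `signal.signal(signal.SIGCHLD, signal.SIG_DFL)` if any library later needs to wait on its own children.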
Hi,
Zombie processes remain after execution, and after about 800 requests an error occurs. How can I make them get deleted after each request?