omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
https://www.omkar.cloud/botasaurus/
MIT License

zombie processes #127

Closed Kaiden0001 closed 1 month ago

Kaiden0001 commented 1 month ago

Hi,

requirements.txt:

botasaurus==4.0.14
botasaurus_server==4.0.19
cchardet==2.1.7

Zombie processes remain after execution.

@request
def scrape_heading_task(requests: AntiDetectRequests, botasaurus_request: dict):
    @browser(
        user_agent=botasaurus_request.get("user_agent") or bt.UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or bt.WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox", "--headless=new"],
        output=None,
        proxy=botasaurus_request.get("proxy") or None,
        create_driver=create_stealth_driver(
            start_url=botasaurus_request.get("url"),
            raise_exception=True,
            wait=botasaurus_request.get("wait"),
        ),
    )
    def scrape(driver: AntiDetectDriver, data):
        return {"text": driver.page_source, "cookies": driver.get_cookies()}

    # run the nested browser task and return its result
    return scrape()

After 800 requests, one of these errors appears:

[Errno 11] Resource temporarily unavailable

or

('launch', 'Error: spawnSync /bin/sh EAGAIN
    at Object.spawnSync (node:internal/child_process:1117:20)
    at spawnSync (node:child_process:876:24)
    at execSync (node:child_process:957:15)
    at findChromeExecutables (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:217:25)
    at file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:103:46
    at Array.forEach (<anonymous>)
    at Module.linux (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:102:32)
    at Launcher.getFirstInstallation (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:122:43)
    at Launcher.launch (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:190:43)
    at Module.launch (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:33:20)')

How can I get them cleaned up after each request?
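The EAGAIN / "Resource temporarily unavailable" errors are what fork reports when the kernel refuses to create another process (e.g. the PID table or RLIMIT_NPROC is exhausted), which fits zombies piling up. What I want is essentially this kind of cleanup after every scrape; a minimal sketch with plain os calls (not a botasaurus API), assuming the defunct Chrome processes are direct children of the Python process running the scraper:

import os

def reap_zombie_children() -> int:
    """Collect exit statuses of already-exited child processes so they stop
    lingering as Z/<defunct> entries. Non-blocking; safe to call after each
    scrape. Caution: waitpid(-1, ...) also reaps children that other code in
    this process might be waiting on.
    """
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process currently has no children
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped += 1
    return reaped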

Kaiden0001 commented 1 month ago

Dockerfile

FROM chetan1111/botasaurus:latest

ENV PYTHONUNBUFFERED=1

COPY requirements.txt .

RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof

RUN mkdir app
WORKDIR /app
COPY . /app

CMD ["python", "run.py", "backend"]
Chetan11-dev commented 1 month ago

The only solution is to upgrade to the latest version; this error will not occur with it. Upgrade by running:

python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade

Kaiden0001 commented 1 month ago

Is there no way to fix it on the old version?

Chetan11-dev commented 1 month ago

You need to use the new version to resolve it.

Kaiden0001 commented 1 month ago

Same problem on the new version.

requirements.txt

cchardet==2.1.7
botasaurus-requests==4.0.16
bota==4.0.62
botasaurus==4.0.34
botasaurus_api==4.0.4
botasaurus_driver==4.0.30
botasaurus-proxy-authentication==1.0.16
botasaurus_server==4.0.23
deprecated==1.2.14

After every request

root@s# ps -A -ostat,pid,ppid | grep -e '[zZ]'
Z    3388440 3388338
Z    3388441 3388338
Z    3388443 3388338
Z    3388445 3388338
Z    3388450 3388338
Z    3388451 3388338
Z    3388452 3388338
Z    3388630 3388338

And they increase with each request.
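A quick way to watch this from inside the server process, mirroring the ps check above (Linux only, reads /proc directly; the helper name is just for illustration):

import os
from typing import Optional

def count_zombie_children(parent_pid: Optional[int] = None) -> int:
    """Count Z-state (zombie) processes whose parent is parent_pid.

    Rough Python equivalent of: ps -A -ostat,pid,ppid | grep -e '[zZ]'
    """
    parent_pid = parent_pid or os.getpid()
    count = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were iterating
        # the comm field may contain spaces, so parse after the closing ')'
        state, ppid = stat.rsplit(")", 1)[1].split()[:2]
        if state == "Z" and int(ppid) == parent_pid:
            count += 1
    return count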

Chetan11-dev commented 1 month ago
Kaiden0001 commented 1 month ago

code

from botasaurus.browser import browser, Driver
from botasaurus.request import request
from botasaurus_driver.user_agent import UserAgent
from botasaurus_driver.window_size import WindowSize

@request
def scrape_heading_task(requests, botasaurus_request: dict):
    @browser(
        block_images_and_css=True,
        user_agent=botasaurus_request.get("user_agent") or UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        output=None,
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox"],
        proxy=botasaurus_request.get("proxy") or None,
    )
    def scrape(driver: Driver, data):
        driver.google_get(
            link=botasaurus_request.get("url"),
            bypass_cloudflare=bool(botasaurus_request.get("bypass_cloudflare")),
            wait=botasaurus_request.get("wait"),
        )
        return {"text": driver.page_html, "cookies": driver.get_cookies()}

    try:
        return scrape()
    except Exception as e:
        return {"error": str(e)}

Dockerfile

FROM chetan1111/botasaurus:latest

ENV PYTHONUNBUFFERED=1

COPY requirements.txt .

RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof xvfb

RUN mkdir app
WORKDIR /app
COPY . /app

CMD ["python", "run.py", "backend"]

OS: Ubuntu 22.04 LTS x86_64

Chetan11-dev commented 1 month ago
Kaiden0001 commented 1 month ago

Running in Docker.

scrapers.py

import os

from botasaurus_server.server import Server
from src.scrape_heading_task import scrape_heading_task

# cast to int: os.getenv returns a string when the variable is set
Server.rate_limit["browser"] = int(os.getenv("MAX_BROWSERS", 3))
Server.add_scraper(scrape_heading_task)

scrape_heading_task.js

/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls.link('url', {isRequired: true})
    controls.text('user_agent', {isRequired: false})
    controls.listOfTexts('window_size', {isRequired: false})
    controls.text('proxy', {isRequired: false})
    controls.number('max_retry', {isRequired: false, defaultValue: 2})
    controls.number('bypass_cloudflare', {isRequired: false, defaultValue: 0})
    controls.number('wait', {isRequired: false, defaultValue: 5})
}
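For reference, these controls line up with the keys scrape_heading_task reads from botasaurus_request; a task payload looks roughly like this (values are just examples):

# example input matching the controls above; None fields fall back to the
# defaults used in scrape_heading_task (random user agent / window size, no proxy)
data = {
    "url": "https://example.com",
    "user_agent": None,
    "window_size": None,
    "proxy": None,
    "max_retry": 2,
    "bypass_cloudflare": 0,
    "wait": 5,
}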

call

# inside one of our client-side methods; Api comes from botasaurus_api (see requirements.txt)
api = Api(server_url)

data = self.get_data(botasaurus_request)

task = api.create_async_task(
    data=data,
    scraper_name="scrape_heading_task",
)
result = self.get_task_result(
    api,
    task.get("id"),
    botasaurus_request.timeout,
    botasaurus_request.wait,
)
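get_task_result above is our own polling helper, not part of botasaurus_api. A rough sketch of what it does, assuming the client exposes something like api.get_task(task_id) returning a dict with "status" and "result" fields (the method and field names are assumptions; check the installed client):

import time

def get_task_result(api, task_id, timeout, wait):
    """Poll the server until the async task finishes or the timeout expires.

    api.get_task() and the "status"/"result" keys are assumed names;
    adjust them to whatever the botasaurus_api client actually exposes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = api.get_task(task_id)
        if task.get("status") in ("completed", "failed", "aborted"):
            return task.get("result")
        time.sleep(wait)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")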
Chetan11-dev commented 1 month ago

This issue occurs only in Docker. To resolve it, run this command:

python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade

With this, the zombie processes will be periodically purged and won't exceed 10 at any point.