omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
https://www.omkar.cloud/botasaurus/
MIT License

Node proxy-chain require timeout when using as celery worker task #114

Closed raunaqss closed 1 month ago

raunaqss commented 1 month ago

Hi there,

I'm trying to run a script using botasaurus, initiated by a Celery worker task. When I don't use a proxy with botasaurus, it works, but when I use a proxy within the script, I get the following error:

[2024-05-14 17:06:34,757: WARNING/ForkPoolWorker-7] require
[2024-05-14 17:06:34,758: WARNING/ForkPoolWorker-7]
[2024-05-14 17:06:34,758: WARNING/ForkPoolWorker-7] <Thread(Thread-3 (loop), stopped daemon 140362408519232)>
[2024-05-14 17:06:34,759: ERROR/ForkPoolWorker-7] An error occurred: Timed out accessing 'require'
[2024-05-14 17:06:34,759: ERROR/ForkPoolWorker-7] Event data: {'url': 'https://example.com'}
[2024-05-14 17:06:34,766: ERROR/ForkPoolWorker-7] Task scraping.celeryworkers.celery_main.callback[5aa6f898-cd0a-4dc2-a84d-907e3452a7e5] raised unexpected: Exception("Timed out accessing 'require'")
Traceback (most recent call last):
  File "path/to/project/venv/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "path/to/project/venv/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "path/to/project/scraping/celeryworkers/celery_main.py", line 105, in callback
    raise e  # Re-raise the exception to ensure AWS Lambda marks the invocation as failed
  File "path/to/project/scraping/celeryworkers/celery_main.py", line 46, in callback
    result = yelp_reviews_handler(event)
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/decorators.py", line 643, in wrapper_browser
    current_result = run_task(data_item, False, 0)
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/decorators.py", line 462, in run_task
    ) = create_options_and_driver_attributes_and_close_proxy(
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/create_driver_utils.py", line 303, in create_options_and_driver_attributes_and_close_proxy
    from botasaurus_proxy_authentication import add_proxy_options
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus_proxy_authentication/__init__.py", line 3, in <module>
    proxyChain = require("proxy-chain")
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/__init__.py", line 37, in require
    return config.global_jsi.require(name, version, calling_dir, timeout=900)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 230, in __getattr__
    methodType, val = self._exe.getProp(self._pffid, attr)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 150, in getProp
    resp = self.ipc("get", ffid, method)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 43, in ipc
    raise Exception(f"Timed out accessing '{attr}'")
Exception: Timed out accessing 'require'

It would be really helpful if we could get a way to configure that timeout. I understand this is quite a niche request, but I thought I'd put it out here for the record in any case.
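
For anyone hitting this before a configurable timeout lands: the require("proxy-chain") call goes through the javascript bridge, which runs a background Node.js process plus communication threads, and those generally do not survive the fork() performed by Celery's default prefork pool. Two hedged workarounds: start the worker with a non-forking pool (celery -A ... worker --pool=solo or --pool=threads), or isolate the scrape in a freshly spawned process per task, as in this sketch (the module path and task name below are hypothetical, mirroring the traceback):

import multiprocessing as mp

from celery import Celery

app = Celery("scraping")

def _run_handler(event, queue):
    # Import inside the child process so the javascript/Node bridge
    # starts fresh here instead of inside the forked Celery worker.
    from scraping.handlers import yelp_reviews_handler  # hypothetical path
    try:
        queue.put(("ok", yelp_reviews_handler(event)))
    except Exception as e:
        queue.put(("err", str(e)))

@app.task
def callback(event):
    ctx = mp.get_context("spawn")  # spawn a clean interpreter; do not fork
    queue = ctx.Queue()
    proc = ctx.Process(target=_run_handler, args=(event, queue))
    proc.start()
    status, payload = queue.get()  # read before join() to avoid a pipe deadlock
    proc.join()
    if status == "err":
        raise Exception(payload)
    return payload

Reading from the queue before join() matters: a child blocked on put() for a large result would otherwise deadlock against the parent's join().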

Chetan11-dev commented 1 month ago

We have released v4, which allows you to easily make an API out of any web scraper. I suggest using it. To do so, please run the following command:

python -m pip install bota botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade

Then read the documentation at https://github.com/omkarcloud/botasaurus.
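
For reference, a minimal v4 task looks roughly like this (a sketch based on the README example; the URL and selector are illustrative):

from botasaurus.browser import browser, Driver

@browser
def scrape_heading_task(driver: Driver, data):
    # Visit the page and grab the main heading
    driver.get("https://www.omkar.cloud/")
    heading = driver.get_text("h1")
    return {"heading": heading}

# Runs the task; results are also written under output/
scrape_heading_task()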

raunaqss commented 1 month ago

After upgrading, the site opens unblocked, but this driver works in a different way. No matter how long I wait, the HTML elements that were being parsed in the previous version do not load.

Chetan11-dev commented 1 month ago

Could you share the code? You may need to use driver.select.
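
For example, something along these lines (a sketch; the selector is illustrative, and wait is assumed to be a timeout in seconds):

# Wait for the reviews container to appear before reading the page source
driver.select("#reviews", wait=10)
html_source = driver.page_html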

raunaqss commented 1 month ago

Yes, here is the code. The exact same technique (except for migration-related changes, of course) works with the previous version of botasaurus. I usually get the page source and prefer selecting with bs4 itself.

from bs4 import BeautifulSoup
from decouple import config
from botasaurus.browser import browser, Driver
from botasaurus.user_agent import UserAgent
import time
import math
import re
from pprint import pprint
import dateparser

@browser(
    proxy=config('US_PROXY'),
    headless=True,
    user_agent=UserAgent.REAL,
    # block_images_and_css=True,
    create_error_logs=False,
    # wait_for_complete_page_load=True
)
def handler(driver: Driver, event, context=None):
    url = event.get('url')
    if not url:
        return "No URL provided"

    all_reviews = []
    current_page = 1
    per_page = 10
    # Base URL, e.g. https://www.yelp.com from a full business URL
    domain_url = '/'.join(url.split('/')[:3])

    while True:
        offset = (current_page - 1) * per_page
        endpoint_url = f"{url}?start={offset}&sort_by=date_desc#reviews"
        driver.get(endpoint_url)

        driver.long_random_sleep()  # Wait for the page to load
        driver.long_random_sleep()

        html_source = driver.page_html
        soup = BeautifulSoup(html_source, 'html.parser')

        # parse_html (defined elsewhere) returns {'reviews': [...], 'no_of_pages': N}
        page_data = parse_html(soup, domain_url)
        all_reviews.extend(page_data['reviews'])

        if current_page >= page_data['no_of_pages']:
            break
        current_page += 1

    return all_reviews
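
For what it's worth, the decorated task can also be invoked directly for local testing (URL illustrative), since the @browser decorator handles launching the driver:

handler({'url': 'https://www.yelp.com/biz/some-business'})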

raunaqss commented 1 month ago

Actually, the source website changed their content. I checked, and the backup parser has the same issue. Sorry for the false positive.

Chetan11-dev commented 1 month ago

No worries.