Closed raunaqss closed 1 month ago
We have released v4, which lets you easily turn any web scraper into an API. I suggest upgrading; to do so, please run the following command:
python -m pip install bota botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade
Then read the documentation at https://github.com/omkarcloud/botasaurus.
After upgrading, the site opens up unblocked, but this driver works in a different way: no matter how long I wait, the HTML elements that were being parsed in the previous version do not load.
Could you share the code? You need to use driver.select.
Yes, here is the code. The exact same technique (except for migration-related changes, of course) works with the previous version of botasaurus. I usually get the page source and prefer selecting with bs4 itself.
from bs4 import BeautifulSoup
from decouple import config
from botasaurus.browser import browser, Driver
from botasaurus.user_agent import UserAgent
import time
import math
import re
from pprint import pprint
import dateparser

@browser(
    proxy=config('US_PROXY'),
    headless=True,
    user_agent=UserAgent.REAL,
    # block_images_and_css=True,
    create_error_logs=False,
    # wait_for_complete_page_load=True
)
def handler(driver: Driver, event, context=None):
    url = event.get('url')
    if not url:
        return "No URL provided"

    all_reviews = []
    current_page = 1
    per_page = 10
    domain_url = '/'.join(url.split('/')[:3])

    while True:
        offset = (current_page - 1) * per_page
        endpoint_url = f"{url}?start={offset}&sort_by=date_desc#reviews"
        driver.get(endpoint_url)
        driver.long_random_sleep()  # Wait for the page to load
        driver.long_random_sleep()
        html_source = driver.page_html
        soup = BeautifulSoup(html_source, 'html.parser')
        page_data = parse_html(soup, domain_url)  # parse_html defined elsewhere
        all_reviews.extend(page_data['reviews'])
        if current_page >= page_data['no_of_pages']:
            break
        current_page += 1
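For what it's worth, the pagination logic in the loop above (offset = (current_page - 1) * per_page) can be isolated and tested without a browser. A minimal sketch; the helper name page_offsets is mine, not part of botasaurus:

```python
import math

def page_offsets(total_reviews, per_page=10):
    """Return the ?start= offsets needed to cover all reviews,
    mirroring the (current_page - 1) * per_page loop above."""
    pages = math.ceil(total_reviews / per_page)
    return [(p - 1) * per_page for p in range(1, pages + 1)]
```

For example, 25 reviews at 10 per page need the offsets 0, 10, and 20.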
Actually, the source website changed its content. I checked, and the backup parser has the same issue. Sorry for the false positive.
No worries.
Hi there,
I'm trying to run a botasaurus script initiated by a Celery worker task. When I'm not using a proxy with botasaurus, it works, but when I use a proxy within the script, I get the following error:
[2024-05-14 17:06:34,757: WARNING/ForkPoolWorker-7] require
[2024-05-14 17:06:34,758: WARNING/ForkPoolWorker-7]
[2024-05-14 17:06:34,758: WARNING/ForkPoolWorker-7] <Thread(Thread-3 (loop), stopped daemon 140362408519232)>
[2024-05-14 17:06:34,759: ERROR/ForkPoolWorker-7] An error occurred: Timed out accessing 'require'
[2024-05-14 17:06:34,759: ERROR/ForkPoolWorker-7] Event data: {'url': 'https://example.com'}
[2024-05-14 17:06:34,766: ERROR/ForkPoolWorker-7] Task scraping.celeryworkers.celery_main.callback[5aa6f898-cd0a-4dc2-a84d-907e3452a7e5] raised unexpected: Exception("Timed out accessing 'require'")
Traceback (most recent call last):
  File "path/to/project/venv/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "path/to/project/venv/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "path/to/project/scraping/celeryworkers/celery_main.py", line 105, in callback
    raise e  # Re-raise the exception to ensure AWS Lambda marks the invocation as failed
  File "path/to/project/scraping/celeryworkers/celery_main.py", line 46, in callback
    result = yelp_reviews_handler(event)
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/decorators.py", line 643, in wrapper_browser
    current_result = run_task(data_item, False, 0)
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/decorators.py", line 462, in run_task
    ) = create_options_and_driver_attributes_and_close_proxy(
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus/create_driver_utils.py", line 303, in create_options_and_driver_attributes_and_close_proxy
    from botasaurus_proxy_authentication import add_proxy_options
  File "path/to/project/venv/lib/python3.10/site-packages/botasaurus_proxy_authentication/__init__.py", line 3, in <module>
    proxyChain = require("proxy-chain")
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/__init__.py", line 37, in require
    return config.global_jsi.require(name, version, calling_dir, timeout=900)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 230, in __getattr__
    methodType, val = self._exe.getProp(self._pffid, attr)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 150, in getProp
    resp = self.ipc("get", ffid, method)
  File "path/to/project/venv/lib/python3.10/site-packages/javascript/proxy.py", line 43, in ipc
    raise Exception(f"Timed out accessing '{attr}'")
Exception: Timed out accessing 'require'
It would be really helpful if there were a way to configure that timeout. I understand this is quite a niche request, but I thought I'd put it on record in any case.
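Until that timeout is configurable, one workaround is to wrap the scraper call in a retry helper inside the Celery task, so a transient "Timed out accessing 'require'" doesn't fail the whole invocation. A minimal sketch; with_retries is my own helper, not part of botasaurus or Celery (Celery's built-in autoretry_for is another option):

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, retry up to `attempts` total tries,
    sleeping `delay` seconds between tries. Re-raises the last error."""
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc

# Hypothetical usage inside the Celery callback:
# result = with_retries(lambda: yelp_reviews_handler(event), attempts=3)
```

This is coarse (it retries on any exception); in practice you would likely narrow the except clause to the bridge's timeout error.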