omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
https://www.omkar.cloud/botasaurus/
MIT License
1.14k stars 104 forks

Getting started with Botasaurus script throws lots of exceptions #24

Closed gameuser1982 closed 6 months ago

gameuser1982 commented 6 months ago

Description

I get a ton of exceptions when I run the first Selenium scraping task, which goes to https://www.omkar.cloud/ and grabs the h1 heading. It's the first Botasaurus script here:

from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")

    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/all.json
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

It's the first script on the "What is Botasaurus" docs page: https://www.omkar.cloud/botasaurus/docs/what-is-botasaurus/

Steps to Reproduce

  1. Run python main.py

Expected behavior:

The script scrapes the h1 heading, stores it as a string called heading, and returns it when the function is called (presumably the botasaurus framework then saves it to a JSON file automatically).

Actual behavior:

Lots of errors:

(py311selenium) C:\py311seleniumbot>python main.py
Running

DevTools listening on ws://127.0.0.1:64985/devtools/browser/6520850b-e749-463b-9c45-8e5ecdea678e
[24816:3140:1224/150501.718:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[24816:3140:1224/150501.917:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable
Error getting page source: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580980]
        (No symbol) [0x00581F8D]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x008483B8]
        (No symbol) [0x008484DD]
        (No symbol) [0x00835818]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 431, in save_screenshot
    self.get_screenshot_as_file(
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 927, in get_screenshot_as_file
    png = self.get_screenshot_as_png()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 963, in get_screenshot_as_png
    return b64decode(self.get_screenshot_as_base64().encode('ascii'))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 975, in get_screenshot_as_base64
    return self.execute(Command.SCREENSHOT)['value']
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Failed to save screenshot
Failed for input: None
We've paused the browser to help you debug. Press 'Enter' to close.
Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\main.py", line 18, in <module>
    scrape_heading_task()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 443, in wrapper_browser
    current_result = run_task(data_item, False, 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 411, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 249, in close_driver
    driver.close()
    ^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
    self.execute(Command.CLOSE)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Reproduces how often:

It happens every time.

Additional context

I set up a virtual environment with botasaurus.

gameuser1982 commented 6 months ago

Update: It's my own damn fault. I installed botasaurus into a virtual environment I had previously installed Selenium into, stupidly thinking they could co-exist without conflict. Wrong wrong wrong.

Solution: I uninstalled botasaurus from the virtual environment that I had originally used for Selenium, created a new virtual environment, and installed ONLY botasaurus.
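To compare the old and new environments, it can help to print which versions each one actually contains. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the distribution names "botasaurus" and "selenium" are assumed to match what pip installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Run this with the virtual environment activated.
    for name, version in installed_versions(["botasaurus", "selenium"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside each activated environment shows whether the two environments differ in the botasaurus or Selenium version they picked up.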

The script now scrapes as expected, though the certificate parsing errors still appear, so I am keeping this issue open. Do these cert errors mean that the connection to the website is insecure, or can they be safely ignored?

Here is the new output:

(py311botasaurus) C:\py311botasaurus>python main.py
Running
[INFO] Downloading Chrome Driver. This is a one-time process. Download in progress...

DevTools listening on ws://127.0.0.1:2309/devtools/browser/1ea8b6bd-45cd-4b14-af05-ef74b8bf8484
[6340:14368:1224/155004.893:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[6340:14368:1224/155005.099:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Written
     output/scrape_heading_task.json

(py311botasaurus) C:\py311botasaurus>

Chetan11-dev commented 6 months ago

Yes, these keep occurring; ignore them. Also, it wasn't your fault: I released buggy code yesterday (fixed now), which is why the exceptions occurred.
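For readers who hit the same traceback: `TypeError: 'bool' object is not callable` at `self.close_proxy()` means that `self.close_proxy` evaluated to a boolean when it was called. The thread does not say what the actual bug in the release was. The sketch below only illustrates one common way this error arises in Python, namely an instance attribute shadowing a method of the same name. The `Driver` class is a hypothetical stand-in, not botasaurus code.

```python
class Driver:
    """Hypothetical stand-in; this is not the real AntiDetectDriver."""

    def __init__(self):
        # Assumption for illustration: a flag stored under the same
        # name as a method defined on the class.
        self.close_proxy = False

    def close_proxy(self):
        return "proxy closed"

    def quit(self):
        # Attribute lookup finds the instance attribute (False) before
        # the class-level method, so this call raises TypeError.
        self.close_proxy()


def demo():
    """Return the error message produced by calling quit()."""
    try:
        Driver().quit()
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```

In a case like this, renaming either the flag or the method removes the collision.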

gameuser1982 commented 6 months ago

Wow nice! Thanks for the quick reply on this! This is a pretty awesome framework and the scraping side of things makes sense to me!

Chetan11-dev commented 6 months ago

Thanks, a lot of awesomeness is on its way that will seriously change the landscape of web scraping.