omkarcloud / botasaurus

The All in One Framework to build Awesome Scrapers.
MIT License
1.14k stars, 104 forks

Getting started with Botasaurus script throws lots of exceptions #24

Closed: gameuser1982 closed this issue 6 months ago

gameuser1982 commented 6 months ago


I am just seeing a ton of exceptions when trying to run the first Selenium scraping task, which navigates to the Omkar Cloud website and grabs the h1 heading. It's the first Botasaurus script here:

from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website

    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/all.json
    return {
        "heading": heading
    }

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

It's the first script in what is botasaurus:

Steps to Reproduce

  1. Run python

Expected behavior:

The script should scrape the h1 heading and return it in a dictionary under the key "heading" when the function is called. Presumably the Botasaurus framework then saves that result to a JSON file automatically.

Actual behavior:

Lots of errors:

(py311selenium) C:\py311seleniumbot>python

DevTools listening on ws://
[24816:3140:1224/] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[24816:3140:1224/] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 377, in run_task
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 250, in close_driver
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 470, in quit
TypeError: 'bool' object is not callable
Error getting page source: Message: invalid session id
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580980]
        (No symbol) [0x00581F8D]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x008483B8]
        (No symbol) [0x008484DD]
        (No symbol) [0x00835818]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 377, in run_task
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 250, in close_driver
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 470, in quit
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 431, in save_screenshot
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 927, in get_screenshot_as_file
    png = self.get_screenshot_as_png()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 963, in get_screenshot_as_png
    return b64decode(self.get_screenshot_as_base64().encode('ascii'))
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 975, in get_screenshot_as_base64
    return self.execute(Command.SCREENSHOT)['value']
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 429, in execute
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Failed to save screenshot
Failed for input: None
We've paused the browser to help you debug. Press 'Enter' to close.
Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 377, in run_task
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 250, in close_driver
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 470, in quit
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\", line 18, in <module>
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 443, in wrapper_browser
    current_result = run_task(data_item, False, 0)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 411, in run_task
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\", line 249, in close_driver
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 551, in close
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 429, in execute
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Reproduces how often:

It happens every time.

Additional context

I set up a virtual environment with botasaurus.
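When reporting an environment like this, it can help to list which versions of the relevant packages are actually installed. A minimal sketch using only the standard library follows; the helper name `installed_versions` is made up for illustration and is not part of Botasaurus or Selenium.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package_name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    report = installed_versions(["botasaurus", "selenium"])
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that runs the scraper (that is, with the virtual environment activated) so the report reflects the environment the script really uses.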

gameuser1982 commented 6 months ago

Update: It's my own damn fault. I installed botasaurus into a virtual environment where I had previously installed Selenium, stupidly thinking they could co-exist without conflict. Wrong, wrong, wrong.

Solution: I uninstalled botasaurus from the virtual environment that I had originally used for Selenium, then created a new virtual environment and installed ONLY botasaurus.
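The clean-environment step described above can be sketched with Python's standard `venv` module. This is only an illustration, not an official Botasaurus setup procedure: the function name `create_clean_env` is hypothetical, and the directory name in the usage note is borrowed from the prompt shown in this thread.

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_clean_env(env_dir, packages=()):
    """Create a new virtual environment and install only the given packages.

    Returns the path to the environment's Python interpreter.
    """
    env_dir = Path(env_dir)
    # pip is only bootstrapped when there is something to install.
    venv.EnvBuilder(with_pip=bool(packages)).create(env_dir)

    # Windows puts executables in Scripts\, other platforms in bin/.
    if sys.platform == "win32":
        python = env_dir / "Scripts" / "python.exe"
    else:
        python = env_dir / "bin" / "python"

    if packages:
        # Install with the new environment's own pip, not the caller's.
        subprocess.run(
            [str(python), "-m", "pip", "install", *packages],
            check=True,
        )
    return python
```

For example, `create_clean_env("py311botasaurus", ["botasaurus"])` would build an environment containing botasaurus and its dependencies only; the script is then run with the interpreter path the function returns.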

Now the script scrapes as expected, though the certificate parsing errors still appear, so I am keeping this issue open. Do these cert errors mean that the website is being connected to insecurely, or can they be safely ignored?

Here is the new output:

(py311botasaurus) C:\py311botasaurus>python
[INFO] Downloading Chrome Driver. This is a one-time process. Download in progress...

DevTools listening on ws://
[6340:14368:1224/] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[6340:14368:1224/] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate


(py311botasaurus) C:\py311botasaurus>

Chetan11-dev commented 6 months ago

Yes, these keep occurring; you can ignore them. Also, it wasn't your fault: I released buggy code yesterday (fixed now), and that's why the errors occurred.
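The thread does not say what the bug actually was. For readers puzzled by the `TypeError: 'bool' object is not callable` in the traceback, here is one common way that message arises in Python generally: an instance attribute shadows a method of the same name. The `Driver` class below is invented for illustration and is not Botasaurus code.

```python
class Driver:
    def __init__(self):
        # Bug: this instance attribute has the same name as the method below,
        # so on every instance `quit` now refers to the bool, not the method.
        self.quit = False

    def quit(self):
        return "browser closed"


def try_quit(driver):
    """Call driver.quit() and return either its result or the error text."""
    try:
        return driver.quit()
    except TypeError as exc:
        return str(exc)


message = try_quit(Driver())
print(message)  # 'bool' object is not callable
```

Renaming either the attribute or the method removes the clash.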

gameuser1982 commented 6 months ago

Wow nice! Thanks for the quick reply on this! This is a pretty awesome framework and the scraping side of things makes sense to me!

Chetan11-dev commented 6 months ago

Thanks, a lot of awesomeness is on its way that will seriously change the landscape of web scraping.