seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

Twitter scraping sometimes fails with selenium.common.exceptions.TimeoutException #2280

Closed: fashionprivate closed this issue 11 months ago

fashionprivate commented 11 months ago

Hello everybody,

I'm writing a bot that continuously scrapes a Twitter account to retrieve its most recent tweets. For reasons I can't identify, SeleniumBase raises an error after about 1 minute of error-free scraping. Below are the code and the error that appears after that first minute:

import sys
import time
import traceback
from threading import Thread
import threading
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from seleniumbase import Driver

LOCK = threading.Lock()

def retrieve_messages_twitter(args):

    profile_url = args['profile_url']
    index_window = args['index_window']
    driver = args['driver']

    # "with LOCK" releases the lock even when an exception is raised,
    # so a failure in one thread cannot leave the other thread blocked.
    with LOCK:
        driver.switch_to.window(driver.window_handles[index_window])
        try:
            driver.get(profile_url)
            last_tweet = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='tweetText']"))).text
            print(last_tweet)
            # Add the tweet in a list
        except Exception:
            print(traceback.format_exc())
            sys.exit()  # in a worker thread, this ends only that thread

    time.sleep(0.5)

    while True:
        with LOCK:
            try:
                driver.switch_to.window(driver.window_handles[index_window])
                driver.refresh()
                new_tweet = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='tweetText']"))).text

                # check some conditions on the tweet and if the condition is True, call a function

            except Exception:
                print(traceback.format_exc())
                sys.exit()  # in a worker thread, this ends only that thread

        time.sleep(0.5)

driver = Driver(uc=True, headless=True)

# Login to Twitter

profile_urls = ['https://twitter.com/elonmusk', 'https://twitter.com/BillGates']

threads = []

# Open one extra window per additional profile (the first window already exists)
for _ in range(len(profile_urls) - 1):
    driver.window_new()

for index, profile_url in enumerate(profile_urls):
    args = {}
    thread = None
    args['profile_url'] = profile_url
    args['driver'] = driver
    args['index_window'] = index
    thread = Thread(target = retrieve_messages_twitter, args = (args, ))
    threads.append(thread)
    thread.start()

for t in threads:
    t.join()

The error is the following:

[Screenshot of the error traceback]

What is wrong? The strange thing is that it works for about a minute, and after that the error is always selenium.common.exceptions.TimeoutException: Message: and I can't handle it. Is it possible to reinitialize the driver? Are there other solutions?

mdmintz commented 11 months ago

One, Twitter already provides an API that you can use to retrieve tweets instead of scraping the site: https://developer.twitter.com/en/docs/twitter-api.

Two, SeleniumBase methods have automatic waiting built in. You should never use the external implicitly_wait, WebDriverWait, or EC.presence_of_element_located anywhere in your code. Use the built-in methods instead. For the raw driver formats, see examples such as SeleniumBase/examples/raw_login_driver.py, SeleniumBase/examples/raw_driver_manager.py, and SeleniumBase/examples/offline_examples/test_extended_driver.py. For the pytest formats, see any example test that starts with test_ or ends with _test in the SeleniumBase/examples folder.
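
As a rough sketch of what that could look like for the code in the question, the WebDriverWait / EC pair can be swapped for a single waiting call on the driver. This is not taken from the linked examples. It assumes that the SeleniumBase driver exposes wait_for_element() and that the call accepts a timeout keyword, so check both against the examples above for your installed version. The names latest_tweet_text, TWEET_SELECTOR, and main are hypothetical helpers introduced here for illustration.

```python
TWEET_SELECTOR = "div[data-testid='tweetText']"  # selector from the question


def latest_tweet_text(driver, profile_url, selector=TWEET_SELECTOR, timeout=20):
    """Open a profile page and return the text of the first matching tweet."""
    driver.get(profile_url)
    # Assumption: wait_for_element() exists on the SeleniumBase driver and
    # takes a timeout keyword. It stands in for WebDriverWait + EC here.
    element = driver.wait_for_element(selector, timeout=timeout)
    return element.text


def main():
    # Imported here so the helper above can be exercised without a browser.
    from seleniumbase import Driver

    driver = Driver(uc=True, headless=True)
    try:
        print(latest_tweet_text(driver, "https://twitter.com/BillGates"))
    finally:
        driver.quit()  # always close the browser, even after an error
```

Calling main() launches a real browser. The helper only needs an object with get() and wait_for_element(), so it can also be tried against a stub driver first.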