twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.

[ERROR] Scraping a certain user's tweets returns a 443 connection error #783

Open icmpnorequest opened 4 years ago

icmpnorequest commented 4 years ago

Error Report: Scraping a certain user's tweets returns a 443 connection error

Initial Check

Command Ran

$ twint -u realDonaldTrump -o trump.txt
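For reference, the same run through twint's Python API rather than the CLI — a minimal sketch, assuming the standard twint Config attributes (Username, Output) and twint.run.Search, which the traceback below confirms the CLI calls internally:

import twint

# equivalent of `twint -u realDonaldTrump -o trump.txt`
c = twint.Config()
c.Username = "realDonaldTrump"  # user whose timeline to scrape
c.Output = "trump.txt"          # write results to this file
twint.run.Search(c)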

Description of Issue

I want to scrape Trump's tweets, but I get nothing except a 443 connection error. Could this be related to Twitter changing its frontend API?

The error is as follows:

CRITICAL:root:twint.get:User:Cannot connect to host twitter.com:443 ssl:True [Connect call failed ('31.13.66.23', 443)]
Traceback (most recent call last):
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 936, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 959, in create_connection
    raise exceptions[0]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 946, in create_connection
    await self.sock_connect(sock, address)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/selector_events.py", line 464, in sock_connect
    return await fut
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/selector_events.py", line 494, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
TimeoutError: [Errno 60] Connect call failed ('31.13.66.23', 443)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/bin/twint", line 11, in <module>
    load_entry_point('twint==2.1.20', 'console_scripts', 'twint')()
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/cli.py", line 305, in run_as_command
    main()
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/cli.py", line 297, in main
    run.Search(c)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 327, in Search
    run(config, callback)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 226, in run
    get_event_loop().run_until_complete(Twint(config).main(callback))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 154, in main
    await task
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 199, in run
    await self.tweets()
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 137, in tweets
    await self.Feed()
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/run.py", line 45, in Feed
    response = await get.RequestUrl(self.config, self.init, headers=[("User-Agent", self.user_agent)])
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/get.py", line 119, in RequestUrl
    response = await Request(_url, params=params, connector=_connector, headers=headers)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/get.py", line 143, in Request
    return await Response(session, url, params)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/twint/get.py", line 148, in Response
    async with session.get(url, ssl=True, params=params, proxy=httpproxy) as response:
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/client.py", line 1012, in __aenter__
    self._resp = await self._coro
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/client.py", line 483, in _request
    timeout=real_timeout
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 523, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 859, in _create_connection
    req, traces, timeout)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 1004, in _create_direct_connection
    raise last_exc
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 986, in _create_direct_connection
    req=req, client_error=client_error)
  File "/Users/yantong/PycharmProjects/crawler-learning/venv/lib/python3.7/site-packages/aiohttp/connector.py", line 943, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host twitter.com:443 ssl:True [Connect call failed ('31.13.66.23', 443)]
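One thing worth noting: the resolved address 31.13.66.23 appears to sit in an IP range registered to Facebook, not Twitter, which suggests DNS interference on the local network rather than a Twitter-side change. If that's the case, routing twint through a proxy may help — a minimal sketch, assuming twint's Proxy_host/Proxy_port/Proxy_type config options and a hypothetical local SOCKS5 proxy on port 1080 (substitute whatever proxy you actually have):

import twint

c = twint.Config()
c.Username = "realDonaldTrump"
c.Output = "trump.txt"
# hypothetical local proxy; adjust host/port/type to your own setup
c.Proxy_host = "127.0.0.1"
c.Proxy_port = 1080
c.Proxy_type = "socks5"
twint.run.Search(c)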

Environment Details

schults commented 4 years ago

Hello @icmpnorequest, I was hitting this same error and found code from someone here in the community! It works like a charm; you can scrape all day/night.

import pandas as pd
import twint
from datetime import datetime, timedelta
from time import sleep
import os

query = 'Words OR To OR Search OR here'  # replace with your own search terms
start_str = "2020-04-01"
end_str = "2020-06-25"
start_date = pd.to_datetime(start_str, format='%Y-%m-%d', errors='ignore')
end_date = pd.to_datetime(end_str, format='%Y-%m-%d', errors='ignore')
data_folder = "/Path/To/Save/"
filename = f"{data_folder}collect_tweets_{start_str}_{end_str}.txt"
resume_file = f"{data_folder}resume.txt"

c = twint.Config()
c.Verified = True           # only tweets from verified accounts
c.Retweets = False          # don't fetch retweets
c.Filter_retweets = False   # don't filter retweets out of the results
c.Hide_output = False       # still print tweets to the console
c.Output = filename         # append results to this file
c.Resume = resume_file      # scroll position, so an interrupted day can resume
c.Search = query            # the search query defined above
c.Lang = 'en'               # English tweets only
c.Links = "exclude"         # skip tweets containing links
#c.Custom["tweet"] = ["tweet"]
c.Format = "{tweet}"        # write only the tweet text

while start_date < end_date:

    check = 0
    c.Since = start_date.strftime('%Y-%m-%d')
    c.Until = (start_date + timedelta(days=1)).strftime('%Y-%m-%d')

    while check < 1:
        try:
            print("Running Search: Check ", start_date)
            twint.run.Search(c)
            check += 1

        except Exception as e:
            # pause when Twitter blocks further scraping, then retry the same day
            print(e, "Sleeping for 7 mins")
            print("Check: ", check)
            sleep(420)

    # before iterating to the next day, remove the resume file
    # (guarded: twint may not have created it if the day returned nothing)
    if os.path.exists(resume_file):
        os.remove(resume_file)

    # increment the start date by one day
    start_date = start_date + timedelta(days=1)
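Since the original question was about a single user's timeline rather than a keyword search, the same day-by-day loop should work by swapping the search query for a username — a sketch of just the config change, reusing the script above:

# for a user timeline instead of a keyword search, replace `c.Search = query` with:
c.Username = "realDonaldTrump"  # user from the original report

Chunking the scrape into one-day windows keeps each session short, the resume file lets an interrupted day pick up where it left off, and the 7-minute sleep backs off whenever Twitter blocks the scraper.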