vladkens / twscrape

2024! Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters), and more.
https://pypi.org/project/twscrape/
MIT License

HTTPX Connection Error #195

Open ErSauravAdhikari opened 1 month ago

ErSauravAdhikari commented 1 month ago

I have been consistently receiving an HTTPX connection error while trying to scrape a long timeline.

[2024-05-27 18:29:19,900: ERROR/ForkPoolWorker-9] Task apps.twa.tasks.start_custom_query_processing[847bd934-5425-4c18-919f-951cf67d00ac] raised unexpected: ConnectError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 69, in map_httpcore_exceptions
    yield
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 373, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
    raise exc from None
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
    response = await connection.handle_async_request(
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http_proxy.py", line 317, in handle_async_request
    stream = await stream.start_tls(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_async/http11.py", line 383, in start_tls
    return await self._stream.start_tls(ssl_context, server_hostname, timeout)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 68, in start_tls
    with map_exceptions(exc_map):
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
  File "/app/apps/twa/tasks.py", line 120, in start_custom_query_processing
    raise e
  File "/app/apps/twa/tasks.py", line 115, in start_custom_query_processing
    loop.run_until_complete(run_scraper())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/app/apps/twa/tasks.py", line 108, in run_scraper
    await scraper.save_tweets_to_db()
  File "/app/logic/scraper/base_tweet_scraper.py", line 91, in save_tweets_to_db
    async for tweet in tweets_gen:
  File "/app/logic/scraper/base_tweet_scraper.py", line 77, in fetch_ticker_tweets
    async for tweet in self.api.search(query):
  File "/usr/local/lib/python3.10/site-packages/twscrape/api.py", line 156, in search
    async for rep in gen:
  File "/usr/local/lib/python3.10/site-packages/twscrape/api.py", line 151, in search_raw
    async for x in gen:
  File "/usr/local/lib/python3.10/site-packages/twscrape/api.py", line 117, in _gql_items
    rep = await client.get(f"{GQL_URL}/{op}", params=encode_params(params))
  File "/usr/local/lib/python3.10/site-packages/twscrape/queue_client.py", line 202, in get
    return await self.req("GET", url, params=params)
  File "/usr/local/lib/python3.10/site-packages/twscrape/queue_client.py", line 233, in req
    raise e
  File "/usr/local/lib/python3.10/site-packages/twscrape/queue_client.py", line 213, in req
    rep = await ctx.clt.request(method, url, params=params)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1574, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1661, in send
    response = await self._send_handling_auth(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1689, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1726, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1763, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 372, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/httpx/_transports/default.py", line 86, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError
ErSauravAdhikari commented 1 month ago

[Screenshot attached: 2024-05-28, 9:31 AM]

Can we make it so that, instead of crashing, the library retries the same request with a different account?

ErSauravAdhikari commented 1 month ago

What might have happened is that, in a pool of accounts, each account is linked to a proxy, and some proxies may temporarily stop working.

Since the proxy will work again at a later time, we could ignore the affected account for a while and retry with a different account.

ErSauravAdhikari commented 1 month ago

I can add a retry mechanism client-side, but I won't have access to which account was being used.

ErSauravAdhikari commented 1 month ago

If I did, I could add retry logic that changes the proxy by assigning a failover proxy for that account.
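Even without knowing which account twscrape picked, a client-side retry can restart the search generator on transient connection errors. Below is a minimal generic sketch; `consume_with_retry` and its parameters are invented for illustration, and the twscrape usage in the comment assumes `api.search(query)` as seen in the traceback above. Note the caveat: restarting loses the pagination cursor, so tweets yielded before the failure may be seen again and should be deduplicated by id.

```python
import asyncio


async def consume_with_retry(make_gen, max_retries=3, base_delay=1.0,
                             retry_on=(ConnectionError,)):
    """Consume an async generator produced by make_gen(), restarting it
    from scratch when a transient connection error is raised.

    make_gen must be a zero-argument callable returning a fresh async
    generator, so each retry starts a new request.
    """
    for attempt in range(max_retries + 1):
        try:
            async for item in make_gen():
                yield item
            return  # generator finished cleanly, stop retrying
        except retry_on:
            if attempt == max_retries:
                raise  # exhausted retries, surface the original error
            # exponential backoff before restarting the search
            await asyncio.sleep(base_delay * 2 ** attempt)


# Hypothetical usage with twscrape (api is a twscrape.API instance):
#
#     import httpx
#     async for tweet in consume_with_retry(lambda: api.search(query),
#                                           retry_on=(httpx.ConnectError,)):
#         ...  # dedupe by tweet.id, since a restart replays earlier results
```

This works around the crash, but a fix inside twscrape would be cleaner, since only the library knows which account (and thus which proxy) failed and can rotate to another one mid-pagination.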