minimaxir / download-tweets-ai-text-gen

Python script to download public Tweets from a given Twitter account into a format suitable for AI text generation.
MIT License
219 stars 41 forks

Super Slow Download Speeds #30

Open commotum opened 4 years ago

commotum commented 4 years ago

I've noticed super slow download speeds and had problems with duplicates being downloaded. I'm running it right now with a .txt file of several users and it's been going for 8 hours already. Is there something with twint or twitter that could be causing this slowdown?

sdelgadoc commented 4 years ago

Some folks have been sharing their performance, and according to this issue and this issue, people are collecting 1 to 2 tweets per second.

Is that the rate that you are seeing?

commotum commented 4 years ago

I'm getting a tweet every 1.56 seconds

commotum commented 4 years ago

Oldest Tweet: 2008-09-15 17:25:20: : 18440it [7:59:01, 1.56s/it]

commotum commented 4 years ago

In addition it's downloading duplicates for most tweets, so the process takes twice as long even at that rate. For example @dril has around 9,000 tweets, but I get over 18,000 tweets downloaded:

0it [00:00, ?it/s]Retrieving tweets for @dril...
Oldest Tweet: 2008-09-15 17:25:20: : 18440it [7:59:01, 1.56s/it]
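For anyone comparing rates across issues, note that the tqdm counter above reports seconds per item, not items per second. A minimal sketch of the conversion follows; the helper name `parse_tqdm_rate` is made up for illustration and is not part of this repo or of tqdm.

```python
import re


def parse_tqdm_rate(line):
    """Return (count, elapsed_seconds, seconds_per_item) from a tqdm-style line.

    Assumes the elapsed time is printed as H:MM:SS, as in the log above.
    Runs shorter than an hour are printed as MM:SS and would need another pattern.
    """
    match = re.search(r"(\d+)it \[(\d+):(\d{2}):(\d{2}),", line)
    if match is None:
        raise ValueError("no tqdm progress counter found")
    count, hours, minutes, seconds = (int(group) for group in match.groups())
    elapsed = hours * 3600 + minutes * 60 + seconds
    return count, elapsed, elapsed / count


line = "Oldest Tweet: 2008-09-15 17:25:20: : 18440it [7:59:01, 1.56s/it]"
count, elapsed, sec_per_tweet = parse_tqdm_rate(line)
print(f"{sec_per_tweet:.2f} s per tweet, {count / elapsed:.2f} tweets per second")
```

For the line quoted above this works out to about 1.56 seconds per tweet, which is roughly 0.64 tweets per second.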

sdelgadoc commented 4 years ago

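Downloading one tweet every 1.6 seconds (roughly 0.6 tweets per second) is a bit slower than the 1 to 2 tweets per second others have reported, but it seems to be broadly in line with what everyone is experiencing.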

However, you shouldn't be getting duplicates. How are you counting tweets?

In this issue the user thought they were collecting more tweets than expected, but that was because they were counting lines in the text file rather than tweets. A single tweet can span multiple lines in the file.
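The difference between counting lines and counting tweets can be checked with a short sketch like the one below. It assumes the output is a single-column CSV with a header row and quoted multi-line fields; the sample data is invented, so adjust it to whatever your file actually looks like.

```python
import csv
import io

# Hypothetical sample: a header plus three tweets, one of which spans two lines.
sample = (
    "tweets\n"
    '"first tweet"\n'
    '"second tweet,\nwith a line break"\n'
    '"third tweet"\n'
)

# Counting physical lines overstates the number of tweets.
line_count = len(sample.splitlines()) - 1  # minus the header line

# Counting CSV records treats a quoted multi-line field as one tweet.
rows = list(csv.reader(io.StringIO(sample)))
tweet_count = len(rows) - 1  # minus the header row

print(f"{line_count} lines, {tweet_count} tweets")
```

With a real file, open it with `newline=""` and pass the file handle to `csv.reader` instead of the in-memory string.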

commotum commented 4 years ago

I opened the text file after download and it had duplicates running in series of about 5. So I would see 5 original tweets in a row and then the next five would be the exact same as the previous.
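Until the cause is found, repeated blocks like the ones described above can be removed after the download. This is only a sketch of an order-preserving de-duplication step, not something the script does itself; `drop_duplicates` is a hypothetical helper, and it would also drop a tweet the account genuinely posted twice with identical text.

```python
def drop_duplicates(tweets):
    """Return tweets with repeats removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text in tweets:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Invented example: a block of three tweets that is downloaded twice in a row.
downloaded = ["a", "b", "c", "a", "b", "c", "d"]
unique = drop_duplicates(downloaded)
print(f"{len(downloaded)} downloaded, {len(unique)} unique")
```

Comparing the two counts is also a quick way to confirm whether a file really contains duplicates.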

sdelgadoc commented 4 years ago

Ok, it sounds like something is going on.

It is hard for me to post fixes to this repo, so I'm going to ask you to try to reproduce the duplicate tweet issue using the code in the repo below. It has all the latest fixes in the unmerged pull request of this repo, and multi username functionality, which was removed from this repo.

https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus

If you can reproduce it using this new repo, please let me know what Twitter username you used, so I can debug it on my side.

commotum commented 4 years ago

I wasn't able to get the new repo to run, and had this error:

jake@eve:~/Desktop/download_tweets$ python3 download_tweets.py humour.txt
0it [00:00, ?it/s]Retrieving tweets for @dril...
Oldest Tweet: 2009-12-02 02:15:13: : 18320it [7:50:25, 1.54s/it]
Traceback (most recent call last):
  File "/home/jake/.local/lib/python3.8/site-packages/twint/get.py", line 162, in Response
    async with session.get(url, ssl=False, params=params, proxy=httpproxy) as response:
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/client.py", line 1012, in __aenter__
    self._resp = await self._coro
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/client.py", line 480, in _request
    conn = await self._connector.connect(
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/connector.py", line 523, in connect
    proto = await self._create_connection(req, traces, timeout)
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/connector.py", line 858, in _create_connection
    _, proto = await self._create_direct_connection(
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/connector.py", line 980, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
  File "/home/jake/.local/lib/python3.8/site-packages/aiohttp/connector.py", line 936, in _wrap_create_connection
    return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
  File "/usr/lib/python3.8/asyncio/base_events.py", line 1010, in create_connection
    sock = await self._connect_sock(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 924, in _connect_sock
    await self.sock_connect(sock, address)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 494, in sock_connect
    return await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_tweets.py", line 201, in <module>
    fire.Fire(download_tweets)
  File "/home/jake/.local/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/jake/.local/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/jake/.local/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "download_tweets.py", line 85, in download_tweets
    tweets = download_account_tweets(username, limit, include_replies, strip_usertags, strip_hashtags, include_links)
  File "download_tweets.py", line 153, in download_account_tweets
    twint.run.Search(c)
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 292, in Search
    run(config, callback)
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 213, in run
    get_event_loop().run_until_complete(Twint(config).main(callback))
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 154, in main
    await task
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 198, in run
    await self.tweets()
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 137, in tweets
    await self.Feed()
  File "/home/jake/.local/lib/python3.8/site-packages/twint/run.py", line 57, in Feed
    response = await get.RequestUrl(self.config, self.init, headers=[("User-Agent", self.user_agent)])
  File "/home/jake/.local/lib/python3.8/site-packages/twint/get.py", line 107, in RequestUrl
    response = await Request(_url, params=params, connector=_connector, headers=headers)
  File "/home/jake/.local/lib/python3.8/site-packages/twint/get.py", line 157, in Request
    return await Response(session, url, params)
  File "/home/jake/.local/lib/python3.8/site-packages/twint/get.py", line 163, in Response
    return await response.text()
  File "/home/jake/.local/lib/python3.8/site-packages/async_timeout/__init__.py", line 45, in __exit__
    self._do_exit(exc_type)
  File "/home/jake/.local/lib/python3.8/site-packages/async_timeout/__init__.py", line 92, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
Oldest Tweet: 2009-12-02 02:15:13: : 18320it [7:52:25, 1.55s/it]

sdelgadoc commented 4 years ago

Thanks for sharing the error. It doesn't look like you're doing anything strange. You're running the script on a text file called humour.txt, in which the first Twitter username is @dril, and you were able to collect 18,320 tweets.

The error appears to be somewhere in the asynchronous request code of the twint library (the traceback ends in an asyncio TimeoutError), which is hard for me to debug.

Let me run the same code and I'll see if I can recreate the issue.

sdelgadoc commented 4 years ago

I ran the script twice and wasn't able to reproduce the issue. It's worth sharing that I wasn't able to collect as many tweets as you did either. I collected 14,400 tweets and my oldest tweet was from 2012-12-13 16:12:19, while it looks like you collected 18,320 and went all the way back to 2009.

Twitter appears to behave differently for different people.

So, although I don't have a fix for you, I have a workaround so you can at least save tweets as they are being collected. If you hit the same error, you are at least left with close to 20,000 tweets, which should be a good number to train the model.

Clone the development version of the repo I referenced previously by doing the following:

git clone -b development https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus.git

The development version saves the tweets to file every 40 tweets, rather than only at the end. It also has a bunch of cool new features for adding sentiment data and reply formatting, which you can take a look at in the README here:

https://github.com/sdelgadoc/download-tweets-ai-text-gen-plus/tree/development
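The idea of saving every 40 tweets instead of once at the end can be sketched roughly as follows. This is not the code from the development branch; `save_in_batches` and the file layout (a single-column CSV) are assumptions made for illustration.

```python
import csv
import os
import tempfile


def save_in_batches(tweets, path, batch_size=40):
    """Append tweets to a CSV file every `batch_size` items; return the number of writes."""
    batch, writes = [], 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for text in tweets:
            batch.append([text])
            if len(batch) >= batch_size:
                writer.writerows(batch)
                handle.flush()  # push the batch to disk so a later crash keeps it
                batch.clear()
                writes += 1
        if batch:  # write whatever is left over at the end
            writer.writerows(batch)
            writes += 1
    return writes


# Stand-in for a slow download: a generator of 100 made-up tweets.
fake_tweets = (f"tweet {i}" for i in range(100))
out_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
writes = save_in_batches(fake_tweets, out_path)

with open(out_path, newline="", encoding="utf-8") as handle:
    saved = [row[0] for row in csv.reader(handle)]
print(f"{writes} writes, {len(saved)} tweets on disk")
```

If the download dies partway through, everything up to the last full batch is already in the file, which is the point of the workaround.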

Install two additional libraries by doing the following:

pip3 install -r requirements.txt

Run the download_tweets.py script just as you did before, and let me know how it goes.