robertoszek / pleroma-bot

Bot for mirroring one or multiple Twitter accounts in Pleroma/Mastodon/Misskey.
https://robertoszek.github.io/pleroma-bot
MIT License

allow users to gather tweets from a shared user timeline #99

Open us3r1d opened 1 year ago

us3r1d commented 1 year ago

OK, this one may be problematic. Feel free to tell me if it isn't workable. :-)

I'm hoping to minimize latency between the time a tweet is posted and the time it's mirrored into a fediverse post, so I'm hitting the 300 requests per 15 minutes rate limiter pretty often. I'm guessing this is because the bot makes a separate tweets request to Twitter for each configured account; I have 45 in there at the moment and would like to be able to support way more than that.

(This is for https://twitter.oksocial.net/about; that page describes the service.)

I'm pretty sure Twitter's API lets you give it a list of accounts to pull tweets from on each request, rather than just a single account, so it should be possible for the bot to batch all the accounts for which it does not have specific login info into a single request?

I suspect that's much more complicated than adding the bot attribute was, but thanks for considering it. :-)

-robin

robertoszek commented 1 year ago

It is indeed quite a drastic change that would need a bit of work and time to integrate nicely.

We could optimize the gathering of tweets a little more, but even then, once you scale up to a certain number of accounts you'll eventually run into Twitter's API rate limits again.

I'm wondering if in the meantime you could mitigate it somewhat by using RSS feeds as the account source for tweets instead of Twitter's API: https://github.com/robertoszek/pleroma-bot/blob/develop/docs/gettingstarted/usage.md#using-an-rss-feed

This feature is in the rc release but hasn't been rolled into stable yet. It has some limitations, though: you won't be able to mirror polls or know which tweets are pinned, and it won't work for accounts that have their tweets protected.

us3r1d commented 1 year ago

This project won't be connecting to any protected accounts anyway, and polls may not even matter since there's no way for that data to get back onto Twitter, so those are probably no biggie; I'll look at the RSS option.

If the API still supports batching, that would obviously be more desirable in the long term. I haven't seen their API directly in something like 10 years though, so my ideas on how it works are way out of date.

(The temporary solution I looked at first is from my other issue today: running multiple bot instances connected to different API apps. It looks like you've already done that one, so yay! :-)

us3r1d commented 1 year ago

Ugh; it looks like the standard API doesn't have a way to make a batch request.

Search might work, but a simpler approach would be:

a) have a global setting in the bot config naming a user whose timeline should be scraped for tweets before processing the configured users (since "fetch a user timeline" is a single query); the scraped tweets would then be broken out by who tweeted them, keeping only those whose author matches a user in the bot config

b) have a per-user setting in the bot config that says whether to fetch this user's tweets separately or to use tweets from the globally configured timeline

That way I could set up a single Twitter account that follows all the accounts I want batched.

It would also maybe minimize the impact on the bot's processing path; it makes me do the work of setting up a timeline to scrape, so the optimization workload is on me instead of you. :-)
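
A minimal sketch of how settings (a) and (b) could fit together, assuming the shared timeline comes back as a list of dictionaries with an `author_username` key. That key, the function name, and the `use_shared_timeline` flag are hypothetical and are not taken from pleroma-bot's code.

```python
from collections import defaultdict


def split_shared_timeline(timeline_tweets, users):
    """Bucket tweets from one shared timeline by the configured user who wrote them.

    `timeline_tweets` is a list of dicts with a hypothetical "author_username"
    key; `users` is the list of per-user config mappings.
    """
    # Only users that opted in (setting b) are served from the shared timeline.
    shared = {
        u["twitter_username"].lower()
        for u in users
        if u.get("use_shared_timeline", False)
    }
    buckets = defaultdict(list)
    for tweet in timeline_tweets:
        author = tweet["author_username"].lower()
        # Tweets from accounts that are not configured are simply dropped.
        if author in shared:
            buckets[author].append(tweet)
    # Everyone else still needs an individual per-account request.
    separate = [
        u["twitter_username"]
        for u in users
        if u["twitter_username"].lower() not in shared
    ]
    return dict(buckets), separate
```

With this split, one timeline request covers every opted-in account, and only the accounts listed in `separate` cost an extra request each.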

robertoszek commented 1 year ago

Oh, I forgot to mention if you're gonna try the RSS feature maybe do so on the latest rc version (1.1.1rc29). It includes some improvements to it and multithreading when processing the tweets present on the RSS feed.

robertoszek commented 1 year ago

Search might work, but a simpler approach would be: [...]

Hmmm, I'm a little torn about this. On one hand it would help relieve the request load on Twitter's API, but on the other hand I feel like this would further overcomplicate an already confusing and hard to understand bot and config (my bad on that front 😅).

What do you envision a config looking like with the timeline approach you suggest? Something along the lines of this?

pleroma_base_url: https://pleroma.instance
max_tweets: 40
timeline_user: TwitterUserFollowingAccounts
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  use_timeline: true
- twitter_username: User2
  pleroma_username: MyPleromaUser2
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
us3r1d commented 1 year ago

Yeah, that's pretty much what I was thinking. "shared_timeline_user" and "use_shared_timeline" might be more clear?

robertoszek commented 1 year ago

Excellent. Yeah, the names are subject to change, just wanted to make sure I understood what you were going for.

On a related note, I've also been experimenting with Guest Tokens as another way of getting around Twitter's API rate limits (and for people who don't want to apply for a dev account): 74def6f19c182cf4a76f835311662edcf8d94a9f

If you have no twitter_token in your config or set the guest mapping to true (globally or per-user), it will generate a Guest Token for every user:

pleroma_base_url: https://pleroma.instance
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  guest: true # <---

It's limited to 20 tweets (or I haven't figured out how to force it to paginate with the cursor yet). But if you're looking to decrease the latency between tweet and mirroring, maybe it's worth a look, as I haven't run into rate limits no matter how many users I put in the config (it generates a fresh token for each one).
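
The rule for when guest tokens kick in (no twitter_token, or guest set to true globally or per-user) can be sketched roughly as below. The function name and the precedence of the per-user value over the global one are assumptions for illustration, not the bot's actual implementation.

```python
def uses_guest_token(config, user):
    """Decide whether a user should be processed with a guest token.

    `config` is the top-level mapping, `user` one entry of its `users` list.
    """
    # Assumed precedence: a per-user `guest` value overrides the global one.
    guest = user.get("guest", config.get("guest", False))
    # Without any app token there is nothing else to authenticate with.
    has_token = bool(user.get("twitter_token") or config.get("twitter_token"))
    return bool(guest) or not has_token
```

For the config above, User1 would use a guest token even though a twitter_token is present, because `guest: true` is set on that user.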

You can try it for yourself by installing 1.1.1rc30:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc30

Fair warning, it may be broken in a few different ways I haven't found yet.

us3r1d commented 1 year ago

If that works, it should do the trick.

I converted my config to guest and ran with no rate-limiter hits, though I do seem to have gotten re-posts of recent tweets on some accounts. For example:

https://twitter.oksocial.net/loresjoberg
https://twitter.oksocial.net/HAL9000_

(In total, it looks like maybe 10 out of 46 accounts ended up with a re-post.)

us3r1d commented 1 year ago

That did not go well. :-)

I'm running with a script that rebuilds the bot config files, runs the bot then sleeps 5 minutes; in that setup, it ran one pass successfully as guest, then all subsequent runs got this for all accounts:

Error log

```shell
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - ======================================
INFO:pleroma_bot:======================================
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - Processing user: adamconover
INFO:pleroma_bot:Processing user: adamconover
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.twitter.com:443
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "POST /1.1/guest/activate.json HTTP/1.1" 429 69
✖ 2022-11-28 10:00:35,775 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:700)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
    user = User(user_item, config, base_path, posts_ids)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
    guest_token, headers = self._get_guest_token_header()
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
    guest_token = json_resp['guest_token']
KeyError: 'guest_token'
ERROR:pleroma_bot:Exception occurred for user, skipping...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
    user = User(user_item, config, base_path, posts_ids)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
    guest_token, headers = self._get_guest_token_header()
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
    guest_token = json_resp['guest_token']
KeyError: 'guest_token'
```

robertoszek commented 1 year ago

Looks like it hit a 429 when requesting a guest token:

DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "POST /1.1/guest/activate.json HTTP/1.1" 429 69
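
The KeyError in the log is what you get when the 429 body is read as if it were a successful response. A minimal sketch of a more defensive check, with a hypothetical helper and exception name that are not part of pleroma-bot:

```python
class GuestTokenError(RuntimeError):
    """Raised when a guest token cannot be obtained (hypothetical helper type)."""


def extract_guest_token(status_code, payload):
    """Return the guest token from a decoded JSON response, or raise a clear error.

    `status_code` is the HTTP status and `payload` the decoded JSON body (a dict).
    """
    if status_code == 429:
        raise GuestTokenError("rate limited while requesting a guest token (HTTP 429)")
    token = payload.get("guest_token")
    if token is None:
        raise GuestTokenError(
            "no guest_token in response (HTTP {})".format(status_code)
        )
    return token
```

Failing with an explicit message makes it obvious in the log that the cause was the rate limiter rather than a malformed response.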

How many users would you say you run it with in the span of 15min? I may try to replicate it on my side too.

us3r1d commented 1 year ago

There are 46 accounts on it at the moment, so that'd presumably be 92 to 138 attempts depending on how the timing goes?
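
The 92 to 138 estimate follows from two or three bot runs landing inside one 15-minute window (a 5-minute sleep plus however long each run takes), with one guest token request per account per run. As a small worked example, with a hypothetical helper name:

```python
def token_requests_per_window(accounts, runs_per_window):
    """Guest token requests in one rate-limit window, one request per account per run."""
    return accounts * runs_per_window


accounts = 46
# Two or three runs fit in a 15-minute window, depending on how long each run takes.
low = token_requests_per_window(accounts, 2)
high = token_requests_per_window(accounts, 3)
print(low, high)
```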

robertoszek commented 1 year ago

After some testing, if I randomize the user agent slightly I'm getting 1000 requests for a new guest token before getting rate limited.

In addition to that, I've also added retrying with proxies once you've hit a 429. This really only helps when using guest tokens (with an app token your request count goes up no matter what the source IP happens to be): https://github.com/robertoszek/pleroma-bot/commit/7f062d7a3b00ce1f096887665ec1868eb417b522

They are configurable with the proxy_pool mapping; if it's not present, some free proxies will be used instead (and you can disable proxying completely by setting proxy to false):

proxy_pool:
- 128.199.221.6:443
- 164.62.72.90:80
- 178.128.121.196:443
pleroma_base_url: https://pleroma.instance
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- twitter_username: User2
  pleroma_username: MyPleromaUser2
  proxy: false
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
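
A rough sketch of how those three cases (explicit proxy_pool, fallback to public proxies, proxy disabled) could be resolved. The function name, the `public_proxies` argument, and the per-user-over-global precedence are assumptions for illustration only.

```python
def resolve_proxies(config, user, public_proxies):
    """Return the proxies to retry with after a 429, or an empty list if disabled.

    `public_proxies` stands in for whatever list of free proxies the bot would
    otherwise fall back to; it is passed in explicitly to keep this self-contained.
    """
    # Assumed precedence: a per-user `proxy` value overrides the global one.
    enabled = user.get("proxy", config.get("proxy", True))
    if not enabled:
        return []
    pool = config.get("proxy_pool")
    if pool:
        # An explicit pool means only these proxies are used.
        return list(pool)
    return list(public_proxies)
```

With the config above, the first user would retry through the three listed proxies, while the user with `proxy: false` would not retry through a proxy at all.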

Hopefully that helps alleviate your rate limit issue a bit. These changes are included in 1.1.1rc35:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc35

us3r1d commented 1 year ago

Thanks; I'll give that rc a try.

robertoszek commented 1 year ago

Oh and by the way, if I had to guess, the re-posts were probably due to some timestamps not being converted correctly to UTC, so timezones were probably wrongly offsetting the start and now dates: https://github.com/robertoszek/pleroma-bot/commit/99577088554e3a499a55aa9c3110466e5e5999d6

This change is included in 1.1.1rc37:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc37

us3r1d commented 1 year ago

That didn't work; I left it running unattended for an hour and it doesn't seem to have hit the rate limiter but also didn't post anything. When I pulled guest:true out of the config it caught up with what it had missed.

This was with rc35, so I'll re-try in a bit with rc37.

Thanks.

robertoszek commented 1 year ago

Timezones are fun. I had to force it to UTC, otherwise it would use the local timezone when parsing the start date into a UTC epoch timestamp: https://github.com/robertoszek/pleroma-bot/commit/62502f1abdf66a5eb2987f386e7fa9b9b4640ac8
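
To illustrate the kind of pitfall described here (this is a generic sketch, not the code from the linked commit): parsing a naive date string and converting it with a local-time function shifts the epoch by the machine's UTC offset, while attaching UTC explicitly does not. The function names and the date format are assumptions.

```python
import time
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def to_epoch_local(date_str, fmt=FMT):
    """Interpret the string in the machine's local timezone (the problematic case)."""
    # time.mktime treats the parsed struct_time as local time.
    return int(time.mktime(time.strptime(date_str, fmt)))


def to_epoch_utc(date_str, fmt=FMT):
    """Interpret the string as UTC regardless of the machine's timezone."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


start = "2022-11-29 00:00:00"
# The difference is the local UTC offset in seconds; it is 0 only on a UTC machine.
print(to_epoch_local(start) - to_epoch_utc(start))
```

On a server that is not set to UTC, the two values differ, which is enough to make a "fetch tweets since" window start too early (re-posts) or too late (missed tweets).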

If it still happens on 1.1.1rc38 let me know.

us3r1d commented 1 year ago

rc38 is doing better; it doesn't seem to be missing tweets.

I see it doing the rollover to public proxies:

⚠ 2022-11-29 09:07:13,423 - pleroma_bot - WARNING - Rate limit exceeded when getting guest token. Retrying with a proxy. (_utils.py:1095)

That's a neat feature, but for my project I'm not happy about depending on someone else's proxy; I wouldn't want to cause anyone else trouble. That's my problem to deal with, though. :-)

This seems to be viable for running every 5 minutes at the moment.

I do think that batching tweets from a user timeline is a better strategy in the long run, but this fix is working for now.

Thanks.

robertoszek commented 1 year ago

For sure, this was meant just as a stopgap for your use case, because batching and user timelines will take me a while to implement. (And I also happened to be investigating guest tokens anyway for people who would rather not apply for a dev account.)

I still agree the timeline approach is something we want to pursue and would be a nice option when using the bot. I'll change the title of the issue to reflect that if you're ok with that.

Oh, just a last remark. If you happen to have access to or run private proxies, putting them into the proxy_pool mapping will force the bot to only make use of the ones listed there, if you don't want to rely on the public ones.

us3r1d commented 1 year ago

It did just crash out with this error:

Error log

```shell
✖ 2022-11-29 12:57:33,853 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:707)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
    _get_rt_media_url(self, tweet, media)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
    tweet_rt = self._get_tweets("v2", tweet_id)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
    tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
    params=params
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
    "Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
    tweets, user, threads
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
    p.imap_unordered(user.process_tweets, tweets_chunked)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
TypeError: 'list' object is not callable
ERROR:pleroma_bot:Exception occurred for user, skipping...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
    _get_rt_media_url(self, tweet, media)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
    tweet_rt = self._get_tweets("v2", tweet_id)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
    tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
    params=params
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
    "Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
    tweets, user, threads
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
    p.imap_unordered(user.process_tweets, tweets_chunked)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
TypeError: 'list' object is not callable
```

(Occasional crashes don't bother me much, but I figured you'd like to know.)

robertoszek commented 1 year ago

File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
"Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable

Ah, of course, the requests using the guest tokens don't contain the same rate limiting headers as the proper API when hitting a 429 (for whatever reason). I changed the structure around a bit to account for that: https://github.com/robertoszek/pleroma-bot/commit/db104cd3dc39abdb7110a105d865b3b86e1961b5
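
A minimal sketch of building the rate limit message without assuming the headers are there. The `x-rate-limit-limit` and `x-rate-limit-reset` names are the ones Twitter's regular API returns; the function itself is hypothetical and is not the code from the linked commit.

```python
from datetime import datetime, timezone


def describe_rate_limit(headers):
    """Build a log message for a 429, tolerating missing rate limit headers.

    `headers` is a plain dict of lowercase header names to string values.
    """
    limit = headers.get("x-rate-limit-limit")
    reset = headers.get("x-rate-limit-reset")
    if limit is None or reset is None:
        # Guest token responses may not carry these headers at all.
        return "Rate limit exceeded (no rate limit headers in the response)"
    # The reset header is an epoch timestamp in seconds.
    reset_time = datetime.fromtimestamp(int(reset), tz=timezone.utc)
    return "Rate limit exceeded. 0 out of {} requests remaining until {}".format(
        limit, reset_time.isoformat()
    )
```

Note that header objects from HTTP libraries are usually case-insensitive; with a plain dict as used here, the keys have to be lowercased first.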

It should be included in 1.1.1rc40.