robertoszek / pleroma-bot

Bot for mirroring one or multiple Twitter accounts in Pleroma/Mastodon/Misskey.
https://robertoszek.github.io/pleroma-bot
MIT License

Some tweets have their links or media skipped (unified cards) #79

Open nemobis opened 2 years ago

nemobis commented 2 years ago

Some fancy accounts seem to be using some Twitter feature which pleroma-bot doesn't support yet.

This typically shows up in tweets which follow the trend of containing a mere "↓" as a warning that the main content of the update is actually somewhere else, like this one: https://respublicae.eu/@EU_Commission/108092396818818757 https://nitter.eu/EU_Commission/status/1512123194785898503 which is just a link to https://ec.europa.eu/commission/presscorner/detail/en/statement_22_2331 . These tweets look like any other tweet whose main URL has been "eaten" by Twitter and shown only as an attached "card", but they seem to be different.

Others are more complicated like https://respublicae.eu/@EU_Commission/108103776666586079 https://nitter.eu/EU_Commission/status/1512777762909655043 which contains a "broadcast": https://nitter.eu/i/broadcasts/1BRJjnyZoZdJw . I guess there isn't much to do about these, other than documenting it somewhere so that people make informed decisions about the nitter and signature configs.

robertoszek commented 2 years ago

Yeah, Twitter v2 API's response for the example tweet you provided (1512123194785898503) doesn't seem to include the link anywhere (even with all the expansions set):

{
   "data":[
      {
         "conversation_id":"1512123194785898503",
         "text":"President @vonderleyen has visited Stockholm to give the green light to Sweden's €3.3 billion recovery and resilience plan.\n\nSweden is a renewable energy pioneer. \n\nRenewables are bound to make up half of the country's energy mix by the end of the decade. ↓\n\n#NextGenerationEU",
         "lang":"en",
         "entities":{
            "mentions":[
               {
                  "start":10,
                  "end":22,
                  "username":"vonderleyen",
                  "id":"1146329871418843136"
               }
            ],
            "hashtags":[
               {
                  "start":259,
                  "end":276,
                  "tag":"NextGenerationEU"
               }
            ],
            "annotations":[
               {
                  "start":35,
                  "end":43,
                  "probability":0.9802,
                  "type":"Place",
                  "normalized_text":"Stockholm"
               },
               {
                  "start":72,
                  "end":77,
                  "probability":0.9972,
                  "type":"Place",
                  "normalized_text":"Sweden"
               },
               {
                  "start":125,
                  "end":130,
                  "probability":0.9456,
                  "type":"Place",
                  "normalized_text":"Sweden"
               }
            ]
         },
         "public_metrics":{
            "retweet_count":59,
            "reply_count":25,
            "like_count":197,
            "quote_count":2
         },
         "created_at":"2022-04-07T17:40:38.000Z",
         "possibly_sensitive":false,
         "id":"1512123194785898503",
         "source":"Twitter for Advertisers.",
         "author_id":"157981564",
         "context_annotations":[
            {
               "domain":{
                  "id":"10",
                  "name":"Person",
                  "description":"Named people in the world like Nelson Mandela"
               },
               "entity":{
                  "id":"1151432219002454016",
                  "name":"Ursula von der Leyen",
                  "description":"President of European Commission"
               }
            },
            {
               "domain":{
                  "id":"35",
                  "name":"Politician",
                  "description":"Politicians in the world, like Joe Biden"
               },
               "entity":{
                  "id":"1151432219002454016",
                  "name":"Ursula von der Leyen",
                  "description":"President of European Commission"
               }
            },
            {
               "domain":{
                  "id":"30",
                  "name":"Entities [Entity Service]",
                  "description":"Entity Service top level domain, every item that is in Entity Service should be in this domain"
               },
               "entity":{
                  "id":"848920371311001600",
                  "name":"Technology",
                  "description":"Technology and computing"
               }
            },
            {
               "domain":{
                  "id":"30",
                  "name":"Entities [Entity Service]",
                  "description":"Entity Service top level domain, every item that is in Entity Service should be in this domain"
               },
               "entity":{
                  "id":"848920371311001600",
                  "name":"Technology",
                  "description":"Technology and computing"
               }
            },
            {
               "domain":{
                  "id":"30",
                  "name":"Entities [Entity Service]",
                  "description":"Entity Service top level domain, every item that is in Entity Service should be in this domain"
               },
               "entity":{
                  "id":"898654185146560512",
                  "name":"Energy Technology",
                  "description":"Energy Technology"
               }
            }
         ]
      }
   ],
   "includes":{
      "users":[
         {
            "id":"157981564",
            "name":"European Commission 🇪🇺",
            "username":"EU_Commission"
         },
         {
            "id":"1146329871418843136",
            "name":"Ursula von der Leyen",
            "username":"vonderleyen"
         }
      ],
      "tweets":[

      ],
      "media":[

      ],
      "polls":[

      ]
   },
   "meta":{
      "result_count":1
   }
}

It looks like the only way to obtain info about the cards is using the Twitter Ads API: https://developer.twitter.com/en/docs/twitter-ads-api/creatives/guides/identifying-cards

And that would require applying for and creating an additional Twitter Ads API application (with a separate token, etc.) 😖
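For anyone wanting to check their own tweets for the same problem, here is a minimal sketch that scans a v2 response for URL entities. The helper name `find_urls` is made up for illustration, and the payload is a trimmed copy of the response above; in a regular v2 response links would normally be listed under `entities.urls`, which is exactly the key that is absent here.

```python
def find_urls(payload):
    """Collect every URL listed under entities.urls in a Twitter v2 response.

    Hypothetical helper for illustration; not part of pleroma-bot.
    """
    urls = []
    for tweet in payload.get("data", []):
        for item in tweet.get("entities", {}).get("urls", []):
            # Prefer the expanded form, fall back to the t.co short link.
            urls.append(item.get("expanded_url") or item.get("url"))
    return urls


# Trimmed copy of the response above: mentions and hashtags, but no "urls" key.
payload = {
    "data": [
        {
            "id": "1512123194785898503",
            "entities": {
                "mentions": [{"start": 10, "end": 22, "username": "vonderleyen"}],
                "hashtags": [{"start": 259, "end": 276, "tag": "NextGenerationEU"}],
            },
        }
    ],
    "includes": {"tweets": [], "media": [], "polls": []},
}

print(find_urls(payload))  # []
```

An empty list for a tweet that visibly carries a link on the website is a hint that the link lives in a card rather than in the tweet entities.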

nemobis commented 2 years ago

Wow, that's nasty! No wonder nitter is forced to use the "unofficial API" aka web scraping. https://github.com/zedeus/nitter/commit/111927a21cfdebbe3b67d81f3336ae7d342b4f8b

robertoszek commented 1 year ago

Funnily enough, I'm able to get some card metadata with the endpoints used by guest tokens. So I've made some changes to extract the URL and media from a card: https://github.com/robertoszek/pleroma-bot/commit/e8152114b77e91243ef8d3528561cd6f94165826

You can try it out on 1.1.1rc47:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc47

Keep in mind it will only work when using guest tokens (either by omitting the twitter_token mapping or adding guest: true in your config).
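To spell out the rule in the last sentence, here is a rough sketch of the decision as described. The function `uses_guest_token` and the flat dict layout are assumptions made for illustration, not pleroma-bot's actual code; only the `twitter_token` and `guest` keys come from the description above.

```python
def uses_guest_token(cfg):
    """Return True when the guest-token path would apply.

    Hypothetical helper: `cfg` stands in for an already-parsed config mapping.
    Guest mode applies if `guest` is true or no `twitter_token` is provided.
    """
    return bool(cfg.get("guest")) or not cfg.get("twitter_token")


# Placeholder values, not real credentials.
print(uses_guest_token({"twitter_token": "XXXX"}))                 # False
print(uses_guest_token({"twitter_token": "XXXX", "guest": True}))  # True
print(uses_guest_token({}))                                        # True
```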

nemobis commented 1 year ago

On 04/12/22 22:34, robertoszek wrote:

Keep in mind it will only work when using guest tokens (either by omitting the twitter_token mapping or adding guest: true in your config).

Will the usual tokens still be used for the rest of the calls? If not I guess I should use this only for the accounts which have this issue.

robertoszek commented 1 year ago

No, if a user in your config is marked as "guest", it will use the guest token on all the calls associated with that user.

I've been working a bit more on it to get this feature ready for the next stable release:

- get pinned tweet if using guest token (https://github.com/robertoszek/pleroma-bot/commit/5b2983291cb0163515d027ffeda54987b438e0e2)
- get poll from card if using guest token (https://github.com/robertoszek/pleroma-bot/commit/2ade63b5d8b763030c4d6ebbe180a1b94f42d8f2)

Those commits are included in 1.1.1rc48.

So the current limitations are listed here: https://github.com/robertoszek/pleroma-bot/blob/develop/docs/gettingstarted/beforerunning.md#guest-tokens

The inability to obtain protected tweets makes sense, as that will never work with a guest token.

So the only main difference between using regular Twitter tokens and the guest ones is the limit of 20 tweets per user, and I'm going to try to find out whether there's a way around it.

robertoszek commented 1 year ago

I figured out how to force it to paginate using guest tokens: https://github.com/robertoszek/pleroma-bot/commit/57aece6bd5f913c9c898dc12035f90c74c6cb6f8

I've managed to gather more than 4000 tweets for a user using this method; I'm not sure whether it has a hard limit (apart from hitting rate limits).
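For readers unfamiliar with the general idea, pagination of this kind is often cursor-based: each page comes back with a token pointing at the next one. The sketch below shows that generic pattern only; it is an assumption that the guest endpoints work this way, and `fetch_all`, `fake_fetch` and the page size of 20 are invented for the example rather than taken from the commit.

```python
def fetch_all(fetch_page, max_pages=500):
    """Follow cursors until the backend stops returning one.

    `fetch_page(cursor)` must return (items, next_cursor); next_cursor is
    None on the last page. `max_pages` is a safety cap against endless loops.
    """
    items = []
    cursor = None
    for _ in range(max_pages):
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if not page or cursor is None:
            break
    return items


# Fake backend standing in for the real API: 45 "tweets", 20 per page.
FAKE_TIMELINE = list(range(45))


def fake_fetch(cursor):
    start = cursor or 0
    page = FAKE_TIMELINE[start:start + 20]
    next_cursor = start + 20 if start + 20 < len(FAKE_TIMELINE) else None
    return page, next_cursor


print(len(fetch_all(fake_fetch)))  # 45
```

The cap on pages is there because a real endpoint can keep handing out cursors for empty pages, and rate limits make an unbounded loop expensive.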

That commit is included in version 1.1.1rc49.

nemobis commented 1 year ago

On 05/12/22 02:50, robertoszek wrote:

That commit is included in version 1.1.1rc49.

I might be doing something wrong but it gives me a bunch of

✖ 2022-12-05 12:35:03,356 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:707)
Traceback (most recent call last):
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/cli.py", line 549, in main
    user = User(user_item, config, base_path, posts_ids)
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/cli.py", line 278, in __init__
    self._get_twitter_info()
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/_twitter.py", line 169, in _get_twitter_info
    self._get_twitter_info_guest()
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/_twitter.py", line 149, in _get_twitter_info_guest
    self.pinned_tweet_id = user_twitter["pinned_tweet_ids_str"][0]
IndexError: list index out of range
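The last frame suggests that `pinned_tweet_ids_str` is an empty list for accounts without a pinned tweet, so indexing `[0]` fails. A defensive version of that lookup could look like the sketch below; `first_pinned_id` is a made-up name and this is not necessarily how the project fixed it.

```python
def first_pinned_id(user_twitter):
    """Return the first pinned tweet id, or None if the account has none.

    Hypothetical helper; only the "pinned_tweet_ids_str" key comes from
    the traceback above.
    """
    ids = user_twitter.get("pinned_tweet_ids_str") or []
    return ids[0] if ids else None


print(first_pinned_id({"pinned_tweet_ids_str": []}))       # None
print(first_pinned_id({"pinned_tweet_ids_str": ["123"]}))  # 123
```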

robertoszek commented 1 year ago

Hmm...

Does running version 1.1.1rc52 make any difference?

nemobis commented 1 year ago

On 05/12/22 12:54, robertoszek wrote:

Does running version 1.1.1rc52 make any difference?

Will try.

For now I'm getting a bunch of HTTP 403 errors (these are not protected accounts) like

requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://api.twitter.com/1.1/statuses/show.json?id=1599735346824183808&include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweet=true&query_source=typed_query&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel

nemobis commented 1 year ago

Oh wait, that was with the token. The errors seem to have vanished (for now) after commenting out the token in the config.

nemobis commented 1 year ago

Something weird is going on... this is a quote tweet https://nitter.lacontrevoie.fr/i/status/1597718716837335040 but it was posted under the quoted account @.***_farage/109462766690805431 (which doesn't seem to have retweeted it).

I also get HTTP 404 errors for tweets which used to exist but no longer do:

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.twitter.com/1.1/statuses/show.json?id=1483340344339091456&include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1&include_followed_by=1&include_want_retweets=1&include_mute_edge=1&include_can_dm=1&include_can_media_tag=1&skip_status=1&cards_platform=Web-12&include_cards=1&include_ext_alt_text=true&include_quote_count=true&include_reply_count=1&tweet_mode=extended&include_entities=true&include_user_entities=true&include_ext_media_color=true&include_ext_media_availability=true&send_error_codes=true&simple_quoted_tweet=true&query_source=typed_query&pc=1&spelling_corrections=1&ext=mediaStats%2ChighlightedLabel

On the wayback machine that redirects to https://web.archive.org/web/20220117161446/https://twitter.com/AndrzejHalicki/status/1483110270675439620 . So maybe 1483340344339091456 was a RT of 1483110270675439620 and the latter has been deleted.

robertoszek commented 1 year ago

Weird, 1597718716837335040 seems to only show up on the search API endpoint, doing the same query here:

https://twitter.com/search?q=(from%3ANigel_Farage)%20since_time%3A1669593600%20until_time%3A1670307727%20include%3Anativeretweets&src=typed_query&f=live

doesn't seem to include it in the results. You would think that using from:account wouldn't return quotes from other random accounts 😅 (and it looks like it only does so on the API endpoint).

I've added another pass to filter any tweets that don't originate from the mirrored user, just in case. https://github.com/robertoszek/pleroma-bot/commit/b70327d8c20a0d0239cb86f9ebc20cc8ab1d8efd
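The idea of that extra pass is roughly the following. This is only a sketch: `only_from_user` is an invented name, the tweets are assumed to be dicts carrying a v2-style `author_id`, and the ids are made up.

```python
def only_from_user(tweets, user_id):
    """Keep only the tweets authored by the mirrored account.

    Hypothetical helper; assumes each tweet dict has an "author_id" field.
    """
    return [t for t in tweets if t.get("author_id") == user_id]


# Made-up ids: "111" is the mirrored account, "222" a stray quoting account.
tweets = [
    {"id": "1", "author_id": "111"},
    {"id": "2", "author_id": "222"},
    {"id": "3", "author_id": "111"},
]

print([t["id"] for t in only_from_user(tweets, "111")])  # ['1', '3']
```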

Regarding the 404s, I tried replicating them on my end to no avail (a reply to a deleted tweet, a reply to a tweet that quotes a deleted tweet and a retweet of a deleted tweet didn't trigger it for me). I've made some changes to try to handle it nonetheless: https://github.com/robertoszek/pleroma-bot/commit/812e94b86657b3db359ad8a2629ede4c3c489d35
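In general terms, handling this means treating a 404 on a single tweet as "skip it" rather than as a failure for the whole user. A minimal sketch of that pattern follows; `HTTPStatusError`, `fetch_existing` and `fake_fetch` are stand-ins invented for the example (the bot itself goes through requests, as the tracebacks show), and this is not a copy of what the commit does.

```python
class HTTPStatusError(Exception):
    """Stand-in for an HTTP client error that carries a status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_existing(tweet_ids, fetch):
    """Fetch each tweet, skipping those that no longer exist (404).

    Any other status is re-raised so real problems are not hidden.
    """
    found, missing = [], []
    for tweet_id in tweet_ids:
        try:
            found.append(fetch(tweet_id))
        except HTTPStatusError as exc:
            if exc.status != 404:
                raise
            missing.append(tweet_id)
    return found, missing


def fake_fetch(tweet_id):
    # Pretend the id reported above has been deleted.
    if tweet_id == "1483340344339091456":
        raise HTTPStatusError(404)
    return {"id": tweet_id}


found, missing = fetch_existing(["1", "1483340344339091456", "2"], fake_fetch)
print(len(found), missing)  # 2 ['1483340344339091456']
```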

Both commits are included in 1.1.1rc53. Let me know if it still results in unhandled errors on your end.

robertoszek commented 1 year ago

Oh, and the weird 403s you were getting when providing the token should be fixed in 1.1.1rc54. There was a parameter that resulted in "403 Client is not authorized to perform this action" even when using an Elevated Twitter token (but it was fine with guest tokens): https://github.com/robertoszek/pleroma-bot/commit/a0e01b81a13bc997a328b85db2d37f6ae6de839f

nemobis commented 1 year ago

Everything is going well so far with 1.1.1rc54 (using a token, not guest tokens): it's all very fast and the process gets stuck very rarely (I don't even notice it) so the mirror is never too far behind.

The one error I see in the last day or so, apart from errors on my side, is

   1 requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://api.twitter.com/2/tweets/1600535989239574530?poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics%2Calt_text&expansions=attachments.poll_ids%2Cattachments.media_keys%2Cauthor_id%2Centities.mentions.username%2Cgeo.place_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Cpossibly_sensitive%2Creferenced_tweets%2Csource%2Ctext%2Cwithheld

the tweet seems fine: https://nitter.cz/MounirSatouri/status/1600535989239574530

robertoszek commented 1 year ago

Oh, I forgot to mention I added some retries for cases when an HTTP 503 is returned by Twitter's API: https://github.com/robertoszek/pleroma-bot/commit/3ffe5f345cf7ed725ac806bd536ee4772ad9c569

It was included in the latest stable release, v1.2.0.

There's not much else we can do other than retry a few times. Twitter's API usually starts returning 503 when their servers are overloaded or over capacity at the time of the request.
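As an illustration of the retry idea, here is a small self-contained sketch. The names `with_retries`, `ServiceUnavailable` and `make_flaky` are invented for the example, and the exponential backoff between attempts is an assumption of this sketch, not a description of what the commit does.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response."""


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on ServiceUnavailable with exponential backoff.

    Waits base_delay, then 2x, 4x... between attempts; the last failure
    is re-raised so the caller can log it and skip the tweet.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def make_flaky(failures):
    """Build a call that fails `failures` times, then returns "ok"."""
    state = {"left": failures}

    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise ServiceUnavailable()
        return "ok"

    return call


delays = []  # record the waits instead of actually sleeping
print(with_retries(make_flaky(2), sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the example fast to run and makes the waiting behaviour easy to check.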