Make URLs count as 23 characters for Mastodon

nemobis commented 1 year ago

I've seen this a few times though it's rare: https://respublicae.eu/@nilstorvalds/109377434782905923

The link was truncated and the last word was replaced with ….

robertoszek commented 1 year ago

It looks like the post is 507 characters long, so it was truncated as Mastodon's maximum is 500.

However, URLs (even unshortened ones) actually count as only 23 characters in Mastodon: https://docs.joinmastodon.org/client/guidelines/

So, adjusting the truncating logic on the bot to account for this would be the best course of action. In the case of this particular post it would have brought down the character count to 289, which wouldn't result in truncating it.

The problem is that, even after accounting for this, some extreme cases like quote tweets could still go over Mastodon's character limit. Twitter's limit is 280 characters, so the body of a quote tweet alone could be up to 560 characters long (280 * 2). That makes truncation the only option there, potentially creating dead links in the process. I can't think of much we can do about that right now.

Anyway, counting URLs as 23 characters internally on the bot would at least mitigate this issue somewhat and make us run into it less frequently.
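
As a rough illustration of that counting rule, here is a minimal sketch. The `URL_RE` pattern and the `mastodon_len` helper as written here are assumptions for the example, not the bot's actual implementation; the idea is only that every URL contributes a fixed 23 characters, whatever its real length.

```python
import re

# Assumption: a deliberately simple pattern for http(s) URLs.
# The regex the bot really uses may be stricter.
URL_RE = re.compile(r"https?://\S+")


def mastodon_len(text: str, chars_per_url: int = 23) -> int:
    """Return the length of text with each URL counted as chars_per_url."""
    urls = URL_RE.findall(text)
    # Remove the real length of every URL, then add the fixed cost per URL.
    real_url_chars = sum(len(url) for url in urls)
    return len(text) - real_url_chars + chars_per_url * len(urls)


if __name__ == "__main__":
    long_post = "Read this: https://example.com/" + "x" * 200
    print(len(long_post), mastodon_len(long_post))
```

Note that under this rule a URL shorter than 23 characters also counts as 23, so the adjusted length can be larger than the raw one.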

nemobis commented 1 year ago

On 20/11/22 21:44, robertoszek wrote:

However, URLs (even unshortened ones) actually count as only 23 characters in Mastodon:

In recent instances, the exact amount (which might be configurable?) is advertised in the API https://respublicae.eu/api/v1/instance at .configuration.statuses.max_characters and .configuration.statuses.characters_reserved_per_url.

robertoszek commented 1 year ago

Right, we should honor those if they are present.
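
Something along these lines could read those values, falling back to the usual defaults when an instance does not advertise them. This is only a sketch: `status_limits` and the two default constants are hypothetical names, and the function takes the already-parsed JSON of `/api/v1/instance` as a dict rather than doing the HTTP request itself.

```python
from typing import Any, Mapping, Tuple

# Fallbacks used when the instance does not advertise the values.
DEFAULT_MAX_CHARS = 500
DEFAULT_CHARS_PER_URL = 23


def status_limits(instance: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (max_characters, characters_reserved_per_url) for an instance.

    `instance` is the parsed JSON body of GET /api/v1/instance.
    Missing keys fall back to the defaults above.
    """
    statuses = (instance.get("configuration") or {}).get("statuses") or {}
    max_chars = statuses.get("max_characters", DEFAULT_MAX_CHARS)
    per_url = statuses.get("characters_reserved_per_url", DEFAULT_CHARS_PER_URL)
    return int(max_chars), int(per_url)


if __name__ == "__main__":
    advertised = {"configuration": {"statuses": {"max_characters": 1000}}}
    print(status_limits(advertised))  # per-URL value falls back to the default
    print(status_limits({}))          # older instance: both defaults
```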

As always I need to find some time and finally get around to writing coverage tests for it, but this should do it: 70c91872cb0b17ee3a77e5703a2214c82bc2806c

Feel free to report back or provide any feedback if you encounter issues with it: pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc19

nemobis commented 1 year ago

I think there might be something wrong with the regex, because I'm regularly getting errors of this sort:

2022-11-25 21:58:56,364 pleroma_bot ERROR: Exception occurred for user, skipping...
Traceback (most recent call last):
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/cli.py", line 589, in main
    tweets_to_post = user.process_tweets(tweets)
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/_processing.py", line 160, in process_tweets
    len_text = self._mastodon_len(tweet["text"])
  File "/home/7/federico/mastodon/bot/lib/python3.9/site-packages/pleroma_bot/_utils.py", line 538, in _mastodon_len
    text = re.sub(group, group[:char_count_url], text)
  File "/usr/lib/python3.9/re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.9/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.9/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.9/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.9/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.9/sre_parse.py", line 549, in _parse
    raise source.error("unterminated character set",
re.error: unterminated character set at position 190

nemobis commented 1 year ago

The nitter URLs are sometimes truncated too, specifically in RT with comment: https://respublicae.eu/@EC_StockholmRep/109406454535266575 .

robertoszek commented 1 year ago

The nitter URLs are sometimes truncated too, specifically in RT with comment: https://respublicae.eu/@EC_StockholmRep/109406454535266575 .

Oh, I see how that could've happened. Once we have processed the length to account for a signature or original date, we do a final check to see if the text still needs to be truncated, and that check wasn't using the Mastodon length: 5fb6b96aeeb20075886f230c25807064bdf8ba3e

And the regex for finding the URLs seems to work, but it then fails for you when each URL is used as the pattern to substitute it, so let's just try a simple string replace instead: 7e27c71a0eaf03fa1d9156e0c5583db12a63668e
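
To show why using a URL as a regex pattern can blow up, here is a small reproduction. The URL is made up for the example (any URL with an unbalanced "[" should behave the same way), and the snippet is a sketch rather than the code from the commit.

```python
import re

# Hypothetical URL containing an unbalanced "[", which re treats as the
# start of a character set when the URL is used as a pattern.
url = "https://example.com/search?tags[=a"
text = f"Look: {url} end"

error = None
try:
    re.sub(url, url[:23], text)
except re.error as exc:
    error = str(exc)  # "unterminated character set at position ..."

# Option 1: plain string replacement, no regex semantics involved.
fixed_replace = text.replace(url, url[:23])

# Option 2: keep re.sub but escape the URL first. A function is used as the
# replacement so that backslashes in it would not be interpreted either.
fixed_escape = re.sub(re.escape(url), lambda m: url[:23], text)

if __name__ == "__main__":
    print(error)
    print(fixed_replace)
```

Both options give the same result here; the plain replace is simply the one with fewer moving parts.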

I've added the changes to 1.1.1rc27: pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc27

nemobis commented 1 year ago

Thanks, testing now.

robertoszek / pleroma-bot

Make URLs count as 23 characters for Mastodon #95