Open Miss-Inputs opened 1 year ago
Determining the length of strings is surprisingly complicated. A trivial looking fix like
if len(s) > 128:
s = s[:125] + "..."
might fail, because Python's string type indexes by code points. Anyone fixing this should
"🏳️🌈"
is 1 grapheme, 4 code points, and 14 bytes. A good way to test this is to try setting the Discord "rich presence" details field to "🏳️🌈" * 128
, and if that fails, try "🏳️🌈" * 32
, and if that fails, try "🏳️🌈" * 9
. (It's also possible they're allowing 128 wide 2-byte units, with an internal utf-16 representation.)
Note that if you decide not to do grapheme splitting, you still need to be careful if there's a byte size limit not to break in the middle of a code point. Something like x.encode("utf-8")[:125].decode("utf-8", errors="ignore") + "..."
should suffice.
I can't work on this because I don't use Discord. Happy to help review a PR though.
Should this be solved upstream instead? Like self.discordrp.update(details=details, state=state, truncate=True, ...)
, ie by adding a new argument truncate
, always_fit
, or something like that.
Should this be solved upstream instead?
I think trying to get a fix in pypresence is probably the reasonable way to try to fix this, if they'll accept it.
Either way, roughly the same code is needed. E.g. if you care about not breaking graphemes (e.g. turning 🏳️🌈 into 🏳️ after truncation), you have to use a unicode aware grapheme splitting library. And best to figure out what Discord's actual limit is (128 bytes? 128 characters? 128 code points? 128 wchar_t
? Something else?) with actual testing.
(On the one hand, I find it surprising it is not possible to detect incomplete graphemes. On the other hand, I think this could potentially lead to hilarious misunderstandings.)
I'll admit I went by the "128 characters" message and had tested with Latin and Japanese characters, and did not bother to test with emoji and flags. I ran more tests:
"a" 128 -> OK "a" 129 -> Nope "λ" 128 -> OK "λ" 129 -> Nope "シ" 128 -> OK "シ" 129 -> Nope "😺" 128 -> Nope "😺" 64 -> OK "😺" 65 -> Nope "🏳️🌈" 128 -> Nope "🏳️🌈" 21 -> OK "🏳️🌈" 22 -> Nope
I'm not too familiar with Unicode encoding, so I'm all ears for suggestions.
Would it make more sense to truncate at 127 characters instead, or 126 (whatever is the maximum amount of characters where you can guarantee that adding an ellipsis would not reach 128 bytes in any reasonable scenario), to avoid the issue entirely? Gets the message across (all my friends will know I am very cool for listening to a certain song) without running into edge cases. Although, I'm not sure what song titles are that long and have a gay pride flag in their title anyway… now I have to find one LOL
Well going by my tests the limit seems to be 21 gay pride flags, and we can't truncate everything down to 21 now can we? I'm not sure but Wiki suggests flag emojis take up 2 code points and hence 8 bytes? It doesn't seem to match up with the rest of the tests where each "character" take up (1, 2, 3, 4) bytes assuming UTF-8?
I'm not sure but Wiki suggests flag emojis take up 2 code points and hence 8 bytes? It doesn't seem to match up with the rest of the tests where each "character" take up (1, 2, 3, 4) bytes assuming UTF-8?
In UTF-8 "characters" (meaning graphemes) can use an unlimited number of bytes. You are thinking of code points - all UTF-8 code points take between 1 and 4 bytes. A fully qualified 🏳️🌈 flag takes 14 bytes in UTF-8 and 12 bytes in UTF-16.
Thank you for the tests. Based on your work, I believe I can say what the Discord limits are. They allow 256 bytes of UTF-16 encoded text, not including any byte order mark (BOM) at the beginning. That is 128 ASCII characters when encoded in UTF-16.
The best you can do without grapheme splitting looks something like this:
def trim_text_utf16(s, max_bytes=256):
# use _le tagged encoding to avoid BOM insertion
if len(s.encode("utf_16_le")) <= max_bytes:
return s
result = ""
byte_size = 0
for cp in s:
byte_size += len(cp.encode("utf_16_le"))
# 2 = len("…".encode("utf_16_le"))
if byte_size > max_bytes - 2:
return result + "…"
result += cp
Call the function like this:
>>> trim_text_utf16("a" * 1000)
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa…'
>>> trim_text_utf16("aaa" + "🏳️⚧️" * 1000)
'aaa🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️⚧️🏳️…'
Note the borked flag at the end - the result of splitting by code point instead of by grapheme.
This may be considered good enough given the relative rarity of long titles. I would present this information to the upstream developer (pypresence) and see if they want to fix it.
If they want to do grapheme splitting there are a few packages for it. I think regex
can technically do it although I don't know if it stays up to date with Unicode (IIRC it has the character classes hardcoded in its source code). The "right" way to do this is probably pyicu. Otherwise, I'd use the trim_text_utf16
function above.
Steps to reproduce
Cowgirl / Synaesthesia (original En Motion mix) / Deepest Blue (a cappella) (Underworld vs. The Thrillseekers vs. Deepest Blue)
by The Cut Up BoysExpected Output
If the resulting status is too long, it should be truncated to whatever Discord thinks is the maximum width
Actual Output
Discord status is not updated, and is still the previous song that was played or Paused or whatever it was before.
Test System
Which version of Quod Libet?
4.5.0 but Discord plugin was taken from Git commit 721ff2e3f2273537988b73fea7071b6390b1f34a
Which operating system
Xubuntu 22.04
If it's audio-related, what back-end?
N/A
Additional Information