Encoding Error with ¿ character becoming Â¿ on Spoke Backend

bengilvar commented 1 year ago


[YES] I have searched through existing issues and did not find an existing report

Describe the bug
A client wrote the following message, which as far as I can tell uses only GSM characters: Hola {firstName}, soy Felipe y vivo en Redlands. Quieren prohibir vendedores ambulantes en Redlands. ¿Apoya protecciones para los vendedores y sus familias?

Both the Twilio message segment estimator and Spoke classify this message as GSM (see picture). However, the client was charged for 3 segments instead of 1, and we saw 3 segments on the backend (see attached CSV). Although Spoke displays "¿", the backend shows "Â¿". The Â seems to be what makes the message non-GSM, since ¿ itself is a GSM character. See GSM characters: https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_alphabet_and_extension_table_of_3GPP_TS_23.038_.2F_GSM_03.38

To Reproduce
Steps to reproduce the behavior:

1. Go to a Spoke campaign
2. Copy the above text into the interactions section for an initial message
3. Send the message
4. Check the number of segments on the backend

Expected behavior
The message sends as 1 segment, because all of its characters are GSM characters.

Screenshots
[image: GSM]

Desktop (please complete the following information):

Browser [chrome, i]

iedo query.csv


sync-by-unito[bot] commented 1 year ago

➤ Derrick Liu commented:

This is likely due to ¿ being encoded as UTF-8 (Unicode), since Â routinely appears when UTF-8 bytes are incorrectly decoded as ISO 8859-1 or Windows-1252. See https://en.wikipedia.org/wiki/%C3%82#In_encoding_mismatches

In UTF-8, the copyright symbol (©) is encoded with the hexadecimal bytes C2 A9. In the older Western encoding standards, however, the © symbol is simply A9. If a browser is given the bytes C2 A9, intended to display © in UTF-8, but is led to parse the bytes according to one of the Western encodings, it will interpret the bytes C2 A9 as two separate characters. C2 corresponds to Â, as seen in the chart above, and A9 devolves to the © symbol, so the result seen by the person reading the page is Â©, that is, the correct © symbol but with an Â prepended.
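The same mechanism applies to ¿ (U+00BF), whose UTF-8 bytes are C2 BF. A minimal Python sketch of the mismatch follows. The helper name `mojibake` is made up for illustration and is not part of Spoke; where in the pipeline the wrong decoding actually happens is not established in this thread.

```python
def mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with a single-byte
    Western encoding, reproducing the mismatch described above."""
    return text.encode("utf-8").decode(wrong_encoding)


# ¿ is two bytes in UTF-8: C2 BF
print("¿".encode("utf-8").hex(" "))   # c2 bf

# Read as ISO 8859-1 (or Windows-1252), C2 becomes Â and BF stays ¿
print(mojibake("¿"))                  # Â¿
print(mojibake("¿", "cp1252"))        # Â¿

# The damage is reversible as long as the bytes were not altered further
print("Â¿".encode("latin-1").decode("utf-8"))  # ¿
```

If the "Â¿" seen in the CSV comes from the tool used to open the export rather than from the stored data, the message itself may be intact, so it is worth checking the raw bytes before concluding that the backend corrupted it.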

mkoontz-rewired commented 1 year ago

@ajohn25 and I tested the message from the issue with both Telnyx and Bandwidth on staging. When creating the test campaigns, we copied the same message from a Notion page: Hola {firstName}, soy Felipe y vivo en Redlands. Quieren prohibir vendedores ambulantes en Redlands. ¿Apoya protecciones para los vendedores y sus familias?

Bandwidth test:

@ajohn25 was nice enough to contact support and confirm that the message we sent containing the inverted question mark was calculated as 1 segment by Bandwidth.

Telnyx test:

After misunderstanding the issue description (I thought testing with the Twilio message segment estimator meant this user was sending over Twilio), I tested the message with Telnyx, the actual telecom provider used by this user. The delivery report confirmed that Telnyx calculated the message as 3 segments. Telnyx's definition of valid GSM-7 characters differs from the one used by Twilio's segment calculator:

{
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "\n\r !\"#\$%&'()*+,-./:;<=>?@[\\]^_{}|~"
}

from https://developers.telnyx.com/docs/v2/messaging/messages/resources/configuration_and_limitations/character_and_rate_limits/

Their list does not include ¿, so they're accurately charging us according to their own definition of valid GSM-7 characters.
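To make the 1-versus-3 difference concrete, here is a rough Python sketch of a segment estimate under the two character sets. It assumes the standard SMS limits (160 characters for a single GSM-7 segment and 153 per part when concatenated; 70 and 67 for UCS-2) and assumes that any character outside the allowed set forces the whole message to UCS-2. `count_segments` and the sample name "Ana" standing in for {firstName} are illustrative only; this is not Spoke's estimator or Telnyx's billing logic.

```python
import math

# Character set as quoted from the Telnyx docs above
TELNYX_GSM7 = set(
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "\n\r !\"#$%&'()*+,-./:;<=>?@[\\]^_{}|~"
)

# GSM 03.38 default alphabet and extension table (per the Wikipedia page linked in the issue)
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENSION = set("\f^{}\\[~]|€")  # each of these costs two septets
GSM7_FULL = GSM7_BASIC | GSM7_EXTENSION


def count_segments(text: str, allowed: set) -> int:
    """Estimate SMS segments, treating `allowed` as the 7-bit character set."""
    if all(ch in allowed for ch in text):
        units = sum(2 if ch in GSM7_EXTENSION else 1 for ch in text)
        single, multi = 160, 153
    else:
        # Fallback to UCS-2: count UTF-16 code units
        units = len(text.encode("utf-16-be")) // 2
        single, multi = 70, 67
    return 1 if units <= single else math.ceil(units / multi)


MESSAGE = (
    "Hola Ana, soy Felipe y vivo en Redlands. Quieren prohibir vendedores "
    "ambulantes en Redlands. ¿Apoya protecciones para los vendedores y sus familias?"
)

print(count_segments(MESSAGE, GSM7_FULL))    # 1
print(count_segments(MESSAGE, TELNYX_GSM7))  # 3
```

Under these assumptions the single ¿ is enough to push the message from one GSM-7 segment to three UCS-2 segments, which matches the Telnyx delivery report.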

Open Question:

Because we're moving most traffic to Bandwidth, the question is whether it's worth updating the client-side estimator now so that it calculates estimated segments and encoding based on which messaging service is in use. @ajohn25 thinks it's probably not worth it, but I'm wondering whether y'all (@hiemanshu, @bchrobot) agree.

If it is worth it, I'll go ahead and make fixing that part of this ticket. If it's not, I can either create a ticket for it and close this one or just close as is.

hiemanshu commented 1 year ago

I agree. I'd say close this as wontfix.

mkoontz-rewired commented 1 year ago

Sounds good. Closing as won't fix, since this is Telnyx-specific and we're migrating away from Telnyx.

politics-rewired / Spoke

Encoding Error with ¿ character becoming Â¿ on Spoke Backend #1595