nossebro / OpenAI

OpenAI chat bot integration
GNU General Public License v3.0

Encoding error #1

Closed: v1k70rk4 closed this issue 1 year ago

v1k70rk4 commented 1 year ago

Hello,

First of all, thank you so much for your work! Lately, though, I have been running into an encoding error whose cause I have not been able to identify.

The problem is that accented characters in the response are sometimes not converted correctly: as the ##bad## example in the log below shows, they arrive as UTF-8 byte sequences (for example \xc3\xa9) instead of single Unicode characters (\xe9).

The issue appears to come up at random, and I'm unsure how to handle it.

Thank you in advance for your response :)

2023-07-14 22:21:48,248  Execute(): DEBUG: Send to AI from ThungiBogi: 'Szoboszlai dominik hány éves?'
2023-07-14 22:21:48,254  OpenAIAPIPostRequest(): DEBUG: Parent.PostRequest(https://api.openai.com/v1/chat/completions, {'Authorization': 'Bearer ##token##'}, {'n': 1, 'model': 'gpt-3.5-turbo', 'max_tokens': 600, 'user': 'ThungiBogi', 'temperature': 0.70000000000000007, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'messages': [{'content': u'Te egy sz\xc3\xb3rakoztat\xc3\xb3 Twitch chatbot vagy, Gatto BOT a neved. A streamer neve Bal\xc3\xa1zs, Viktor az \xc3\xa9lett\xc3\xa1rsa \xc3\xa9s a moder\xc3\xa1tor. M\xc3\xa1sa a perzsa macsk\xc3\xa1juk.', 'role': 'system'}, {'content': u'Szoboszlai dominik h\xe1ny \xe9ves?', 'role': 'user'}], 'top_p': 1.0}, True
2023-07-14 22:21:49,517  OpenAIAPIPostRequest(): DEBUG: 200
2023-07-14 22:21:49,519  OpenAIAPIPostRequest(): DEBUG: {'choices': [{'message': {'role': 'assistant', 'content': u'Szoboszlai Dominik jelenleg 20 \xe9ves.'}, 'index': 0, 'finish_reason': 'stop'}], 'created': 1689366108, 'usage': {'completion_tokens': 15, 'prompt_tokens': 88, 'total_tokens': 103}, 'model': 'gpt-3.5-turbo-0613', 'object': 'chat.completion', 'id': 'chatcmpl-7cJaOf3Sz8lqthU0f8fqgGc3Rz8Kb'}
2023-07-14 22:21:49,524  split_text_into_sentences(): DEBUG: [u'Szoboszlai Dominik jelenleg 20 \xe9ves.']
##good##2023-07-14 22:21:49,525  Execute(): DEBUG: @ThungiBogi Szoboszlai Dominik jelenleg 20 éves.
2023-07-14 22:22:01,042  Execute(): DEBUG: Send to AI from ThungiBogi: 'Szoboszlai dominik hány éves? 2023 van.'
2023-07-14 22:22:01,044  OpenAIAPIPostRequest(): DEBUG: Parent.PostRequest(https://api.openai.com/v1/chat/completions, {'Authorization': 'Bearer ##token##'}, {'n': 1, 'model': 'gpt-3.5-turbo', 'max_tokens': 600, 'user': 'ThungiBogi', 'temperature': 0.70000000000000007, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'messages': [{'content': u'Te egy sz\xc3\xb3rakoztat\xc3\xb3 Twitch chatbot vagy, Gatto BOT a neved. A streamer neve Bal\xc3\xa1zs, Viktor az \xc3\xa9lett\xc3\xa1rsa \xc3\xa9s a moder\xc3\xa1tor. M\xc3\xa1sa a perzsa macsk\xc3\xa1juk.', 'role': 'system'}, {'content': u'Szoboszlai dominik h\xe1ny \xe9ves? 2023 van.', 'role': 'user'}], 'top_p': 1.0}, True
2023-07-14 22:22:03,561  OpenAIAPIPostRequest(): DEBUG: 200
2023-07-14 22:22:03,562  OpenAIAPIPostRequest(): DEBUG: {'choices': [{'message': {'role': 'assistant', 'content': u'Szoboszlai Dominik jelenleg 21 \xc3\xa9ves, sz\xc3\xbclet\xc3\xa9si d\xc3\xa1tuma 2000. okt\xc3\xb3ber 9. Ez azt jelenti, hogy 2023-ban m\xc3\xa1r 23 \xc3\xa9ves lesz.'}, 'index': 0, 'finish_reason': 'stop'}], 'created': 1689366121, 'usage': {'completion_tokens': 63, 'prompt_tokens': 93, 'total_tokens': 156}, 'model': 'gpt-3.5-turbo-0613', 'object': 'chat.completion', 'id': 'chatcmpl-7cJab2jRcIiuPjsCuGQIc4jfrPxaC'}
2023-07-14 22:22:03,563  split_text_into_sentences(): DEBUG: [u'Szoboszlai Dominik jelenleg 21 \xc3\xa9ves, sz\xc3\xbclet\xc3\xa9si d\xc3\xa1tuma 2000.', u'okt\xc3\xb3ber 9.', u'Ez azt jelenti, hogy 2023-ban m\xc3\xa1r 23 \xc3\xa9ves lesz.']
##bad##2023-07-14 22:22:03,563  Execute(): DEBUG: @ThungiBogi Szoboszlai Dominik jelenleg 21 éves, születési dátuma 2000. október 9. Ez azt jelenti, hogy 2023-ban már 23 éves lesz
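
To make the difference between the two responses concrete, here is a small Python 3 sketch that lists the non-ASCII characters in each `content` value. The two literals are copied from the log above (the second one is shortened); the helper name `non_ascii` is only illustrative and is not part of the script. It assumes the log's `u'...'` repr maps one-to-one onto these Python 3 strings.

```python
# Strings copied from the 'content' fields of the two responses in the log.
good = u'Szoboszlai Dominik jelenleg 20 \xe9ves.'
bad = u'Szoboszlai Dominik jelenleg 21 \xc3\xa9ves'


def non_ascii(text):
    """Return (character, code point) pairs for every non-ASCII character."""
    return [(ch, "U+%04X" % ord(ch)) for ch in text if ord(ch) > 127]


print(non_ascii(good))  # a single code point: U+00E9
print(non_ascii(bad))   # two code points: U+00C3 followed by U+00A9
```

In the first response "é" is one character, while in the second it has been split into two characters, one per UTF-8 byte.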
v1k70rk4 commented 1 year ago

Does OpenAI use different encodings to represent special characters with accents?

  1. "Szoboszlai Dominik jelenleg 20 \xe9ves." - Here, the "\xe9" escape is the single Unicode code point U+00E9, the character "é". The same value, 0xE9, is also the byte for "é" in single-byte encodings such as Latin-1 (ISO-8859-1) and Latin-2 (ISO-8859-2).

  2. "Szoboszlai Dominik jelenleg 21 \xc3\xa9ves..." - In this log message, "\xc3\xa9" is the two-byte UTF-8 encoding of "é". Because it appears inside a u'...' string, each byte has become a separate code point (U+00C3 and U+00A9).

So I still don't completely understand why the two responses differ...
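
One possible reading of the two log lines, offered as an assumption rather than a confirmed diagnosis, is that the bytes of the second response were UTF-8 but were decoded one byte per character (as Latin-1 would do) somewhere before logging. If that is what happens, a workaround along the following lines might help. This is a minimal Python 3 sketch; `repair_mojibake` is a hypothetical helper, not a function from the script or from any library.

```python
def repair_mojibake(text):
    """Undo a 'UTF-8 bytes read as Latin-1' mix-up.

    Returns the text unchanged when the round trip does not apply,
    for example when the string was already decoded correctly.
    """
    try:
        # Map each character back to the byte it came from, then decode
        # the resulting byte string as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters above U+00FF, or bytes that are not valid UTF-8:
        # assume the text was fine and leave it alone.
        return text


print(repair_mojibake(u'Szoboszlai Dominik jelenleg 21 \xc3\xa9ves'))
print(repair_mojibake(u'Szoboszlai Dominik jelenleg 20 \xe9ves.'))
```

The fallback matters because the correctly decoded string from the first response is not valid UTF-8 once re-encoded as Latin-1, so it passes through untouched. The helper is a heuristic: a correctly decoded string that genuinely contains a sequence such as "Ã©" would be altered by it.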