jitsi / jigasi

Jigasi: a server-side application acting as a gateway to Jitsi Meet conferences. Currently allows regular SIP clients to join meetings and provides transcription capabilities.
Apache License 2.0

Encoding problem: Lack of UTF-8 Support for JSON POST Requests in Transcription Module #505

Open VewMet opened 11 months ago

VewMet commented 11 months ago

Description

Transcribed text in languages other than English (for example, Hindi) appears as '?' characters at the endpoints configured in SEND_JSON_REMOTE_URLS of the Jigasi transcription module. The content must be sent and received using the UTF-8 character encoding so that non-ASCII characters are not misinterpreted.

Current behavior

Whenever I spoke in Hindi, the non-ASCII characters were not preserved, and '?' was posted to my streams.


When the JSON data is sent to the server, the character encoding is not explicitly set. By default, the system's default character encoding may be used, and that might not be UTF-8.
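To illustrate how this kind of '?' substitution can arise, here is a minimal standalone sketch. It is not Jigasi code: the class and method names are hypothetical, and US-ASCII stands in for whatever non-UTF-8 default a given system might have. It encodes a Hindi word to bytes and decodes it back with the same charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jigasi.
public class EncodingDemo {

    // Encodes text to bytes with the given charset, then decodes it back.
    // String.getBytes(Charset) replaces characters the charset cannot
    // represent with that charset's replacement byte ('?' for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "namaste" in Devanagari, written with Unicode escapes so the
        // example does not depend on the source file's encoding.
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947";

        // Each of the six characters is unmappable in US-ASCII.
        System.out.println(roundTrip(hindi, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(hindi, StandardCharsets.UTF_8).equals(hindi));
    }
}
```

If the sending side relies on a platform default that cannot represent Devanagari, the receiver has no way to recover the original text, because the information is already lost when the bytes are produced.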

Expected Behavior

The transcription text should correctly represent the spoken content in any supported language without encoding issues.

Possible Solution

I've created a pull request that addresses this issue by ensuring the Content-Type header for JSON POST requests is explicitly set to application/json; charset=UTF-8. Additionally, I've ensured that the JSON string is converted to bytes using UTF-8 encoding before sending.
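The two changes described above can be sketched roughly as follows. This is an illustration using only the JDK's `HttpURLConnection`, not the code from the pull request; the class name `Utf8JsonPost` and its methods are hypothetical, and Jigasi's actual HTTP client code may look different.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jigasi.
public class Utf8JsonPost {

    // Declares the charset explicitly so the receiver does not have to guess.
    static final String CONTENT_TYPE = "application/json; charset=UTF-8";

    // Converts the JSON string to bytes with UTF-8 instead of the
    // platform default charset.
    static byte[] encodeBody(String json) {
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // Posts the JSON to remoteUrl and returns the HTTP status code.
    static int post(String remoteUrl, String json) throws IOException {
        byte[] body = encodeBody(json);
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(remoteUrl).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", CONTENT_TYPE);
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Both parts matter: the header tells the receiver how to decode the body, and the explicit `getBytes(StandardCharsets.UTF_8)` ensures the body is actually encoded that way regardless of the JVM's default charset.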

PR Link: https://github.com/jitsi/jigasi/pull/504

Steps to reproduce

  1. Set up Jigasi with the transcription service.
  2. Set the property org.jitsi.jigasi.transcription.SEND_JSON_REMOTE_URLS=<remote json accepting url>
  3. Use the transcription feature with a non-ASCII language, e.g., Hindi.
  4. Observe that the transcription text received at the remote URL contains unexpected characters or question marks.