muzuiget / dualsub-support

Dualsub - Dual Subtitles for YouTube
https://www.dualsub.xyz/

RTL Languages (such as Hebrew, Arabic) are aligned as LTR #577

Closed idanrm closed 7 months ago

idanrm commented 8 months ago

Subtitles translated to Hebrew and Arabic, which are right-to-left languages, are rendered incorrectly. For example, the punctuation appears wrong and sometimes the last part of the sentence appears before the first part, etc.

Thanks.

muzuiget commented 8 months ago

This issue is somewhat complex, and I lack extensive experience in this area.

Displaying Arabic, a right-to-left script, in browsers involves multiple methods, each suited to different scenarios. The complexity increases when it is mixed with left-to-right scripts like English.

You need to specify the type of subtitles in question: are they original subtitles provided by the website or machine-translated subtitles?

If they are the website's original subtitles, is the website's native subtitle functionality working correctly?

idanrm commented 8 months ago

It happens in all cases, with both original subtitles and machine translation. For example, if a YouTube video has original, auto-generated, or machine-translated Arabic subtitles, YouTube shows them correctly, while Dualsub renders the same subtitles incorrectly in every case. When the text direction is changed (for example, with direction: rtl), the subtitles are shown correctly, including when they mix, say, English and Arabic text.

While inspecting YouTube, I noticed that it sets the dir="rtl" attribute on the subtitle <div> element for Arabic/Hebrew subtitles.

A good way to check whether the text is rendered correctly without knowing the language is to take a sentence that ends with a period and see whether the period appears on the left side (correct) or on the right side (incorrect). In RTL languages, a period on the right side of the sentence is actually at the beginning of the sentence, which is wrong.

muzuiget commented 7 months ago

Previously, I encountered a similar issue, #337, and I simply made modifications based on the reporter's suggestions because I had no knowledge of this area.

However, now with ChatGPT, I have a deeper understanding of this area.

The modification made in #337 is incorrect. The HTML entity &rlm; does need to be converted to U+200F (RIGHT-TO-LEFT MARK), not U+202B (RIGHT-TO-LEFT EMBEDDING). This is because "semantic strings" and "visual strings" should not be confused:

When extracting semantics from .vtt caption files, it is indeed necessary to use U+200F.

When there is a need to "draw" a "semantic string," there are two methods:
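For the extraction step described above, a minimal sketch of the entity conversion might look like the following. The helper name decodeDirectionalEntities is hypothetical and is not taken from Dualsub's source; handling &lrm; alongside &rlm; is an added assumption based on the WebVTT cue text escapes.

```typescript
// Hypothetical helper, not Dualsub's actual code: convert the directional
// entities found in .vtt cue text into their Unicode mark characters.
const DIRECTIONAL_ENTITIES: Record<string, string> = {
    "&rlm;": "\u200F", // RIGHT-TO-LEFT MARK (not U+202B, which is an embedding control)
    "&lrm;": "\u200E", // LEFT-TO-RIGHT MARK
};

function decodeDirectionalEntities(cueText: string): string {
    // Other entities (such as &amp;) are left untouched by this sketch.
    return cueText.replace(
        /&(?:rlm|lrm);/g,
        (entity) => DIRECTIONAL_ENTITIES[entity] ?? entity
    );
}
```

The result is still a "semantic string": the mark is invisible and only hints at direction, so how the string is finally drawn remains a separate decision.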

muzuiget commented 7 months ago

When you select Arabic/Hebrew in the YouTube native subtitle menu, it indeed inserts the dir="rtl" attribute.

The Dualsub "Native Mode" simply merges the text of two subtitles and does not control the subtitle HTML. This means Dualsub cannot modify HTML based on language because the YouTube subtitle code continuously refreshes (or regenerates) the subtitle HTML.

muzuiget commented 7 months ago

In the Unicode Bidirectional Algorithm, punctuation belongs to the "neutral characters." Their display direction is determined by several factors, including the direction of the preceding and following characters, the direction of the entire sentence, and even the default direction of the entire webpage.

Therefore, I prefer to preserve the original state and let users use CSS to precisely control the writing direction of each sentence.
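To illustrate why neutral characters alone cannot decide a direction, here is a rough sketch of a "first strong character" heuristic. It is a deliberate simplification and not the full Unicode Bidirectional Algorithm: only basic Hebrew letters, basic Arabic letters, and ASCII Latin letters are treated as strong, and everything else (punctuation, digits, spaces) is treated as neutral. The function names are hypothetical.

```typescript
type Direction = "rtl" | "ltr" | "neutral";

// Simplified assumption: only these two letter ranges count as strong RTL.
function isRtlLetter(code: number): boolean {
    return (
        (code >= 0x05d0 && code <= 0x05ea) || // Hebrew letters alef..tav
        (code >= 0x0620 && code <= 0x064a)    // basic Arabic letters
    );
}

// Simplified assumption: only ASCII letters count as strong LTR.
function isLatinLetter(code: number): boolean {
    return (code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a);
}

// Skip over neutral characters and return the direction of the first strong one.
function firstStrongDirection(text: string): Direction {
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        if (isRtlLetter(code)) return "rtl";
        if (isLatinLetter(code)) return "ltr";
    }
    return "neutral"; // e.g. "..." or "123": direction must come from context
}
```

A string made only of punctuation returns "neutral" here, which mirrors the point above: its placement depends on the surrounding text and on the base direction of the container.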

muzuiget commented 7 months ago

Here is the CSS to fix the direction.

/* native mode: first caption line is the RTL language, second is LTR */
.caption-visual-line:nth-child(1) {
    direction: rtl;
}
.caption-visual-line:nth-child(2) {
    direction: ltr;
}

/* standard mode: same ordering, using Dualsub's own renderer classes */
.dualsub-renderer .subtitle-1 {
    direction: rtl;
}
.dualsub-renderer .subtitle-2 {
    direction: ltr;
}

Before:

Screenshot_20240214_130528

After:

Screenshot_20240214_130624

idanrm commented 7 months ago

Thank you, it works. However, I think it is a bit cumbersome to add this code every time we watch a video; it isn't practical for everyday use. Is there a way for the extension to inject this CSS whenever the selected language is an RTL language? (Lists of RTL language codes can be found on Google.)
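As a rough illustration of the idea, language-based CSS selection could be sketched like this. Everything here is an assumption for illustration: the language list is short and not exhaustive, the function names are hypothetical, and the mapping of .subtitle-1 and .subtitle-2 to the first and second selected language is assumed from the CSS posted above rather than from Dualsub's source.

```typescript
// Illustrative, non-exhaustive list of RTL language codes (an assumption).
// "iw" is the legacy code for Hebrew that some services still use.
const RTL_LANGUAGES = ["ar", "he", "iw", "fa", "ur", "yi", "ps", "dv", "ug"];

function isRtlLanguage(languageTag: string): boolean {
    // Compare only the primary subtag, so "ar-EG" is treated like "ar".
    const primary = languageTag.toLowerCase().split(/[-_]/)[0] ?? "";
    return RTL_LANGUAGES.indexOf(primary) !== -1;
}

// Build CSS for Dualsub's standard mode from the two selected languages.
function buildDirectionCss(firstLang: string, secondLang: string): string {
    const dir1 = isRtlLanguage(firstLang) ? "rtl" : "ltr";
    const dir2 = isRtlLanguage(secondLang) ? "rtl" : "ltr";
    return [
        `.dualsub-renderer .subtitle-1 { direction: ${dir1}; }`,
        `.dualsub-renderer .subtitle-2 { direction: ${dir2}; }`,
    ].join("\n");
}
```

This only covers the simple case where each subtitle line is entirely in one language; it does not account for mixed content inside a line.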

muzuiget commented 7 months ago

There are two ways: create a custom extension, or use a userscript extension.

1. Create a custom extension

Create a manifest.json file:

{
    "manifest_version": 3,
    "name": "Bidi",
    "version": "0.1.0",
    "content_scripts": [
        {
            "matches": [
                "https://www.youtube.com/*"
            ],
            "css": [
                "styles.css"
            ]
        }
    ]
}

Then create a styles.css file in the same directory, containing the CSS code I posted above.

2. Use a userscript extension

I have also written a small userscript extension, WildMonkey.

Screenshot_20240215_005245
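For the userscript route, a minimal sketch might look like the following. The metadata block follows the common Greasemonkey/Tampermonkey convention; whether WildMonkey reads the same header is an assumption, and the names BIDI_CSS, StyleHost, and injectBidiCss are hypothetical. Since userscript managers run JavaScript, the type annotations would need to be stripped or compiled away before use.

```typescript
// ==UserScript==
// @name   Dualsub bidi fix (sketch)
// @match  https://www.youtube.com/*
// ==/UserScript==

// Same rules as the CSS posted earlier in this thread.
const BIDI_CSS = [
    ".caption-visual-line:nth-child(1) { direction: rtl; }",
    ".caption-visual-line:nth-child(2) { direction: ltr; }",
    ".dualsub-renderer .subtitle-1 { direction: rtl; }",
    ".dualsub-renderer .subtitle-2 { direction: ltr; }",
].join("\n");

// Minimal structural type so the function does not depend on browser typings.
interface StyleHost {
    createElement(tag: string): { textContent: string | null };
    head: { appendChild(node: any): unknown };
}

// Append a <style> element carrying the direction rules to the page head.
function injectBidiCss(doc: StyleHost): void {
    const style = doc.createElement("style");
    style.textContent = BIDI_CSS;
    doc.head.appendChild(style);
}

// Only inject when a real document exists (i.e. when running in a page).
const pageDocument = (globalThis as any).document;
if (pageDocument) injectBidiCss(pageDocument);
```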

idanrm commented 7 months ago

Thank you, it works well. For me (and for anyone else who sees this issue) the problem is solved. However, why isn't it possible to integrate this CSS change into the Dualsub extension itself whenever an RTL language is selected?

Thanks

muzuiget commented 7 months ago

The actual situation is quite complicated.

This is because sometimes the subtitle text is not wrapped in a single HTML element. For example, Dualsub supports Chinese and Japanese phonetic annotations.

Screenshot_20240220_233726

Also, when subtitles encounter errors, English error messages are displayed, so the subtitle switches back to English.

The solution mentioned above only happens to resolve one specific case.

Therefore, I prefer to keep the browser's default behavior. For example, if your browser UI language is Hebrew, then what actually needs to be corrected is the positioning of punctuation in the English subtitles.