mantas-done / subtitles

Subtitle/caption converter
https://gotranscript.com/subtitle-converter
MIT License

Unicode problem in SmiConverter.php #87

Closed hyoungki-kim closed 8 months ago

hyoungki-kim commented 8 months ago

Hello.

Unicode characters are broken when loading an SMI file. This is the code responsible:

@$doc->loadHTML($file_content); // silence warnings about invalid html

Please insert the following line before the loadHTML call, like this:

$file_content = mb_convert_encoding($file_content, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($file_content); // silence warnings about invalid html

mantas-done commented 8 months ago

@hyoungki-kim Hi, could you also attach the file that has the problem? I will use it for the unit test.

hyoungki-kim commented 8 months ago

This is the file (the language is Korean): sbs-das_2023.zip

hyoungki-kim commented 8 months ago

You can also refer to a comment on php.net, which reads as follows:

Text-encoding HTML-ENTITIES will be deprecated as of PHP 8.2. To convert all non-ASCII characters into entities (to produce pure 7-bit HTML output), I was using:

echo mb_convert_encoding( htmlspecialchars( $text, ENT_QUOTES, 'UTF-8' ), 'HTML-ENTITIES', 'UTF-8' );

I can get the identical result with:

echo mb_encode_numericentity( htmlentities( $text, ENT_QUOTES, 'UTF-8' ), [0x80, 0x10FFFF, 0, ~0], 'UTF-8' );

The output contains well-known named entities for some often used characters and numeric entities for the rest.

However, our $file_content is not HTML, so the htmlentities() step is not needed. The following code is correct and works well:

$file_content = mb_encode_numericentity( $file_content, [0x80, 0x10FFFF, 0, ~0], 'UTF-8' );
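To illustrate what the line above does, here is a minimal sketch, assuming the mbstring and dom extensions are enabled. The helper names `encodeNonAscii` and `extractText` and the sample markup are hypothetical and are not taken from the library. The idea is that every code point from U+0080 upward becomes a numeric entity, so the parser only ever sees ASCII and cannot misread the UTF-8 bytes.

```php
<?php
// Hypothetical helper (not part of the library): turn every non-ASCII
// character into a numeric entity while leaving ASCII markup untouched.
function encodeNonAscii(string $content): string
{
    // Convmap: [first code point, last code point, offset, mask]
    return mb_encode_numericentity($content, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');
}

// Hypothetical helper: parse the entity-encoded markup and return its text.
function extractText(string $content): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML(encodeNonAscii($content)); // silence warnings about invalid html
    return trim($doc->documentElement->textContent);
}

// "가" is U+AC00, which is 44032 in decimal.
echo encodeNonAscii("<P>가"), "\n"; // prints <P>&#44032;

// The parser decodes the entities back into the original characters.
echo extractText("<p>안녕하세요</p>"), "\n";
```

Because the tags themselves are plain ASCII, they pass through unchanged, which is why no htmlentities() call is applied first.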

mantas-done commented 8 months ago

Thank you. Updated the code and released a new package version.