nkoriyama / aribb24

A library for ARIB STD-B24, decoding JIS 8 bit characters and parsing MPEG-TS stream.
GNU Lesser General Public License v3.0

Add support for latin encoding according to Brazilian Digital Television System (SBTVD / ISDB-Tb). #16

Open andrelcm opened 4 years ago

andrelcm commented 4 years ago

Dear maintainers,

This pull request adds support for the Latin encoding used by the Brazilian Digital Television System (SBTVD / ISDB-Tb), according to the official standard ABNT 15606-1 (2013). It is related to issue #8. The standard is written in Portuguese, so I will try to translate the relevant parts.

11.4 Character encoding

11.4.1 8-bit character codes

The character encoding using 8 bits must comply with ARIB STD-B5 and the technique described in ARIB STD-B24:2007, volume 1, subsection 7.1, with the adaptations to include Latin characters, as follows. The coding structure used by SBTVD must comply with the technique described in ARIB STD-B24:2007, volume 1, part 2, subsection 7.1.1.1, with the following changes:

a) inclusion of the “latin extension” character codes in the GP character codes. Table 13 presents the “latin extension” character codes and Table 9 presents the special codes for GP character codes;
b) changing the initial state of the GL page to “alphanumeric” and changing the initial state of the GR page to “latin extension” (see Figure 6). Invocation and designation methods should not be used in the Brazilian broadcasting system;
c) classification of the set of codes and final bytes according to Table 15;
d) inclusion of the graphic set of Latin characters (latin extension) and special characters according to Table 15.

NOTE 1 Table 13 was adapted from ISO/IEC 8859-15:1999.
NOTE 2 Table 15 presents the modified excerpt from Table 7-3 of ARIB STD-B24:2007 for SBTVD.
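Table 13 itself is not reproduced in this thread, so the following is only a rough sketch of its starting point: the standard ISO/IEC 8859-15 to Unicode mapping for the upper half of the byte range. The function name `iso8859_15_to_unicode` is hypothetical and is not part of aribb24; the actual SBTVD table is an adaptation and may differ in some positions.

```c
#include <stdint.h>

/*
 * Map one ISO/IEC 8859-15 byte to its Unicode code point.
 * ISO/IEC 8859-15 matches ISO/IEC 8859-1 (and therefore Unicode
 * U+0000..U+00FF) everywhere except the eight positions below.
 * Assumption: this is plain 8859-15, NOT the adapted Table 13.
 */
static uint32_t iso8859_15_to_unicode(uint8_t b)
{
    switch (b) {
    case 0xA4: return 0x20AC; /* euro sign */
    case 0xA6: return 0x0160; /* S with caron */
    case 0xA8: return 0x0161; /* s with caron */
    case 0xB4: return 0x017D; /* Z with caron */
    case 0xB8: return 0x017E; /* z with caron */
    case 0xBC: return 0x0152; /* OE ligature */
    case 0xBD: return 0x0153; /* oe ligature */
    case 0xBE: return 0x0178; /* Y with diaeresis */
    default:   return b;      /* identical to ISO/IEC 8859-1 */
    }
}
```

The accented letters that matter most for Portuguese and Spanish captions (for example 0xE3 and 0xE7) fall in the default branch, so they map to the same code point value as the byte.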

11.6 Captions and superimposed characters

The encoding of captions and superimposed characters must comply with the method described in ARIB STD-B24:2007, volume 1, part 3, with the following changes:

- change of the initial state of the system (presented in ARIB STD-B24:2007, volume 1, part 3, Table 8-2) according to the values presented in Table 16;
- use of G0 and G2 as the initial state;
- G3 is used by the SS3 code (0x1D). SS3 means invoking a G3 code by placing it in the GL area temporarily.
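To illustrate how this initial state and the SS3 single shift interact, here is a minimal sketch that labels each input byte with the G set that would decode it. It assumes GL stays on G0 and GR stays on G2 for the whole stream (no locking shifts), and that SS3 affects only the next GL byte. The names `enum gset` and `classify_bytes` are hypothetical and do not come from aribb24.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the four graphic sets. */
enum gset { GSET_G0, GSET_G1, GSET_G2, GSET_G3, GSET_NONE };

#define CODE_SS3 0x1D /* single shift 3, as quoted from section 11.6 */

/*
 * For each byte of `in`, store in `out` the G set that would decode it.
 * Control bytes (including SS3 itself) are marked GSET_NONE.
 * Simplification: a pending single shift is kept until the next GL byte.
 */
static void classify_bytes(const uint8_t *in, size_t len, enum gset *out)
{
    int single_shift = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];

        if (b == CODE_SS3) {
            single_shift = 1;          /* next GL byte comes from G3 */
            out[i] = GSET_NONE;
        } else if (b >= 0x21 && b <= 0x7E) {
            out[i] = single_shift ? GSET_G3 : GSET_G0;
            single_shift = 0;          /* the shift covers one character only */
        } else if (b >= 0xA1 && b <= 0xFE) {
            out[i] = GSET_G2;          /* GR area: latin extension */
        } else {
            out[i] = GSET_NONE;        /* other control codes, space, DEL */
        }
    }
}
```

With this model, the byte sequence 0x41 0x1D 0x21 0x41 is read as one G0 character, the SS3 control, one G3 character, and then a G0 character again.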

Table 16 specifies the following designations:

Thus, I created an initialization for the Latin decoder with the following code:

    decoder->handle_g0 = decoder_handle_alnum_latin;
    decoder->handle_g1 = decoder_handle_alnum_latin;
    decoder->handle_g2 = decoder_handle_latin_extension;
    decoder->handle_g3 = decoder_handle_latin_special;

Since the standard specifies no technique for switching to these designations, I added code to 'parse_caption_management_data' to detect the language. If it is Portuguese or Spanish (this standard is also used in Argentina), an attribute in the instance is set so that the Latin initialization is used. I tested the code with VLC and the files supplied in #15. The Japanese portion does not seem to be broken, but I am not sure since I don't know Japanese. If you want, you can also test with a dump I made from a Brazilian broadcast: test.ts.
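The language check described above could look roughly like the sketch below. It assumes the caption management data carries a three-letter ISO 639-2 language code and that Portuguese and Spanish appear as "por" and "spa". The function name `language_uses_latin_init` is hypothetical and is not the code from this pull request.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true when the three-letter ISO 639-2 code names a language
 * that should use the Latin initialization.
 * `iso639` must point to at least three characters; it does not need
 * to be NUL-terminated, since only the first three bytes are compared.
 */
static bool language_uses_latin_init(const char *iso639)
{
    return strncmp(iso639, "por", 3) == 0 ||
           strncmp(iso639, "spa", 3) == 0;
}
```

Any other code, such as "jpn", leaves the existing Japanese initialization in place, which is what keeps the change from affecting Japanese streams.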

Regards, André Moreira

fcartegnie commented 4 years ago

Latin already means something else compared to ASCII. Do we need to name the encoding "latin"? Is there no more specific name? Maybe we should name it after the standard?

andrelcm commented 4 years ago

"Latin" is the term employed by the standard. Not sure if it is the best one, but I'd be ok to use another word.

andrelcm commented 4 years ago

Hi, Cartegnie. Do you want me to change something?