liblouis / liblouisutdml

An open-source library providing complete braille transcription services for xml, html and text documents
http://liblouis.io
GNU General Public License v3.0

file2brl generates a wrongly hyphenated word in a Hungarian Eurobraille document #48

Open hammera opened 6 years ago

hammera commented 6 years ago

Hi List,

In 2017 Norbert and I found an interesting situation when using file2brl with the following parameters: file2brl -f hu.cfg -t test.htm test.brf

If anybody would like to reproduce or fix this issue, I am attaching four files: the three listed here and the small test script described further below.

test.htm: the small source HTML document, cut down to the affected part.
test.brf: the wrongly generated Hungarian grade 1 braille document; line 29 contains the wrong Hungarian hyphenation.
hu.cfg: my Hungarian-specific preferences for file2brl.

On Linux, anybody can reproduce this issue by copying the hu.cfg file into the /usr/share/liblouisutdml/lbu_files directory and running the following command: file2brl -f hu.cfg -t test.htm test.brf

In line 29 of the generated test.brf document, file2brl hyphenates the word "bekezdés" wrongly. The hyphen character lands at character position 32 of line 29.

With Liblouis I verified where the word bekezdés can be hyphenated in Hungarian. The correct hyphenation is: be-kez-dés. Because the lou_checkhyphens utility cannot test the word bekezdés (it contains an accented character), I wrote a small Python script to easily test any Hungarian word. The code is the following:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys

import louis

TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']


def mark_hyphens(word):
    # louis.hyphenate returns a mask; a '1' marks a break before that character.
    hyphen_mask = louis.hyphenate(TABLES, word, 0)
    return "".join("-" + ch if flag == '1' else ch
                   for ch, flag in zip(word, hyphen_mask))


def hyphenate_word(word):
    try:
        return mark_hyphens(word)
    except RuntimeError:
        # Fallback for words that already contain hyphens:
        # hyphenate each part separately and join the parts again.
        return "-".join(mark_hyphens(part) for part in word.split('-'))


word = sys.argv[1]
print('normal word: ' + word)
print('hyphenated word: ' + hyphenate_word(word))

If I run the command python3 hyphenate.py bekezdés, I get the following correct output:

normal word: bekezdés
hyphenated word: be-kez-dés

I am attaching this small test program too.

The Liblouis built-in hyphenate function confirms that the generated "beke-" break is not valid. In line 29, the longest valid part that fits within the maximum line length of 32 characters is "be-", and the "kezdés" part needs to go on the next line. After manual correction, the right braille output for the affected text, in Eurobraille format and Hungarian grade 1 braille, is the following:

5qveg. $vajon e2 beh02"sos be-
ke2d1s le5-e?
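To make the expected behaviour concrete, here is a minimal Python sketch of choosing the break point. It is not file2brl's actual code. The function name is hypothetical, the mask "00100100" is an assumption derived from the be-kez-dés result above, and the sketch assumes one braille cell per print character, which holds for this word in the output shown here. On line 29 the text before the word takes 27 cells, so 5 cells remain.

```python
def split_at_last_fitting_break(word, mask, room):
    """Split word at the last allowed break whose head plus hyphen fits in room cells.

    mask uses the louis.hyphenate convention: '1' at index i allows a break
    before word[i]. Returns (head_with_hyphen, tail), or ("", word) if no
    allowed break fits (hypothetical helper, for illustration only).
    """
    best = 0
    for i, flag in enumerate(mask):
        # The head occupies i cells, plus one cell for the hyphen.
        if flag == '1' and i + 1 <= room:
            best = i
    if best == 0:
        return "", word
    return word[:best] + "-", word[best:]


# Braille cells of "bekezdés" as they appear in test.brf, with the assumed mask.
head, tail = split_at_last_fitting_break("beke2d1s", "00100100", 5)
print(head, tail)  # be- ke2d1s
```

With 5 cells of room, "bekez-" would need 6 cells and is rejected, so the only valid choice is "be-". The "beke-" in the generated file is exactly 5 cells long but does not correspond to any '1' in the mask.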

How can this situation be prevented during automatic braille conversion? For example, is it possible to blacklist this wrong hyphenation, given that Liblouis itself generates a correct hyphenation mask for this word? In small texts this type of error is easy to correct, but in a large document intended as a printable braille book it is a very tedious task for the proofreaders. There is a good chance that a large text contains more issues of this type.

I am attaching the affected files. Attila

hammera commented 6 years ago

This is the HTML file that produces the wrong output

hammera commented 6 years ago

This is the HTML content that produces the wrong output:

teszt

Innentől folytatódik a sima szöveg. Vajon ez behúzásos bekezdés lesz-e?

This is the hu.cfg file content:

# This file contains all possible configuration settings, with sample
# values, where appropriate. It is used by the file2brl command-line
# interface if no configuration file is given. It is also part of the
# documentation.

outputFormat

# The number of cells on a line in Braille translations
    cellsPerLine 32

# The number of lines per page in Braille translations
    linesPerPage 25

# Whether to format the Braille translation for output on interpoint embossers
# NOTE: Other than the formatting, there is nothing specific to interpoint
# embossing. This means that even if this is set to no, the output can
# still be embossed on an interpoint embosser
    interpoint yes

# What emphasis to include, comma separated list using values:
# italic, bold, computerBraille and underline.
# For all emphasis you may just use the value all
    emphasis all

# Whether to separate into Braille pages. If no then linesPerPage is ignored.
    braillePages yes

# When numbering print pages should continuation be indicated in page numbers.
# For example first Braille page of print page 26 it will be a26,
# second page will be b26, etc.
    continuePages yes

# Whether to include a print page separator mark in Braille at print page
# breaks.
    pageSeparator yes

# Whether to include a page number on the page separator line.
    pageSeparatorNumber yes

# Include Braille page numbers
    numberBraillePages yes

# What format should be produced from back translations
    backFormat html

# Line length for files produced from backtranslation.
    backLineLength 70

# Hyphenate translations
    hyphenate yes

# What type of Braille device should output be formatted for.
    formatFor textDevice

# What characters mark a line ending, mostly relevant for text/brf format.
    lineEnd \n

# What character marks end of page, again mostly suitable for text/brf format.
    pageEnd \f

# What page number should Braille page numbers start from
    beginningPageNumber 1

# Whether to format paragraphs. If set to no then a paragraph is one long
# line and cellsPerLine is ignored
    paragraphs yes

# Whether to show print page numbers
    printPages yes

# Where to place print page numbers
    printPageNumberAt top

# Where to place Braille page numbers
    braillePageNumberAt bottom

# Encoding of output file.
    outputEncoding utf8

# Whether to produce a table of contents
    contents yes

# The character to fill lines with (eg. in tables tracker dots)
    lineFill '

# The below settings for margins and paper dimensions are only used for UTD
# output. When formatting for UTD cellsPerLine and linesPerpage are
# ignored.

# The margin at the top of the page in inches
    topMargin 0.5

# The margin at the left of the page
    leftMargin 1

# The margin at the right of the page
    rightMargin 0.5

# The margin at the bottom of the page
    bottomMargin 0.5

# Height of the Braille page in inches
    paperHeight 11

# Width of the Braille page in inches
    paperWidth 9.5
    braillePageNumber

# If a print page has no page number, do not insert a page separator and
# so merge it with the previous page in the Braille translation
    mergeUnnumberedPages yes

# Whether to place any page numbers at the top of a page on a separate line
    pageNumberTopSeparateLine no

# Whether to place any page numbers at the bottom of a page on a separate line
    pageNumberBottomSeparateLine no

# If a Braille page has more than one print page on it, whether to show the
# range of print page numbers present on the Braille page.
    printPageNumberRange yes

# If there is an empty page in Print whether to ignore it in Braille.
    ignoreEmptyPages yes

# Whether to include print page numbers in table of contents
    printPageNumbersInContents yes

# Whether to include Braille page numbers in table of contents
    braillePageNumbersInContents yes

translation

# What Braille table to use for the literary text
    literaryTextTable hu-hu-g1.ctb,hyph_hu_HU.dic

# What table to use for computer Braille
    compbrlTable hu-hu-comp8.ctb

# What table to use for uncontracted Braille
# NOTE: This setting is possibly deprecated.
    uncontractedTable en-us-g1.ctb

# What table to use for non-mathematical content in books containing maths.
# This option is normally not needed in many codes and so should be the
# same as literaryTextTable.
    mathtextTable hu-hu-g1.ctb

# What Braille table to use for mathematical content
    mathexprTable nemeth.ctb

# What table should be used to edit together parts of documents (eg. to
# join maths and text)
    editTable nemeth_edit.ctb

xml

# The XML header assumed for XML input documents with no header
    xmlheader "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>

# Entity definitions
    #entity (an entity definition for the DTD)

# The semantic action files to be used
    semanticFiles *,nemeth.sem

# Whether to use the internet to get DTDs
    internetAccess no

# Whether to create new semantic action definitions
    newEntries yes

# What semantic action file to convert from UTD.
    converterSem utd.sem

(miscellaneous)

# Directive for including other configuration files

# The mode for translation
    mode dotsIO

# The input encoding of text files
    inputTextEncoding utf8

# Whether to use debug mode
    debug no

# You can override any style setting and define new styles.
# A style name will normally match the semantic action name
# Refer to the liblouisutdml documentation for details on possible options
# which can be used in styles.

style document
# This style contains all possible style settings.
    linesBefore 0
    linesAfter 0
    leftMargin 0
    firstLineIndent 0
    #translationTable (a table name)
    skipNumberLines no
    format leftJustified
    newPageBefore no
    newPageAfter no
    righthandPage no
    braillePageNumberFormat normal
    keepWithNext no
    dontSplit no
    orphanControl 0
    newlineAfter yes

style arith
style attribution
    format rightJustified
style biblio
style caption
    leftMargin 4
    firstLineIndent 2
style code
    linesBefore 1
    linesAfter 1
    skipNumberLines yes
    format computerCoded
style contentsheader
    linesBefore 1
    format centered
    linesAfter 1
style contents1
    firstLineIndent -2
    leftMargin 2
    format contents
style contents2
    firstLineIndent -2
    leftMargin 4
    format contents
style contents3
    firstLineIndent -2
    leftMargin 6
    format contents
style contents4
    firstLineIndent -2
    leftMargin 8
    format contents
style dedication
    newPageBefore yes
    newPageAfter yes
    format centered
style directions
style dispmath
    leftMargin 2
style disptext
    leftMargin 2
    firstLineIndent 2
style exercise1
    leftMargin 2
    firstLineIndent -2
style exercise2
    leftMargin 4
    firstLineIndent -2
style exercise3
    leftMargin 6
    firstLineIndent -2
style glossary
    firstLineIndent 2
style graph
    skipNumberLines yes
style graphlabel
style heading1
    linesBefore 1
    format centered
    linesAfter 1
    keepWithNext yes
    dontSplit yes
style heading2
    linesBefore 1
    firstLineIndent 4
style heading3
    firstLineIndent 4
style heading4
    firstLineIndent 4
style index
style line
    firstLineIndent -2
    leftMargin 2
style list
    firstLineIndent -2
    leftMargin 2
style matrix
    format alignColumnsLeft
style music
    skipNumberLines yes
style note
style para
    firstLineIndent 2
style quotation
    linesBefore 1
    linesAfter 1
style section
    firstLineIndent 4
style spatial
style stanza
    linesBefore 1
    linesAfter 1
style style1
style style2
style style3
style style4
style style5
style subsection
    firstLineIndent 4
style table
    linesBefore 1
    linesAfter 1
style titlepage
    newPageAfter yes
style trnote
    firstLineIndent 7
    leftMargin 5
style volume
style boxline
    topBoxline c
    bottomBoxline c

The hyphenate.py file content is the same script as in my first comment above.

This is the wrong part of the test.brf content:

$innent7l fotat9dik a sima
5qveg. $vajon e2 beh02"sos beke-
2d1s le5-e?

The word bekezdés needs to be broken as "be-" followed by "ke2d1s" on the next line, because the "bekez-" part does not fit within the 32-character line length.

So the internal louis.hyphenate function hyphenates the word bekezdés at the right places (be-kez-dés).

With Hungarian grade 2 braille, using any of the us-table.dis, de-eurobrl6.dis and unicode.dis files is OK, except that the word bekezdés is broken after "bekez-" and, if I see it right, the line then becomes longer than 32 characters.
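A quick way to check whether any line really exceeds the configured width is a small script like the following. This is only a sketch: the function name is hypothetical, and it assumes the BRF is plain text with one character per cell, lines separated by \n and pages by \f (as set by lineEnd and pageEnd in hu.cfg), with 32 matching cellsPerLine.

```python
def overlong_lines(brf_text, cells_per_line=32):
    """Return (line_number, length) for every line longer than cells_per_line.

    Hypothetical helper: form feeds and carriage returns are not counted as cells.
    """
    found = []
    for number, line in enumerate(brf_text.split("\n"), start=1):
        cells = line.replace("\f", "").rstrip("\r")
        if len(cells) > cells_per_line:
            found.append((number, len(cells)))
    return found


# The two grade 1 lines quoted in this report; the first is exactly 32 cells.
sample = '5qveg. $vajon e2 beh02"sos beke-\n2d1s le5-e?\n'
print(overlong_lines(sample))
```

To check a generated file, read it with open("test.brf", encoding="utf-8").read() and pass the text to overlong_lines.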

Attila