Open hammera opened 6 years ago
This is the HTML content that produces the wrong output:
Innentől folytatódik a sima szöveg. Vajon ez behúzásos bekezdés lesz-e?
This is the hu.cfg file content:
outputFormat
cellsPerLine 32
linesPerPage 25
interpoint yes
emphasis all
braillePages yes
continuePages yes
pageSeparator yes
pageSeparatorNumber yes
numberBraillePages yes
backFormat html
backLineLength 70
hyphenate yes
formatFor textDevice
lineEnd \n
pageEnd \f
beginningPageNumber 1
paragraphs yes
printPages yes
printPageNumberAt top
braillePageNumberAt bottom
outputEncoding utf8
contents yes
lineFill '
topMargin 0.5
leftMargin 1
rightMargin 0.5
bottomMargin 0.5
paperHeight 11
paperWidth 9.5
braillePageNumber
mergeUnnumberedPages yes
pageNumberTopSeparateLine no
pageNumberBottomSeparateLine no
printPageNumberRange yes
ignoreEmptyPages yes
printPageNumbersInContents yes
braillePageNumbersInContents yes
translation
literaryTextTable hu-hu-g1.ctb,hyph_hu_HU.dic
compbrlTable hu-hu-comp8.ctb
uncontractedTable en-us-g1.ctb
mathtextTable hu-hu-g1.ctb
mathexprTable nemeth.ctb
editTable nemeth_edit.ctb
xml
xmlheader "<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
#entity (an entity definition for the DTD)
semanticFiles *,nemeth.sem
internetAccess no
newEntries yes
converterSem utd.sem
mode dotsIO
inputTextEncoding utf8
debug no
style document
linesBefore 0
linesAfter 0
leftMargin 0
firstLineIndent 0
#translationTable (a table name)
skipNumberLines no
format leftJustified
newPageBefore no
newPageAfter no
righthandPage no
braillePageNumberFormat normal
keepWithNext no
dontSplit no
orphanControl 0
newlineAfter yes
style arith
style attribution
format rightJustified
style biblio
style caption
leftMargin 4
firstLineIndent 2
style code
linesBefore 1
linesAfter 1
skipNumberLines yes
format computerCoded
style contentsheader
linesBefore 1
format centered
linesAfter 1
style contents1
firstLineIndent -2
leftMargin 2
format contents
style contents2
firstLineIndent -2
leftMargin 4
format contents
style contents3
firstLineIndent -2
leftMargin 6
format contents
style contents4
firstLineIndent -2
leftMargin 8
format contents
style dedication
newPageBefore yes
newPageAfter yes
format centered
style directions
style dispmath
leftMargin 2
style disptext
leftMargin 2
firstLineIndent 2
style exercise1
leftMargin 2
firstLineIndent -2
style exercise2
leftMargin 4
firstLineIndent -2
style exercise3
leftMargin 6
firstLineIndent -2
style glossary
firstLineIndent 2
style graph
skipNumberLines yes
style graphlabel
style heading1
linesBefore 1
format centered
linesAfter 1
keepWithNext yes
dontSplit yes
style heading2
linesBefore 1
firstLineIndent 4
style heading3
firstLineIndent 4
style heading4
firstLineIndent 4
style index
style line
firstLineIndent -2
leftMargin 2
style list
firstLineIndent -2
leftMargin 2
style matrix
format alignColumnsLeft
style music
skipNumberLines yes
style note
style para
firstLineIndent 2
style quotation
linesBefore 1
linesAfter 1
style section
firstLineIndent 4
style spatial
style stanza
linesBefore 1
linesAfter 1
style style1
style style2
style style3
style style4
style style5
style subsection
firstLineIndent 4
style table
linesBefore 1
linesAfter 1
style titlepage
newPageAfter yes
style trnote
firstLineIndent 7
leftMargin 5
style volume
style boxline
topBoxline c
bottomBoxline c
This is the hyphenate.py file content:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

import louis

TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']


def apply_mask(word, hyphen_mask):
    # Insert "-" before every character whose mask digit is '1'.
    return "".join("-" + a if b == '1' else a for a, b in zip(word, hyphen_mask))


def hyphenate_word(word):
    try:
        hyphenated_word = apply_mask(word, louis.hyphenate(TABLES, word, 0))
    except RuntimeError:
        # Fallback: hyphenate each part of an already hyphenated word separately.
        parts = word.split('-')
        hyphenated_word = '-'.join(
            apply_mask(part, louis.hyphenate(TABLES, part, 0)) for part in parts)
    return hyphenated_word


if __name__ == '__main__':
    word = sys.argv[1]
    hyphenated_word = hyphenate_word(word)
    print('normal word: ' + word)
    print('hyphenated word: ' + hyphenated_word)
This is the relevant part of the incorrect test.brf output: $innent7l fotat9dik a sima 5qveg. $vajon e2 beh02"sos beke- 2d1s le5-e?
The word bekezdés should be broken as be- / ke2d1s here, because the bekez- part does not fit within the 32-character line length.
The internal louis.hyphenate function hyphenates the word bekezdés at the right places (be-kez-dés).
With Hungarian grade 2 braille, the output is fine with each of the us-table.dis, de-eurobrl6.dis and unicode.dis files, except that the word bekezdés is broken as bekez-, and, if I see it correctly, the line is then longer than 32 characters.
Attila
Hi List,
In 2017 Norbert and I found an interesting problem when running file2brl with the following parameters:
file2brl -f hu.cfg -t test.htm test.brf
If anybody would like to try to reproduce or fix this issue, I am attaching four files:
test.htm: the small source HTML document, cut down to the affected HTML part.
test.brf: the incorrectly generated Hungarian grade 1 braille document; its 29th line contains the wrong Hungarian hyphenation.
hu.cfg: my Hungarian-specific settings for file2brl.
On Linux, anyone can reproduce this issue by copying the hu.cfg file into the /usr/share/liblouisutdml/lbu_files directory and running the following command:
file2brl -f hu.cfg -t test.htm test.brf
In the 29th line of the generated test.brf document, the file2brl utility hyphenates the word "bekezdés" incorrectly. In this case the hyphen character lands at the 32nd character position of the 29th line.
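To locate suspicious lines in a larger document, one option is to list every line of the .brf output that is longer than the configured cellsPerLine. The following is only a sketch: the function name overlong_lines is hypothetical, and it assumes the lineEnd \n and pageEnd \f settings from hu.cfg and that one character in the file corresponds to one braille cell.

```python
def overlong_lines(brf_text, cells_per_line=32):
    """Return (line_number, length) for each line longer than cells_per_line.

    Assumes lines end with "\n" and pages end with "\f", as in hu.cfg.
    """
    # Drop page separators so only "\n" delimits lines.
    lines = brf_text.replace("\f", "").split("\n")
    return [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) > cells_per_line
    ]


if __name__ == "__main__":
    # Made-up sample, not the real test.brf: line 2 has 33 characters.
    sample = "x" * 32 + "\n" + "y" * 33 + "\n\f" + "z"
    print(overlong_lines(sample))
```

For a real file, the text can be read with open("test.brf", encoding="utf-8").read() and passed to the function.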
With Liblouis I checked where the word bekezdés may be hyphenated in Hungarian; the correct hyphenation is be-kez-dés. Because the lou_checkhyphens utility cannot test the word bekezdés (it contains an accented character), I wrote a small Python script, hyphenate.py, to make it easy to test any Hungarian word. Its content is shown above.
If I run the command python3 hyphenate.py bekezdés, I get the following correct output:
normal word: bekezdés
hyphenated word: be-kez-dés
I am attaching this small test program too.
The built-in Liblouis hyphenate function confirms that the generated beke- break is not valid. In the 29th line, the first valid hyphenation point that fits within the maximum line length of 32 characters is "be-", and the remaining part "kezdés" has to go on the next line. After manual correction, the correct braille output for the affected text, in Eurobraille format and Hungarian grade 1 braille, is the following:
5qveg. $vajon e2 beh02"sos be- ke2d1s le5-e?
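The rule described here (take the longest syllable prefix that still fits, hyphen included, in the cells left on the line) can be sketched as follows. This is an illustration only, not the file2brl implementation: split_at_line_end is a hypothetical name, the remaining cell count of 5 is made up, and the sketch assumes one braille cell per character, which holds for beke2d1s in grade 1.

```python
def split_at_line_end(syllables, remaining_cells):
    """Split a word at the last syllable boundary that fits on the line.

    Returns (head, tail): head stays on the current line (with a trailing
    hyphen if the word was split), tail goes on the next line.
    """
    word = "".join(syllables)
    if len(word) <= remaining_cells:
        return word, ""  # the whole word fits, no break needed
    # Try the longest prefix first, then shorter ones.
    for count in range(len(syllables) - 1, 0, -1):
        head = "".join(syllables[:count]) + "-"
        if len(head) <= remaining_cells:
            return head, "".join(syllables[count:])
    return "", word  # nothing fits, move the whole word to the next line


if __name__ == "__main__":
    # be-kez-dés in the braille notation used above: be / ke2 / d1s
    print(split_at_line_end(["be", "ke2", "d1s"], 5))
```

With 5 cells left, "beke2-" needs 6 cells and is rejected, so the sketch returns "be-" and "ke2d1s", which matches the manual correction.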
How can this be prevented in automatic braille conversion? For example, is it possible to blacklist this wrong hyphenation, given that Liblouis itself generates a correct hyphenation mask for this word? In short texts such errors are easy to correct, but in a large document intended as a printable braille book, this is a very tedious task for the proofreaders. There is a good chance that a large text contains more issues of this type.
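Until the conversion itself is fixed, one possible aid for proofreaders is to compare each line-end break with the hyphenation mask. The sketch below does not change file2brl and does not call Liblouis; the helper names are hypothetical, and the mask is derived from the known-good hyphenation be-kez-dés, using the same convention as hyphenate.py ('1' marks a character that a hyphen may precede). In real use the mask would come from louis.hyphenate.

```python
def mask_from_hyphenated(hyphenated):
    """Build a '0'/'1' mask from a hyphenated word such as 'be-kez-dés'."""
    mask = []
    pending = False
    for ch in hyphenated:
        if ch == "-":
            pending = True  # the next character starts a new syllable
        else:
            mask.append("1" if pending else "0")
            pending = False
    return "".join(mask)


def is_valid_break(head, mask):
    """True if breaking the word right after `head` is allowed by the mask."""
    position = len(head)
    return 0 < position < len(mask) and mask[position] == "1"


if __name__ == "__main__":
    mask = mask_from_hyphenated("be-kez-dés")
    for head in ("be", "beke", "bekez"):
        print(head + "-", is_valid_break(head, mask))
```

Here the break beke- is reported as invalid, while be- and bekez- are accepted. Mapping the braille characters of a .brf line back to the print word is left out of this sketch.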
I am attaching the affected files.
Attila