liblouis / liblouisutdml

An open-source library providing complete braille transcription services for xml, html and text documents
http://liblouis.io
GNU General Public License v3.0

Stray non-breaking space in BRF output #82

Open rbeezer opened 2 years ago

rbeezer commented 2 years ago

I'm getting what I think is a stray non-breaking space in BRF output.

  1. I apply file2brl (Version 2.11.0) to an HTML file purpose-built for translation via this method.

  2. HTML contains

    <div data-braille="tableofcontents">Contents</div>

  3. Semantic file contains

    contentsheader div,data-braille,tableofcontents

  4. Output BRF has

,3t5ts

as the ToC header, with a single U+00A0 after the final "s" and before the newline. It is clearly visible in my pager (less) and by other means.

I looked through the source but couldn't see where to make a change that I could test and turn into a pull request.

Thanks for any help you can provide. This is causing me to use an incorrect encoding in a Python program that parses the BRF.

https://github.com/PreTeXtBook/pretext/blob/d402bdb3613d95984708150abe2fdb33123f565a/pretext/pretext.py#L2209
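As a stopgap on the parsing side, the stray character could be dropped before decoding. This is only a minimal sketch, not the code at the link above: the function name `clean_brf` is hypothetical, and it assumes the U+00A0 is never meaningful in the BRF and may arrive either as a single Latin-1 byte or as a two-byte UTF-8 sequence.

```python
# Assumption: U+00A0 never carries meaning in the BRF, so it is safe to drop.
NBSP_UTF8 = b"\xc2\xa0"    # U+00A0 encoded as UTF-8
NBSP_LATIN1 = b"\xa0"      # U+00A0 encoded as Latin-1


def clean_brf(data: bytes) -> str:
    """Drop stray U+00A0 bytes, then decode the remainder as plain ASCII.

    The UTF-8 form is removed first so its trailing 0xA0 byte is not
    handled separately and a lone 0xC2 left behind.
    """
    data = data.replace(NBSP_UTF8, b"").replace(NBSP_LATIN1, b"")
    # Raises UnicodeDecodeError if any other non-ASCII byte remains.
    return data.decode("ascii")


if __name__ == "__main__":
    # Hypothetical sample imitating the reported ToC header line.
    print(repr(clean_brf(b",3t5ts\xa0\n")))
```

Decoding as strict ASCII afterwards means any other unexpected byte still fails loudly instead of being silently accepted.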

bertfrees commented 2 years ago

Hi Rob, I've transferred this issue to the liblouisutdml repository because I think it's unlikely that this is a Liblouis issue.

Perhaps what would help to track down this issue is a (minimal) test with input HTML, configuration files (ini, cfg and sem files), translation tables and command line arguments.

rbeezer commented 2 years ago

Thanks, Bert. I forgot there are two repositories. :-( Of course, I should have been poking around in this one.

I'll dig a bit deeper, and as a last resort construct a minimal example.

rbeezer commented 1 year ago

It is not visible here, but there is a non-breaking space (U+00A0) that is output immediately after Contents. So you will need to produce the output and examine the nature of the "extra" character.
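One way to examine the extra character is to scan the raw bytes of the output for anything outside 7-bit ASCII. This is a rough sketch: the helper name `find_non_ascii` and the sample bytes are made up for illustration, and how the U+00A0 is actually encoded in the real file is something the scan should reveal, not something assumed here.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, column, byte_value) for each byte above 0x7F.

    Line numbers and columns are 1-based; columns count bytes, not characters.
    """
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: a heading followed by a single 0xA0 byte.
    # For a real file, read it with open(path, "rb").read() instead.
    sample = b",3t5ts\xa0\n,f/ ,divi.n\n"
    for line_no, col, byte in find_non_ascii(sample):
        print(f"line {line_no}, column {col}: byte 0x{byte:02X}")
```

A lone 0xA0 would point to a Latin-1 style encoding, while 0xC2 followed by 0xA0 would point to UTF-8.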

It looks like the "format centered" setting in the contentsheader style is to blame.

Minimal example attached.

contents-space.zip

Use

file2brl -f minimal.cfg source.html

Output is

                ,3t5ts 
,f/ ,divi.n

                                      #a
  ,f/ ,divi.n
,"s 3t5t4

                                      #a