junaidiiith / Apertium_Code

superblanks should be merged where possible #5

Open unhammer opened 8 years ago

unhammer commented 8 years ago

The current README example gives:

$ cat deformatter_output.txt 
[<div1>]
 [<p2>]
 [<i3>]hulo [<i4>]broo 
 [<u5>]how [<u6>]
 [{<b7><u8>}]you [<u9>]
 [{<em10><u11>}]doin' 
 [</p>]
 [</div>]

This should be

[<div1>
 <p2>
 <i3>]hulo [<i4>]broo[ 
 <u5>]how [<u6>
 ][{<b7><u8>}]you [<u9>
 ][{<em10><u11>}]doin'[][ 
 </p>
 </div>]

I.e.:

  1. any run of whitespace longer than a single space is wrapped in brackets,
  2. brackets of the same type are merged, and
  3. there's a [] before EOF or before closing paragraphs (closing non-inline tags)
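
As a rough illustration of rule 2 only, here is a minimal Python sketch. The function name `merge_superblanks` is hypothetical and not part of this repo or of apertium-deshtml. It assumes tag contents never contain square brackets, and it leaves `[{...}]` inline blanks untouched. Rules 1 and 3 are not implemented.

```python
import re

# A plain superblank: "[...]" that is not an inline "[{...}]" blank and
# whose opening bracket is not backslash-escaped.
# Assumption: the contents never contain "[" or "]".
PLAIN = r"(?<!\\)\[(?!\{)[^\[\]]*\]"

# Two or more plain superblanks separated only by whitespace.
RUN = re.compile(rf"{PLAIN}(?:\s*{PLAIN})+")


def merge_superblanks(text):
    """Merge runs of adjacent plain superblanks into a single superblank."""
    def join(match):
        inner = match.group(0)[1:-1]
        # Drop each "]...[" boundary but keep the whitespace between,
        # so line breaks end up inside the merged superblank.
        return "[" + re.sub(r"\](\s*)\[", r"\1", inner) + "]"

    return RUN.sub(join, text)


print(merge_superblanks("[<div1>]\n [<p2>]\n [<i3>]hulo [<i4>]broo "))
```

On the first lines of the README example this gives `[<div1>`, ` <p2>`, ` <i3>]hulo [<i4>]broo `, with the line breaks kept inside the merged superblank.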

This may seem a bit arbitrary, but we want as few unnecessary differences as possible from the current apertium-deshtml, to make integration as easy as possible.

junaidiiith commented 8 years ago

Ok I'll fix this

junaidiiith commented 8 years ago

Superblanks merged!

unhammer commented 8 years ago

You removed the line breaks though. In HTML, a space vs. non-space actually matters (although the number of spaces doesn't matter outside <pre> or <code>), while in TeX and other formats, a double line break works as a paragraph separator.

Looking at input.html, and comparing with how apertium-deshtml works, I'd expect something like

[<div id="someid">
  <p class="some class" id="some id">
    ][{<i>}]hello brother[
    ][{<u style="italic">}]how[
    ][{<b>}]are you[
    ][{<u style="italic"><em>}]doing?[
  </p>
</div>
]

TinoDidriksen commented 8 years ago

What I do where structural whitespace matters is put it into the tag, so that x <p> word </p> y becomes something like x <p outer-space-before=" " outer-space-after=" " inner-space-before=" " inner-space-after=" "> word </p> y. Then the translation chain can freely mangle whitespace all it likes, because the post-processor can restore the exact spacing.

Dunno if that's at all relevant when superblanks can mostly do the same, but might help with some formats.
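
A minimal Python sketch of that idea, for illustration only. The attribute names are the ones from the example above. The function names `stash` and `restore` are hypothetical. It only handles bare, non-nested <p> elements, and it assumes nothing in the chain touches the attribute values. A real version would probably also need to encode newlines and tabs inside the attributes.

```python
import re

# Bare <p>...</p> with the whitespace around and just inside the tags captured.
ELEM = re.compile(r"(\s*)<p>(\s*)(.*?)(\s*)</p>(\s*)", re.DOTALL)

# The same element after stash() has recorded the whitespace in attributes.
ANNOTATED = re.compile(
    r'\s*<p outer-space-before="([^"]*)" outer-space-after="([^"]*)" '
    r'inner-space-before="([^"]*)" inner-space-after="([^"]*)">'
    r"\s*(.*?)\s*</p>\s*",
    re.DOTALL,
)


def stash(text):
    """Record the whitespace around and inside each <p> as tag attributes."""
    def repl(m):
        ob, ib, body, ia, oa = m.groups()
        attrs = (
            f'outer-space-before="{ob}" outer-space-after="{oa}" '
            f'inner-space-before="{ib}" inner-space-after="{ia}"'
        )
        return f"{ob}<p {attrs}>{ib}{body}{ia}</p>{oa}"

    return ELEM.sub(repl, text)


def restore(text):
    """Rebuild the original spacing from the attributes, ignoring current spacing."""
    def repl(m):
        ob, oa, ib, ia, body = m.groups()
        return f"{ob}<p>{ib}{body}{ia}</p>{oa}"

    return ANNOTATED.sub(repl, text)


original = "x <p> word </p> y"
stashed = stash(original)
# Simulate a translation chain that mangles the whitespace outside the attributes.
mangled = stashed.replace("x <p", "x<p").replace("> word </p> y", ">word</p>   y")
print(restore(mangled) == original)
```

The point of the round trip is that `restore` never looks at the whitespace that survived the chain, only at what was stashed in the attributes.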

unhammer commented 8 years ago

Yeah, that sounds like a safe way to do it, though I think we should also build the rest of the chain so it still doesn't change plain spaces outside superblanks unless it's meaningful for translation.