junaidiiith / Apertium_Code

how to delimit inline blanks? #7

Closed unhammer closed 8 years ago

unhammer commented 8 years ago

Consider the html input <i>de jour?</i>. This may be tokenised by lt-proc as "de jour" and "?" (with apertium-deshtml|lt-proc we get [<i>]^de jour/de jour<adv>$^?/?<sent>$[<\/i>]), but our deformatter doesn't (and can't) know where the token borders are.

If we turn this into [{<i>}]de [{<i>}]jour? then lt-proc will fail to notice the multiword expression (since there's more than just a simple space in between); it'll basically break all formatted multiwords.

But what's worse is the "?" – if we just "split on spaces" and turn <i>foo?</i> Bar into [{<i>}]foo? Bar, how does lt-proc know that it's supposed to output [{<i>}]^foo/foo<ij>$[{<i>}]^?/?<sent>$ ^Bar/Bar<n>$ and not [{<i>}]^foo/foo<ij>$^?/?<sent>$ ^Bar/Bar<n>$? (How does it know that the ? is also in italics?)

This is also an issue in generation – if the end of the pipeline is foo [{<b>}]bar! we don't know if that's foo <b>bar</b>! or foo <b>bar!</b>.

The simplest solution is that we treat inline blanks as unclosed until the next [. So when we reach the end-tag </i>, we have to start on a superblank, even if the next character is a non-blank (if it is, we immediately close the superblank). So

unhammer commented 8 years ago

(Note: currently the lt-proc tokenisation will not accept as words-with-spaces anything that has a superblank in the middle, which is why I wrote [{<i>}]de[] jour? Yes →translate→ <i>av</i> dagen? Ja – this is not ideal, but not a huge deal either, since most people don't put italics in the middle of multiword expressions … it's something we might fix in lt-proc, but very low priority.)
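To make the "unclosed until the next [" rule concrete on the generation side, here is a minimal Python sketch. It is not the project's reformatter: `render_inline` and `INLINE` are hypothetical names, and the sketch assumes each inline blank holds exactly one simple tag such as `<i>` or `<b>` and that the text contains no escaped brackets.

```python
import re

# One inline blank [{<tag>}], the text it covers (everything up to the
# next '['), and an optional empty blank [] that only serves to close it.
INLINE = re.compile(r"\[\{<(\w+)>\}\]([^\[]*)(\[\])?")

def render_inline(stream):
    """Wrap the text after each [{<tag>}] in <tag>...</tag>, up to the next '['."""
    def repl(m):
        tag, text = m.group(1), m.group(2)
        return f"<{tag}>{text}</{tag}>"
    return INLINE.sub(repl, stream)

print(render_inline("foo [{<b>}]bar!"))        # foo <b>bar!</b>
print(render_inline("foo [{<b>}]bar[]!"))      # foo <b>bar</b>!
print(render_inline("[{<i>}]av[] dagen? Ja"))  # <i>av</i> dagen? Ja
```

With this convention the two readings of foo [{<b>}]bar! are no longer ambiguous: the exclamation mark is inside the bold span unless a [] precedes it. Non-inline blanks such as [<div>] are left untouched by the sketch.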

unhammer commented 8 years ago

Updated http://wiki.apertium.org/wiki/Reordering_superblanks#Possible_solution taking this into account.

junaidiiith commented 8 years ago

In the 4th example, if we close inline tags with [], why haven't we used [] to signify the closing of the 'i' tag? And should there be a '[]' before closing the 'div' tag?

unhammer commented 8 years ago

> In the 4th example, if we close inline tags with [], why haven't we used [] to signify the closing of the 'i' tag? And should there be a '[]' before closing the 'div' tag?

Because the inline tag is closed by the first following '[' (in this case, the [<div>]). The regular non-inline superblank already has a closing tag, which looks like [</div>].
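As an illustration of that rule, here is a small Python sketch that reports which text each inline blank covers. It is not the actual deformatter code: `inline_spans` and `BLANK` are hypothetical names, and the sketch assumes the text between blanks contains no escaped brackets.

```python
import re

# One superblank: "[...]", allowing backslash-escaped characters inside.
BLANK = re.compile(r"\[((?:\\.|[^\]\\])*)\]")

def inline_spans(stream):
    """Return (tags, text) pairs: the text that each inline blank applies to."""
    spans = []
    pos = 0
    open_tags = None  # contents of the inline blank that is still open, if any
    for m in BLANK.finditer(stream):
        text = stream[pos:m.start()]
        if open_tags is not None and text:
            spans.append((open_tags, text))
        body = m.group(1)
        # [{...}] opens an inline blank; any other blank, including the
        # empty [], only closes whatever inline blank was open.
        if body.startswith("{") and body.endswith("}"):
            open_tags = body[1:-1]
        else:
            open_tags = None
        pos = m.end()
    tail = stream[pos:]
    if open_tags is not None and tail:
        spans.append((open_tags, tail))
    return spans

print(inline_spans("[{<i>}]de jour?[<div>]Yes[</div>]"))  # [('<i>', 'de jour?')]
print(inline_spans("[{<i>}]de jour[]?[<div>]Yes[</div>]"))  # [('<i>', 'de jour')]
```

In the first call the [<div>] blank is what closes the italics, so no extra [] is needed; in the second, the [] moves the "?" outside the italic span.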

junaidiiith commented 8 years ago

Please check the updated deformatter. I have used '[]' for EOF, for the end of non-inline tags, and to delimit inline tags. So according to my deformatter, <i>de jour?</i><div>Yes</div> --> deform --> [{<i>}]de jour?[<div>]Yes[][</div>]. The exact input <i>de jour?</i><div>Yes</div> does not work, because the libxml parser gives a parsing error, but with the input `<p> <i>de jour?</i> <div>Yes</div> </p>` it gives the output `[<p> ][{<i>}]de jour?[ <div>]Yes[][<\/div> <\/p>]`. Please review this.

unhammer commented 8 years ago

I get

$ echo ' <p> <i>de jour?</i> <div>Yes</div> </p>'  >/tmp/f && ./deform /tmp/f && cat temp.txt 
[<p1> ][{<i2>}]de jour?[][ ][<div3>]Yes[<\/div>][ ][<\/p>]

$ ./reform 
<p> <i>de jour?</i> <div>Yes</div> </p>

so there are some unnecessary open/close-brackets after the "?", but otherwise it looks good, in particular

$ echo ' <p> <i>de jour</i>? <div>Yes</div> </p>'  >/tmp/f && ./deform /tmp/f && cat temp.txt 
[<p1> ][{<i2>}]de jour[]? [<div3>]Yes[<\/div>][ ][<\/p>]

has the [] on the right spot. (The final close tags still need merging though.)

junaidiiith commented 8 years ago

Uh, unhammer, temp.txt doesn't hold the output. The output is in deformatter_output.txt. Use this: echo ' <p> <i>de jour?</i> <div>Yes</div> </p>' >/tmp/f && ./def /tmp/f && cat deformatter_output.txt.

unhammer commented 8 years ago

Why does it create temp.txt every time it's run then? (Please make it write the main content to stdout!) Also, deformatter_output.txt doesn't have the i-tag closed at the right spot:

$ echo ' <p> <i>de jour</i>? <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f && cat deformatter_output.txt
[<p1> ][{<i2>}]de jour? [<div3>]Yes[<\/div> <\/p>]
$ echo ' <p> <i>de jour?</i> <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f && cat deformatter_output.txt
[<p1> ][{<i2>}]de jour?[ <div3>]Yes[<\/div> <\/p>]
$ echo ' <p> <i>de </i>jour? <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f && cat deformatter_output.txt
[<p1> ][{<i2>}]de jour? [<div3>]Yes[<\/div> <\/p>]

junaidiiith commented 8 years ago

Oh, I fixed the positioning of the inline tag. The temp.txt file is an intermediate step: it adds '[]' at the close of all inline and non-inline tags without merging the closing tags, and it is then processed to merge the closing tags (both inline and non-inline). This is probably not good coding practice, and I will fix it, but I was first focusing on getting the work done. The output now goes to stdout.

unhammer commented 8 years ago

$ echo ' <p> <i>de jour?</i> <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f
[<p1> ][{<i2>}]de jour?[ <div3>]Yes[<\/div> <\/p>]
$ echo ' <p> <i>de jour</i>? <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f 
[<p1> ][{<i2>}]de jour[]? [<div3>]Yes[<\/div> <\/p>]
$ echo ' <p> <i>de jour</i>?  <div>Yes</div> </p>' >/tmp/f && ./deform /tmp/f 
[<p1> ][{<i2>}]de jour[]?[  <div3>]Yes[<\/div> <\/p>]

looking good :) This should probably go into a test set; I'll close this issue for now since the other missing stuff is not quite related.
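A test set along those lines could start as a small table of input/expected pairs. The sketch below is only one possible shape: the three cases are copied from the transcript above, while `CASES` and `run_cases` are hypothetical names, and `deform` is assumed to be any callable that maps an HTML string to deformatter output (for instance a wrapper around the ./deform binary, which is not shown here). Trailing newlines added by echo are ignored in this sketch.

```python
# (input HTML, expected deformatter output) pairs taken from this thread.
CASES = [
    (" <p> <i>de jour?</i> <div>Yes</div> </p>",
     r"[<p1> ][{<i2>}]de jour?[ <div3>]Yes[<\/div> <\/p>]"),
    (" <p> <i>de jour</i>? <div>Yes</div> </p>",
     r"[<p1> ][{<i2>}]de jour[]? [<div3>]Yes[<\/div> <\/p>]"),
    (" <p> <i>de jour</i>?  <div>Yes</div> </p>",
     r"[<p1> ][{<i2>}]de jour[]?[  <div3>]Yes[<\/div> <\/p>]"),
]

def run_cases(deform, cases=CASES):
    """Return (html, expected, actual) for every case where deform(html) differs."""
    failures = []
    for html, expected in cases:
        actual = deform(html)
        if actual != expected:
            failures.append((html, expected, actual))
    return failures
```

An empty list from `run_cases` means every case matched; each failure tuple carries enough context to see where the [] or the merged close tags went wrong.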