junaidiiith / Apertium_Code

html: standoff vs in-document markup #8

Open unhammer opened 8 years ago

unhammer commented 8 years ago

Currently, the output of the html deformatter is a mix of stand-off and in-document markup. For example, `[<p1> ][{<i2>}]de jour[]? [<div3>]Yes[<\/div> <\/p>]` keeps the tag name "p" in the stream but hides its attributes. I think it should be one or the other, i.e. either completely stand-off, `[1][{2}]de jour[]?[3]Yes[4]`, or completely in-document, `[<p id="foo"> ][{<i ng-click="x">}]de jour[]? [<div class="meh">]Yes[<\/div> <\/p>]`.

I prefer in-document, since it doesn't require keeping temporary files around, which can be handy when developing (you can just pipe a single stream around), although I see the value in stand-off as well (easier to focus on content when seeing just part of the stream, for one).

Perhaps we could have both? E.g. the default is in-document as in apertium-deshtml, while if you run `deshtml2 -s markup.txt < in.html >out.html` then markup.txt contains the definitions of the references, and out.html is stand-off annotated.
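To make the two modes concrete, here is a minimal sketch of turning an in-document stream into a stand-off stream plus a reference table. The function name `to_standoff` and the numbering scheme (sequential ids in order of appearance) are assumptions for illustration, not the behaviour of any existing Apertium tool; the sketch also assumes that a backslash escapes the next character inside a blank.

```python
import re

# One blank: '[' ... ']', where a backslash escapes the next character.
BLANK = re.compile(r"\[((?:\\.|[^\]\\])*)\]", re.DOTALL)

def to_standoff(stream):
    """Replace each non-empty blank with a numeric id.

    Returns (stand-off stream, {id: original markup}).
    """
    mapping = {}

    def replace(match):
        content = match.group(1)
        if content == "":
            return "[]"  # empty blanks carry no markup, keep them as they are
        inline = content.startswith("{") and content.endswith("}")
        if inline:
            content = content[1:-1]  # drop the braces, keep the markup
        ref = len(mapping) + 1
        mapping[ref] = content
        return "[{%d}]" % ref if inline else "[%d]" % ref

    return BLANK.sub(replace, stream), mapping
```

Run on the in-document example from the first comment, this yields the stream `[1][{2}]de jour[]? [3]Yes[4]`, with the four markup strings stored under ids 1 to 4; writing that table to a file would play the role of markup.txt.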

junaidiiith commented 8 years ago

Yes, we can do it both ways. In fact, using just the ids will be simpler to handle, as we can make a pair of id and type (inline or non-inline) and put it in the tags. Should I do it?

unhammer commented 8 years ago

You don't need a type=(inline|non-inline) in the stand-off document; the stream format gives you that. If you see `[2]` then you know it's non-inline, if you see `[{1}]` then you know it's inline; the stand-off document just needs a mapping from ids to content:

`1 = <em class="bar">`
`2 = <div id="foo"><p><p style="horrible-word-html: 110%"><br/>`
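As a rough illustration of that point, the sketch below reads the kind of each reference off the stream syntax alone and looks the content up in a plain id-to-markup table. The names `MARKUP`, `REF` and `classify` are made up for this example; the two table entries are the ones given above.

```python
import re

# Stand-off table: id -> original markup, with no "type" column.
MARKUP = {
    1: '<em class="bar">',
    2: '<div id="foo"><p><p style="horrible-word-html: 110%"><br/>',
}

# '[{N}]' is an inline blank, '[N]' is a non-inline blank.
REF = re.compile(r"\[(\{)?(\d+)(\})?\]")

def classify(stream):
    """Return (id, kind, markup) for every numbered blank in the stream."""
    found = []
    for match in REF.finditer(stream):
        opening, ref, closing = match.groups()
        if bool(opening) != bool(closing):
            continue  # unbalanced braces: not a well-formed reference
        kind = "inline" if opening else "non-inline"
        found.append((int(ref), kind, MARKUP.get(int(ref))))
    return found
```

For the stream `[2][{1}]some text[]`, this reports id 2 as non-inline and id 1 as inline without consulting anything but the brackets and braces.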

And yeah, start with whatever seems easiest.

junaidiiith commented 8 years ago

Unhammer, now the deformatter works like this.

Input:

hello brother how are you doing?

Output:

`[<1>
<2> ][{<3>}]hello brother[ ][{<4>}]how[ ][{<5><6>}]are[ ][{<7>}][ ]you[ ][{<8><9>}]doing?[ ][<\/p>][ ][<\/div>]`

Input:

foo bar

Output:

`[<1>][{<2><3>}]foo[{<4>}][ ]bar[<\/p>]`

The reformatter works accordingly. For every id I have stored its tag name and the complete string, with attributes, that has to be printed. For example: `2="p,p class = "some class" id = "some id"`. Now I can easily find the tag name and its related information. Does that seem correct to you?

unhammer commented 8 years ago

Considering we want it to work for many different formats, I would drop the <> and just comma-separate the inline-blank numbers (so it looks the same whether it's html or rtf or latex or markdown etc.).

And since we don't want to alter non-inline blanks at all, there's really no use in "parsing" them, we can just store them exactly as they are and merge consecutive ones. This would give:

 [1] [{2}]hello brother[ ][{3}]how[ ][{4,5}]are [][{6}] you [][{7,8}]doing?[] [9]

Here I've just used pairs of strings for inlines, and simple strings for non-inlines. Note how 9 is two close-tags: this means the reformatter only has to care about closing inline tags, and doesn't need to know whether it's html or what.
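A reformatter built on that idea could look roughly like the sketch below. The tables `INLINE` and `NON_INLINE` and their contents are hypothetical stand-ins for the stand-off data, and the rule that any blank ends the current inline span is an assumption about the stream semantics, not something confirmed in this thread.

```python
import re

# Hypothetical stand-off data: inline ids map to (open, close) pairs,
# non-inline ids map to one string that is emitted untouched.
INLINE = {
    2: ("<i>", "</i>"),
    3: ("<u>", "</u>"),
    4: ("<b>", "</b>"),
}
NON_INLINE = {
    1: "<div><p>",
    9: "</p></div>",  # two close tags stored as one merged blank
}

# Inline reference, non-inline reference, or a plain blank such as '[ ]'.
TOKEN = re.compile(r"\[\{(\d+(?:,\d+)*)\}\]|\[(\d+)\]|\[([^\]]*)\]")

def reformat(stream):
    out = []
    open_ids = []  # inline ids whose close strings are still pending

    def close_inline():
        for ref in reversed(open_ids):
            out.append(INLINE[ref][1])
        open_ids.clear()

    pos = 0
    for match in TOKEN.finditer(stream):
        out.append(stream[pos:match.start()])  # text before this blank
        pos = match.end()
        close_inline()  # assumption: any blank ends the inline span
        inline_ids, non_inline_id, literal = match.groups()
        if inline_ids is not None:
            open_ids.extend(int(i) for i in inline_ids.split(","))
            out.extend(INLINE[i][0] for i in open_ids)
        elif non_inline_id is not None:
            out.append(NON_INLINE[int(non_inline_id)])
        else:
            out.append(literal)
    out.append(stream[pos:])
    close_inline()
    return "".join(out)
```

The function never inspects the stored strings, so the same loop would work if the tables held rtf or latex markup instead of html.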

junaidiiith commented 8 years ago

I have modified the deformatter. It now takes the input in the following manner and stores the id, tag_attributes and closing tag in a sqlite3 database, `tags_data.db`, which is regenerated every time the deformatter is run.
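The storage step might be sketched as follows. The table name `tags`, its column layout and the helper names are assumptions, since the actual schema of tags_data.db is not shown here; an in-memory database is used so the example leaves no file behind.

```python
import sqlite3

def store_tags(conn, tags):
    """Recreate the table on every run, then insert (id, open, close) rows."""
    conn.execute("DROP TABLE IF EXISTS tags")
    conn.execute(
        "CREATE TABLE tags ("
        "id INTEGER PRIMARY KEY, "
        "tag_attributes TEXT NOT NULL, "
        "closing_tag TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", tags)
    conn.commit()

def lookup(conn, ref):
    """Return (tag_attributes, closing_tag) for an id, or None if unknown."""
    return conn.execute(
        "SELECT tag_attributes, closing_tag FROM tags WHERE id = ?", (ref,)
    ).fetchone()

# Hypothetical rows; a real run would pass a file path instead of ":memory:".
conn = sqlite3.connect(":memory:")
store_tags(conn, [(1, "<i>", "</i>"), (2, '<u style="italic">', "</u>")])
```

A reformatter would then call `lookup(conn, 2)` whenever it meets `[{2}]` in the stream.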

Input: `<div id="someid"> <p class="some class" id="some id"> <i>hello brother</i> <u style="italic">how <b>are </b> you <em>doing?</em></u> </p> </div>`

Output: `[8][{1}]hello brother[ ][{2}]how[ ][{3,4}]are[ ][{5}][ ]you[ ][{6,7}]doing?[9]`

Commands to build and run the deformatter:

`g++ deformatter.cpp -I/usr/include/libxml2 -lxml2 -std=c++11 -lsqlite3 -o def`
`./def input.html`

Commands to build and run the reformatter:

`g++ reformatter.cpp -lsqlite3 -std=c++11 -o reformat`
`./reformat deformatter_output.txt database_file_generated_by_deformatter`