Open unhammer opened 8 years ago
Yes, we can do it both ways. In fact, using just the 'ids' will be simpler to handle, as we can just make a pair of 'id' and 'type' (inline or non-inline) and put it in the tags. Should I do it?
You don't need a `type=(inline|non-inline)` in the stand-off document, the stream format gives you that. If you see `[2]` then you know it's non-inline, if you see `[{1}]` then you know it's inline; the stand-off document just needs a mapping from ids to content:
1 = <em class="bar">
2 = <div id="foo"><p><p style="horrible-word-html: 110%"><br/>
And yeah, start with whatever seems easiest.
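To illustrate the point that the stream format alone tells you whether a blank is inline, here is a minimal Python sketch. The function name `blanks` and the regex are my own illustration, not the actual deformatter code, and it assumes blanks never contain escaped `]` or `}` characters.

```python
import re

# "[{...}]" is an inline blank, any other "[...]" is a non-inline blank.
TOKEN = re.compile(r"\[\{([^}]*)\}\]|\[([^\]]*)\]")

def blanks(stream):
    """Return (kind, content) for every blank found in the stream."""
    found = []
    for match in TOKEN.finditer(stream):
        if match.group(1) is not None:
            found.append(("inline", match.group(1)))
        else:
            found.append(("non-inline", match.group(2)))
    return found

# The stand-off document is then only an id -> content mapping.
mapping = {"1": '<em class="bar">', "2": '<div id="foo">'}

for kind, ref in blanks("[2]hello [{1}]world"):
    print(kind, ref, mapping[ref])
```

No `type` field is stored anywhere; the kind falls out of the bracket shape.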
Unhammer, now the deformatter works like this. Input:
`hello brother how are you doing?`
Output:
`[<1>
<2>
][{<3>}]hello brother[
][{<4>}]how[
][{<5><6>}]are[ ][{<7>}][ ]you[
][{<8><9>}]doing?[
][<\/p>][
][<\/div>]`
Input:
`foo bar`
Output:
`[<1>][{<2><3>}]foo[{<4>}][ ]bar[<\/p>]`
The reformatter works accordingly. For every id I have stored its tag name and the complete string that has to be printed with attributes. For example: `2="p,p class = "some class" id = "some id"`. Now I can easily find the tag name and its related information. Does it seem correct to you?
Considering we want it to work for many different formats, I would drop the `<>` and just comma-separate the inline-blank numbers (so it looks the same whether it's html or rtf or latex or markdown etc.).
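As a rough sketch of that rewrite (the function name `comma_separate` is made up for illustration, and it only touches inline blanks of the form `[{<5><6>}]`):

```python
import re

# Matches an inline blank whose content is one or more "<N>" references.
INLINE = re.compile(r"\[\{((?:<\d+>)+)\}\]")

def comma_separate(stream):
    """Rewrite [{<5><6>}] as [{5,6}], leaving everything else untouched."""
    def rewrite(match):
        ids = re.findall(r"\d+", match.group(1))
        return "[{" + ",".join(ids) + "}]"
    return INLINE.sub(rewrite, stream)

print(comma_separate("[{<5><6>}]are[ ][{<7>}][ ]you"))
```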
And since we don't want to alter non-inline blanks at all, there's really no use in "parsing" them, we can just store them exactly as they are and merge consecutive ones. This would give:
[1] [{2}]hello brother[ ][{3}]how[ ][{4,5}]are [][{6}] you [][{7,8}]doing?[] [9]
1 = <div id="someid"> <p class="some class" id="some id">
2 = (<i>, </i>)
3 = (<u style="italic">, </u>)
4 = (<u style="italic">, </u>)
5 = (<b>, </b>)
6 = (<u style="italic">, </u>)
7 = (<u style="italic">, </u>)
8 = (<em>, </em>)
9 = </p> </div>
Here I've just used pairs of strings for inline's, and simple strings for non-inline's. Note how 9 is two close-tags - this means the reformatter only has to care about closing inline tags, and doesn't need to know whether it's html or what.
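A reformatter over this representation could look roughly like the Python sketch below. It is only an illustration under assumptions that are mine, not settled in this thread: inline tags stay open until the next blank of any kind, a non-inline blank containing only digits is a reference, any other blank content is printed as-is, and escaped brackets are not handled.

```python
import re

BLANK = re.compile(r"\[(\{[^}]*\}|[^\]]*)\]")

def reformat(stream, mapping):
    """Turn a stand-off stream back into markup using the id mapping."""
    out = []
    closers = []  # closing strings of the inline tags that are currently open
    pos = 0
    for match in BLANK.finditer(stream):
        out.append(stream[pos:match.start()])  # text before this blank
        pos = match.end()
        while closers:  # any blank ends the previous inline span
            out.append(closers.pop())
        body = match.group(1)
        if body.startswith("{"):
            for ref in filter(None, body[1:-1].split(",")):
                opener, closer = mapping[int(ref)]
                out.append(opener)
                closers.append(closer)
        elif body.isdigit():
            out.append(mapping[int(body)])  # non-inline: stored verbatim
        else:
            out.append(body)  # literal whitespace or an empty blank
    out.append(stream[pos:])
    while closers:
        out.append(closers.pop())
    return "".join(out)

U = ('<u style="italic">', "</u>")
mapping = {
    1: '<div id="someid"> <p class="some class" id="some id">',
    2: ("<i>", "</i>"), 3: U, 4: U, 5: ("<b>", "</b>"),
    6: U, 7: U, 8: ("<em>", "</em>"),
    9: "</p> </div>",
}
stream = ("[1] [{2}]hello brother[ ][{3}]how[ ][{4,5}]are [][{6}] you []"
          "[{7,8}]doing?[] [9]")
print(reformat(stream, mapping))
```

The only format knowledge in here is "pairs get closed, strings get printed", so the same loop should work whether the mapping holds html, rtf or latex.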
I have modified the deformatter. Now it takes the input in the following manner and stores the id, tag_attributes and closing tag in a sqlite3 database, `tags_data.db`, which is regenerated every time the deformatter is run.
input:
<div id="someid"> <p class="some class" id="some id"> <i>hello brother</i> <u style="italic">how <b>are </b> you <em>doing?</em></u> </p> </div>
Output:
[8][{1}]hello brother[ ][{2}]how[ ][{3,4}]are[ ][{5}][ ]you[ ][{6,7}]doing?[9]
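For readers who want to picture the storage step, here is a small Python sketch of an id / tag_attributes / closing tag table. The table name, column names and helper functions are hypothetical, the real schema in `tags_data.db` may differ, and the sketch uses an in-memory database rather than a file.

```python
import sqlite3

def store_tags(conn, rows):
    """Create the (hypothetical) tags table and store (id, open, close) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        "id INTEGER PRIMARY KEY, "
        "tag_attributes TEXT NOT NULL, "
        "closing_tag TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO tags VALUES (?, ?, ?)", rows)
    conn.commit()

def lookup(conn, tag_id):
    """Return (tag_attributes, closing_tag) for an id, or None if unknown."""
    cursor = conn.execute(
        "SELECT tag_attributes, closing_tag FROM tags WHERE id = ?", (tag_id,)
    )
    return cursor.fetchone()

conn = sqlite3.connect(":memory:")
store_tags(conn, [(1, "<i>", "</i>"), (2, '<u style="italic">', "</u>")])
print(lookup(conn, 2))
```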
Command to run the deformatter:
g++ deformatter.cpp -I/usr/include/libxml2 -lxml2 -std=c++11 -lsqlite3 -o def
./def input.html
To run the reformatter:
g++ reformatter.cpp -lsqlite3 -std=c++11 -o reformat
./reformat deformatter_output.txt database_file_generated_by_deformatter
Currently, the output of the html deformatter is a mix between stand-off and in-document markup, e.g.
`[<p1> ][{<i2>}]de jour[]? [<div3>]Yes[<\/div> <\/p>]`
uses "p" but hides attributes. I think it should be one or the other, i.e. either completely stand-off
`[1][{2}]de jour[]?[3]Yes[4]`
or completely in-document
`[<p id="foo"> ][{<i ng-click="x">}]de jour[]? [<div class="meh">]Yes[<\/div> <\/p>]`
I prefer in-document, since it doesn't require keeping temporary files around, which can be handy when developing (you can just pipe a single stream around), although I see the value in stand-off as well (easier to focus on content when seeing just part of the stream, for one).
Perhaps we could have both? E.g. the default is in-document as in apertium-deshtml, while if you run
`deshtml2 -s markup.txt < in.html >out.html`
then `markup.txt` contains definitions of references, while `out.html` is stand-off annotated.
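To make the "both" idea concrete, here is a rough Python sketch of turning an in-document stream into a stand-off stream plus a table of references, which is roughly what a `-s` mode would have to do. The function name `to_standoff` is made up, the sketch numbers every non-empty blank in order of appearance (including whitespace-only ones), and it assumes blanks contain no escaped `]` or `}`.

```python
import re

BLANK = re.compile(r"\[(\{[^}]*\}|[^\]]*)\]")

def to_standoff(stream):
    """Replace blank contents with numeric references; return (stream, refs)."""
    refs = {}

    def number(match):
        body = match.group(1)
        if body == "":
            return "[]"  # empty blanks carry no markup, keep them as-is
        ref = len(refs) + 1
        if body.startswith("{"):
            refs[ref] = body[1:-1]  # inline: store what is inside the braces
            return "[{%d}]" % ref
        refs[ref] = body  # non-inline: store the content verbatim
        return "[%d]" % ref

    return BLANK.sub(number, stream), refs

in_document = (r'[<p id="foo"> ][{<i ng-click="x">}]de jour[]? '
               r'[<div class="meh">]Yes[<\/div> <\/p>]')
standoff, refs = to_standoff(in_document)
print(standoff)
for ref, content in refs.items():
    print(ref, "=", content)  # the lines that would go into markup.txt
```

Since the in-document stream is the input here, the default mode stays a single pipeable stream and the stand-off files only appear when asked for.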