tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Confirm correctness of character offsets #15

Closed marcverhagen closed 7 years ago

marcverhagen commented 8 years ago

Looking at the output example in docs/design/1-toplevel.html, which was created from the following input

<?xml version="1.0" ?>
<text>She sleeps on Friday.</text>

it appears that the start and end of the doc_element are wrong, since we have <text id="1" begin="1" end="22" /> and <doc_element type="TarsqiDocParagraph" begin="0" end="23">. Given that the text tag spans all text, you would expect the doc_element to fall inside the text tag rather than extend one character beyond it on each side.

But note that even though the text tag in the input seems to span all text, it still starts at position 1 in the file without the tags, because of the newline character after the XML declaration, and it ends one character before the end, because of the newline after the closing text tag.
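The arithmetic can be checked with a small Python sketch. This is not TTK code: the function name strip_tags and the regex-based tag matching are assumptions made for illustration only, and the sketch handles neither self-closing tags nor a ">" inside attribute values. It uses the 21-character sentence "She sleeps on Friday.", which is the one the reported offsets correspond to.

```python
import re

# Input as described in this issue, including the newline after the XML
# declaration and the newline after the closing text tag.
SOURCE = '<?xml version="1.0" ?>\n<text>She sleeps on Friday.</text>\n'


def strip_tags(xml):
    """Remove all markup and return (text, spans).

    spans is a list of (tag_name, begin, end) tuples whose offsets are
    measured in the stripped text, not in the original file.
    Hypothetical helper, not part of the Tarsqi Toolkit.
    """
    pieces = []      # chunks of character data between tags
    length = 0       # number of characters of stripped text so far
    open_tags = []   # stack of (name, begin_offset) for unclosed tags
    spans = []
    pos = 0
    for match in re.finditer(r'<[^>]+>', xml):
        chunk = xml[pos:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        pos = match.end()
        tag = match.group()
        if tag.startswith('<?'):
            # XML declaration or processing instruction: no offsets recorded
            continue
        if tag.startswith('</'):
            name, begin = open_tags.pop()
            spans.append((name, begin, length))
        else:
            name = tag[1:-1].split()[0]
            open_tags.append((name, length))
    pieces.append(xml[pos:])  # character data after the last tag
    return ''.join(pieces), spans


text, spans = strip_tags(SOURCE)
print(repr(text))   # the stripped text, with both newlines preserved
print(spans)        # offsets of the text element in the stripped text
print(len(text))    # total length, i.e. the end offset of the doc_element
```

Under these assumptions the text element comes out as (1, 22) and the stripped text has length 23, which matches the begin and end values quoted above.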

Here is a fragment from the output:

<text>
She sleeps on Friday.
</text>

The problem is that the text tags in the input and output are different things. Yet the text tag in the input is the same as the <text id="1" begin="1" end="22" /> in the output.

This is a bit confusing. I wonder if I should add something to this effect in the documentation, or use a less common name than text for the tag that spans all source text, perhaps something like <primary_data>.

marcverhagen commented 7 years ago

This is all okay, but an explanation was added to docs/notes/offset-confusion.md.