Fix syntax annotation and add syntactic movement support (through alignments/relations) (T062)

luutuntin commented 5 years ago

Hi Maarten,

I'm exploring FLAT for our annotation project. Opening this example in FLAT, I can access a variety of annotation types (see the first attached picture); but when I choose "Syntactic Unit -- syntax-set" and click the "Add" button, nothing happens (see the second attached picture). Am I missing something?

Best, Alex

PS. I did try to create a FoLiA file with two declared annotation types, POS and Syntax, and encountered the same problem: I can add POS annotations but not Syntax ones.

flat_cannot_add_a_syntactic_unit_annotation

proycon commented 5 years ago

Hi Alex,

That sounds like a bug indeed; a new field should appear for syntax annotation when you click the plus button. I'll see if I can reproduce it and fix it (might take a few days).

proycon commented 5 years ago

Bug reproduces with javascript error:

TypeError: structure[structure_id].spanannotations is undefined flat.js:206:9
forspanannotations
http://mhysa:8080/flat/style/flat.js:206:9
renderparentspanfield
http://mhysa:8080/flat/style/flat.editor.js:510:5
addeditorfield
http://mhysa:8080/flat/style/flat.editor.js:1014:17
editor_oninit/<
http://mhysa:8080/flat/style/flat.editor.js:2121:9
dispatch
http://mhysa:8080/flat/style/jquery-3.1.0.min.js:3:9870
add/q.handle
http://mhysa:8080/flat/style/jquery-3.1.0.min.js:3:7932

luutuntin commented 5 years ago

Yes, I receive the same javascript errors:

flat_javascript_errors_edited

proycon commented 5 years ago

OK, this should be fixed now in v0.7.14.

luutuntin commented 5 years ago

Thank you so much. I just updated to this version and do not encounter this bug.

luutuntin commented 5 years ago

I have another question: what is the best way to handle syntactic movement annotation (i.e. inserting a null element that is co-indexed with another linguistic material in a sentence) in FLAT?

proycon commented 5 years ago

Could you give a specific example of such an annotation perhaps?

luutuntin commented 5 years ago

Here is an example from Syntactic annotation manual for AAPCAppE: syntax_tree We can see that the wh-noun phrase WNP-1, containing What, is co-indexed with the empty trace element *T*-1 under the noun phrase NP-PRD; similarly, the verb be (present) BEP-2, containing is, is co-indexed with the empty element *-2 under the verb be (present) BEP.

luutuntin commented 5 years ago

There is another bug that is demonstrated in the following example:

Task: annotate the syntactic structure of the sentence "They brought the documents on Tuesday ."

First, assign clause label IP-MAT to the whole sentence:

Next, assign phrase label NP-SBJ to "They" and set its parent span to IP-MAT:

Next, assign phrase label VP to "brought the documents on Tuesday" and also set its parent span to IP-MAT:

After this step, the annotation for "They" was unexpectedly excluded from the syntactic annotation:

proycon commented 5 years ago

Thanks for the elaborate report, that indeed looks like another bug (syntactic annotation hasn't been too widely used yet so I'm afraid you're kind of the guinea pig in this, sorry). I'll investigate and fix it soon!

(I'll get back to you on the syntactic movement issue as well, that may prove challenging in the current setup)

luutuntin commented 5 years ago

I really like the FoLiA paradigm and would love to use FLAT for our project. For syntactic movement annotation, the simplest solution we are thinking of is just inserting the trace symbols into the text and then assigning syntactic labels to them as to normal tokens. As I understand, we can do this kind of insertion using FLAT, right? Again, thank you so much for your great support.

proycon commented 5 years ago

Yes, that simple solution would indeed be a decent workaround and should work right away (after I fix the bug you reported). Perhaps in the underlying FoLiA tokenisation you can also mark such pseudo/trace elements by assigning a special class to the word elements (<w>, as you probably know you can make up any vocabulary which is what I'm doing here too with normal and empty). I worked out your example sentence:

<s xml:id="s1">
    <w xml:id="w1" class="normal">
        <t>What</t>
        <pos class="WPRO" />
    </w>
    <w xml:id="w2" class="normal">
        <t>is</t>
        <pos class="BEP" />
    </w>
    <w xml:id="w3" class="normal">
        <t>your</t>
        <pos class="PRO$" />
    </w>
    <w xml:id="w4" class="normal">
        <t>name</t>
        <pos class="N" />
    </w>
    <w xml:id="w.bep-2" class="empty">
        <t>*-2</t>
    </w>
    <w xml:id="w.wnp-1" class="empty">
        <t>*T*-1</t>
    </w>
    <w xml:id="w5" class="normal">
        <t>?</t>
        <pos class="PUNC" />
    </w>
    <syntax>
        <su class="CP-QUE-MAT">
            <su xml:id="s1.WNP-1" class="WNP">
                <wref id="w1" />
            </su>
            <su class="IP-SUB">
                <su xml:id="s1.BEP-2" class="BEP">
                    <wref id="w2" />
                </su>
                <su class="NP-SBJ">
                    <su class="NP-POS">
                        <wref id="w3" />
                    </su>
                    <wref id="w4" />
                </su>
                <su class="VP">
                    <su class="BEP">
                        <wref id="w.bep-2" />
                    </su>
                    <su class="NP-PRD">
                        <wref id="w.wnp-1" />
                    </su>
                </su>
            </su>
            <su class="PUNC">
                <wref id="w5" />
            </su>
        </su>
    </syntax>
</s>

I'm not really a fan of using extra words/tokens (<w>) and text content (<t>) for something that technically is empty, but I admit it would be the most practical workaround.
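A minimal sketch of how a script could tell the pseudo/trace tokens apart from the real text in this workaround, assuming the class names "normal" and "empty" used above. It works on a shortened, un-namespaced copy of the sentence with Python's standard `xml.etree.ElementTree`; a real FoLiA document carries a default namespace, and the helper name `split_tokens` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, un-namespaced copy of the example sentence (assumption: the
# classes "normal" and "empty" are an ad-hoc vocabulary, as in the example).
SENTENCE = """
<s xml:id="s1">
    <w xml:id="w1" class="normal"><t>What</t></w>
    <w xml:id="w2" class="normal"><t>is</t></w>
    <w xml:id="w.bep-2" class="empty"><t>*-2</t></w>
    <w xml:id="w.wnp-1" class="empty"><t>*T*-1</t></w>
    <w xml:id="w5" class="normal"><t>?</t></w>
</s>
"""


def split_tokens(sentence_xml, empty_class="empty"):
    """Return (real_tokens, empty_tokens) as lists of (id, text) pairs."""
    root = ET.fromstring(sentence_xml)
    real, empty = [], []
    for w in root.iter("w"):
        pair = (w.get(XML_ID), w.findtext("t"))
        if w.get("class") == empty_class:
            empty.append(pair)
        else:
            real.append(pair)
    return real, empty


real_tokens, empty_tokens = split_tokens(SENTENCE)
# The original text can be recovered by ignoring the empty tokens.
original_text = " ".join(text for _, text in real_tokens)
```

Filtering on the class keeps the original text recoverable even though the trace symbols live in the tokenisation layer.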

I'm also thinking about what the most elegant representation would be from a FoLiA perspective. I'm not very knowledgeable on syntactic movement, but I guess these trace elements should ideally not be expressed in the tokenisation layer but only as part of the syntax tree? The notion of co-indexed should then also be expressed explicitly rather than conventionally, which could be done with FoLiA's alignments (basically higher-order references). I came up with something like this then:

<s xml:id="s1">
    <w xml:id="w1">
        <t>What</t>
        <pos class="WPRO" />
    </w>
    <w xml:id="w2">
        <t>is</t>
        <pos class="BEP" />
    </w>
    <w xml:id="w3">
        <t>your</t>
        <pos class="PRO$" />
    </w>
    <w xml:id="w4">
        <t>name</t>
        <pos class="N" />
    </w>
    <w xml:id="w5">
        <t>?</t>
        <pos class="PUNC" />
    </w>
    <syntax>
        <su class="CP-QUE-MAT">
            <su xml:id="s1.WNP-1" class="WNP">
                <wref id="w1" />
            </su>
            <su class="IP-SUB">
                <su xml:id="s1.BEP-2" class="BEP">
                    <wref id="w2" />
                </su>
                <su class="NP-SBJ">
                    <su class="NP-POS">
                        <wref id="w3" />
                    </su>
                    <wref id="w4" />
                </su>
                <su class="VP">
                    <su class="BEP">
                        <alignment class="A-movement">
                            <aref id="s1.BEP-2" type="su"/>
                        </alignment>
                    </su>
                    <su class="NP-PRD">
                        <alignment class="Wh-movement">
                            <aref id="s1.WNP-1" type="su"/>
                        </alignment>
                    </su>
                </su>
            </su>
            <su class="PUNC">
                <wref id="w5" />
            </su>
        </su>
    </syntax>
</s>

This looks much cleaner to me than the workaround, though it's currently impossible to do in FLAT and would demand an extension (or we could solve it in a postprocessing conversion script, though less elegant). What do you think?
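The postprocessing conversion script mentioned above could look roughly like the following sketch. It is only an illustration under several assumptions: the input is un-namespaced like the examples here, the co-index is read from the trailing "-N" of both the trace text (e.g. `*T*-1`) and the antecedent's `xml:id` (e.g. `s1.WNP-1`), and a single placeholder class "movement" is used because the script cannot know whether a trace is A-movement or Wh-movement. The function name `traces_to_alignments` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
COINDEX = re.compile(r"-(\d+)$")  # trailing co-index, e.g. "-1"

WORKAROUND = """
<s xml:id="s1">
    <w xml:id="w1" class="normal"><t>What</t></w>
    <w xml:id="w.wnp-1" class="empty"><t>*T*-1</t></w>
    <syntax>
        <su class="CP-QUE-MAT">
            <su xml:id="s1.WNP-1" class="WNP"><wref id="w1" /></su>
            <su class="NP-PRD"><wref id="w.wnp-1" /></su>
        </su>
    </syntax>
</s>
"""


def traces_to_alignments(sentence_xml, movement_class="movement"):
    """Replace references to empty trace words by alignments to their antecedent."""
    root = ET.fromstring(sentence_xml)

    # Co-index -> id of the antecedent <su>.
    antecedents = {}
    for su in root.iter("su"):
        match = COINDEX.search(su.get(XML_ID, ""))
        if match:
            antecedents[match.group(1)] = su.get(XML_ID)

    # Id of each empty word -> its co-index; drop those words from the tokens.
    trace_index = {}
    for w in root.findall("w"):
        match = COINDEX.search(w.findtext("t", ""))
        if w.get("class") == "empty" and match and match.group(1) in antecedents:
            trace_index[w.get(XML_ID)] = match.group(1)
            root.remove(w)

    # Swap each <wref> to a removed trace word for an <alignment>/<aref>.
    for su in list(root.iter("su")):
        for pos, child in enumerate(list(su)):
            index = trace_index.get(child.get("id"))
            if child.tag == "wref" and index is not None:
                alignment = ET.Element("alignment", {"class": movement_class})
                ET.SubElement(alignment, "aref",
                              {"id": antecedents[index], "type": "su"})
                alignment.tail = child.tail
                su.remove(child)
                su.insert(pos, alignment)
    return root


converted = traces_to_alignments(WORKAROUND)
```

Empty words without a matching antecedent are left untouched by this sketch, since there is nothing to align them to.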

Thanks for considering FLAT for your project! :) It would indeed be great if it can be applied to your task and is a great test-run for syntax annotation for us as well.

luutuntin commented 5 years ago

Yes, the second solution is much cleaner and well demonstrates FoLiA's power. I just wonder what the corresponding syntactic structure looks like. Something like this? syntax_tree_alignment

As per our plan, we would love to have a stable tool for intensive syntax annotation next summer (i.e. at the beginning of May, 2019). (Currently, we are working hard on developing the annotation guidelines.) Can this go along with your work on the extension? In any case, thoroughly handling movement annotation is essential for the syntax layer.

Regarding the first solution, I tried adding the annotation to my FoLiA file and received the following tree: syntax_tree_insertion We can see that whenever there is a part-of-speech label branch (i.e. a branch whose root node is a part-of-speech label) going out from a syntactic label node, it will not be shown in Tree Viewer. Is this a bug?

Another concern is that we don't know which pseudo/trace elements we want to insert until we do the syntax annotation, and therefore we cannot insert them at tokenisation stage. I haven't figured out how to add new tokens using FLAT. Can you give me a hint?

proycon commented 5 years ago

I'm debugging the original issue with the disappearing 'They' and can confirm this indeed goes wrong (just documenting this mostly for my own fixing process, I'll answer the other questions in a separate comment):

2018-10-17 20:30:17 - [QUERY ON flat/issue138] ADD su OF test WITH class "IP-MAT" annotator "flat" annotatortype "manual" datetime now confidence NONE FOR SPAN ID issue138.p.1.s.1.w.1 & ID issue138.p.1.s.1.w.2 & ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 & ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FORMAT flat RETURN ancestor-target

2018-10-17 20:38:41 - [QUERY ON flat/issue138] EDIT su ID issue138.text.su.1 WITH class "IP-MAT" annotator "flat" annotatortype "manual" datetime now confidence NONE RESPAN ID issue138.p.1.s.1.w.2 & ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 & ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FOR ID issue138.p.1.s.1.w.2 , ID issue138.p.1.s.1.w.3 , ID issue138.p.1.s.1.w.4 , ID issue138.p.1.s.1.w.5 , ID issue138.p.1.s.1.w.6 , ID issue138.p.1.s.1.w.7 FORMAT flat RETURN ancestor-focus

2018-10-17 20:38:41 - [QUERY ON flat/issue138] ADD su OF test WITH class "NP-SBJ" annotator "flat" annotatortype "manual" datetime now confidence NONE SPAN ID issue138.p.1.s.1.w.1 FOR ID issue138.text.su.1 FORMAT flat RETURN ancestor-target

2018-10-17 20:40:35 - [QUERY ON flat/issue138] EDIT su ID issue138.text.su.1 WITH class "IP-MAT" annotator "flat" annotatortype "manual" datetime now confidence NONE FORMAT flat RETURN ancestor-focus

2018-10-17 20:40:35 - [QUERY ON flat/issue138] ADD su OF test WITH class "VP" annotator "flat" annotatortype "manual" datetime now confidence NONE SPAN ID issue138.p.1.s.1.w.2 & ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 & ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FOR ID issue138.text.su.1 FORMAT flat RETURN ancestor-target

The second query seems the culprit and shouldn't have been formed (the fourth also not).

Additionally, after doing this I end up with an inconsistency in the front end when hovering over the annotation: uncaught exception: Error, unable to sort targets, expected 13, got 7, sameparent=issue138.p.1.s.1. It seems there are duplicate targets because of the nesting.

stack trace:

* sort_targets flat.js:755
* getspantext flat.viewer.js:232

Todo:

[x] (a) Fix getspantext() and sort_targets() to handle nested span elements
~(b) Fix detection of changes in case of multiple span elements, prevent erroneous respan queries~
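To illustrate item (a): the sketch below shows one way duplicate targets from nested spans could be collapsed before sorting. It is written in Python for brevity and is not FLAT's actual JavaScript; the dict layout with `targets` and `children` is an assumed stand-in for the front end's span representation.

```python
def collect_targets(span):
    """Flatten a nested span into a list of token ids (duplicates included)."""
    ids = list(span.get("targets", []))
    for child in span.get("children", []):
        ids.extend(collect_targets(child))
    return ids


def sort_targets(span, document_order):
    """Return the unique targets of a nested span in document order."""
    position = {token_id: i for i, token_id in enumerate(document_order)}
    unique = set(collect_targets(span))
    return sorted(unique, key=position.__getitem__)


order = ["w.%d" % i for i in range(1, 8)]
# Hypothetical state: the parent still lists all tokens while its children
# list them again, so the same id shows up more than once.
ip_mat = {
    "targets": list(order),
    "children": [
        {"targets": ["w.1"]},
        {"targets": order[1:]},
    ],
}
sorted_targets = sort_targets(ip_mat, order)
```

Deduplicating before the sort means the expected number of targets matches the number of distinct tokens, whatever the nesting depth.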

proycon commented 5 years ago

Yes, the second solution is much cleaner and well demonstrates FoLiA's power. I just wonder what the corresponding syntactic structure looks like. Something like this?

That sounds about right yes. Whether I can visualize the alignments as nicely in the syntax tree viewer remains to be seen though.

We can see that whenever there is a part-of-speech label branch (i.e. a branch whose root node is a part-of-speech label) going out from a syntactic label node, it will not be shown in Tree Viewer. Is this a bug?

Not really a bug as such, the visualisation was only designed to represent syntactic annotation. But inclusion of part of speech tags makes sense. (I was a bit unsure whether to represent certain parts as PoS or syntactic unit (or even both) when translating your example).

Another concern is that we don't know which pseudo/trace elements we want to insert until we do the syntax annotation, and therefore we cannot insert them at tokenisation stage. I haven't figured out how to add new tokens using FLAT. Can you give me a hint?

True, that is indeed problematic in the workaround approach and makes it less ideal. FLAT doesn't really do structure editing (adding words/sentences/etc) yet (this has long been planned in #5) and focusses mostly on annotation. Perhaps it's best to focus on the more elegant solution (with alignments, planned also in #84).

As per our plan, we would love to have a stable tool for intensive syntax annotation next summer (i.e. at the beginning of May, 2019). (Currently, we are working hard on developing the annotation guidelines.) Can this go along with your work on the extension? In any case, thoroughly handling movement annotation is essential for the syntax layer.

It sounds feasible yes, I think I should be able to implement the necessary extensions and bugfixes in the coming two/three months

proycon commented 5 years ago

Further debugging:

Somehow I did end up in a correct state despite the respan error:

Screenshot

When adding a V under VP the representation seemed fine but the tree visualizer couldn't visualize it properly (perhaps due to one child being a su and the others wref), so this might be a new subissue:

[x] (c) Tree visualisation can't deal with heterogenous children (su + wref)

2018-10-18 11:09:35 - [QUERY ON flat/issue138] EDIT su ID issue138.text.su.3 WITH class "VP" annotator "flat" annotatortype "manual" datetime now confidence NONE RESPAN ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 & ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FORMAT flat RETURN ancestor-focus ~(this one is wrong again)~

2018-10-18 11:09:35 - [QUERY ON flat/issue138] ADD su OF test WITH class "V" annotator "flat" annotatortype "manual" datetime now confidence NONE SPAN ID issue138.p.1.s.1.w.2 FOR ID issue138.text.su.3 FORMAT flat RETURN ancestor-target

Screenshot

When trying to add a NP I again lost a word (due to the first query) which again seems an instance of subissue b, and I ended up with this mess:

2018-10-18 11:12:38 - [QUERY ON flat/issue138] EDIT su ID issue138.text.su.3 WITH class "VP" annotator "flat" annotatortype "manual" datetime now confidence NONE RESPAN ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FORMAT flat RETURN ancestor-focus ~wrong again!~

2018-10-18 11:12:38 - [QUERY ON flat/issue138] ADD su OF test WITH class "NP" annotator "flat" annotatortype "manual" datetime now confidence NONE SPAN ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 FOR ID issue138.text.su.3 FORMAT flat RETURN ancestor-target

Screenshot

proycon commented 5 years ago

Those respans I said were wrong actually are not wrong! I first respan the parent so it doesn't include the wref anymore, as that wref will be covered by the su of the child and we don't want duplicates. That's why we get an EDIT..RESPAN query followed by an ADD query every time.
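The respan logic described here boils down to a set difference. A minimal sketch, with made-up function and variable names and plain token ids instead of full FoLiA ids:

```python
def respan_parent(parent_span, child_span):
    """Tokens the parent keeps as direct targets once the child covers its own."""
    covered = set(child_span)
    return [token for token in parent_span if token not in covered]


sentence = ["w.1", "w.2", "w.3", "w.4", "w.5", "w.6", "w.7"]

# Adding NP-SBJ on "They" (w.1): the parent is respanned to w.2 .. w.7,
# which corresponds to the EDIT..RESPAN query preceding the ADD query.
after_np_sbj = respan_parent(sentence, ["w.1"])

# Adding VP on the rest: nothing is left as a direct target of the parent.
after_vp = respan_parent(after_np_sbj, after_np_sbj)
```

The second call leaves the parent with no direct targets at all, so an empty respan has to be a valid outcome.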

proycon commented 5 years ago

After adding a syntactic unit on the first word, the order is wrong:

<syntax>
 <su annotator="flat" annotatortype="manual" class="S" datetime="2018-10-18T12:56:21" xml:id="issue138.text.su.1">
   <wref id="issue138.p.1.s.1.w.2" t="brought"/>
   <wref id="issue138.p.1.s.1.w.3" t="the"/>
   <wref id="issue138.p.1.s.1.w.4" t="documents"/>
   <wref id="issue138.p.1.s.1.w.5" t="on"/>
   <wref id="issue138.p.1.s.1.w.6" t="Tuesday"/>
   <wref id="issue138.p.1.s.1.w.7" t="."/>
   <su annotator="flat" annotatortype="manual" class="PRON" datetime="2018-10-18T12:56:21" xml:id="issue138.text.su.2">
       <wref id="issue138.p.1.s.1.w.1" t="They"/>
   </su>
 </su>
</syntax>

(I thought this might explain subissue (c) in the tree visualisation going wrong, but no, that also goes wrong if the order is correct)

Todo:

[x] (d) Insertion point of ADD queries should be computed in case of nested span elements (su)
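One possible way to compute such an insertion point, sketched in Python under the assumption that children are either token ids (standing in for wref) or `(label, children)` tuples (standing in for a nested su), and that every nested unit covers at least one token. This is an illustration, not the code used in FLAT or the document server.

```python
def first_position(node, position):
    """Document position of the first token covered by a node."""
    if isinstance(node, str):
        return position[node]
    _, children = node
    return min(first_position(child, position) for child in children)


def insert_in_order(children, new_child, document_order):
    """Insert new_child before the first sibling that starts after it."""
    position = {token_id: i for i, token_id in enumerate(document_order)}
    key = first_position(new_child, position)
    index = 0
    while index < len(children) and first_position(children[index], position) < key:
        index += 1
    return children[:index] + [new_child] + children[index:]


order = ["w.%d" % i for i in range(1, 8)]

# A unit on the first word should end up first, not last.
s_children = insert_in_order(order[1:], ("PRON", ["w.1"]), order)

# A unit in the middle should sit between its neighbouring tokens.
vp_children = insert_in_order(["w.2", "w.5", "w.6", "w.7"],
                              ("NP", ["w.3", "w.4"]), order)
```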

proycon commented 5 years ago

The final respan when clearing all of the remaining parent syntactic unit (which is common when the parent unit is fully covered by child syntactic units) doesn't happen and is instead a kind of no-operation:

EDIT su ID issue138.text.su.1 WITH class "S" annotator "flat" annotatortype "manual" datetime now confidence NONE FORMAT flat RETURN ancestor-focus

Probably because I didn't allow RESPANs to be empty, but that is valid and necessary here. So, new subissue replacing (b) (which turned out not to be wrong):

[x] (e) Allow RESPAN NONE on parent when inserting children (without deleting the parent span)

luutuntin commented 5 years ago

It is really enjoyable to follow your debugging process.

Actually, the visualisation issue I mentioned in my previous post is the same as (c) Tree visualisation can't deal with heterogenous children (su + wref). In the example "What is your name ?", PoS labels include WPRO, BEP, PRO$, N, PUNC (full reference).

I also noticed the wrong order you mentioned in your todo (d) post when I experimented with "They brought the documents on Tuesday ."

I just reviewed possible empty categories in syntax annotation and found that, in addition to traces of movement which can be handled by alignment annotation, there are other empty categories such as empty subjects which cannot be aligned to any available tokens and therefore require the insertion of new tokens. For example, in the following sentence there is an empty expletive subject, *exp*, as the silent counterpart of an existential subject. syntax_tree_empty_expletive_subject If we still want to preserve the original text content layer, should we extend FoLiA's specification so that we can insert additional tokens for this kind of empty category into the syntax layer (and this may be a desirable feature for other annotations such as implicit semantic roles)? Something like this:

<s xml:id="s.1">
    <w xml:id="s.1.w.1">
        <t>Is@</t>
        <pos class="BEP" />
    </w>
    <w xml:id="s.1.w.2">
        <t>@n't</t>
        <pos class="NEG" />
    </w>
    <w xml:id="s.1.w.3">
        <t>a</t>
        <pos class="D" />
    </w>
    <w xml:id="s.1.w.4">
        <t>whole</t>
        <pos class="ADJ" />
    </w>
    <w xml:id="s.1.w.5">
        <t>lot</t>
        <pos class="N" />
    </w>
    <w xml:id="s.1.w.6">
        <t>left</t>
        <pos class="VAN" />
    </w>
    <w xml:id="s.1.w.7">
        <t>.</t>
        <pos class="PUNC" />
    </w>
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <w xml:id="s.1.su.w.1">
                    <t>*exp*</t>
                    <pos class="EX" />
                </w>
            </su>
            <su xml:id="s.1.su.3" class="VP">
                <su xml:id="s.1.su.4" class="BEP">
                    <wref id="s.1.w.1" />
                </su>
                <su xml:id="s.1.su.5" class="NEG">
                    <wref id="s.1.w.2" />
                </su>
                <su xml:id="s.1.su.6" class="VP">
                    <su xml:id="s.1.su.7" class="NP=LGS">
                        <wref id="s.1.w.3" />
                        <su xml:id="s.1.su.8" class="ADJP">
                            <wref id="s.1.w.4" />
                        </su>
                        <wref id="s.1.w.5" />
                    </su>
                    <wref id="s.1.w.6" />
                </su>
            </su>
            <su class="PUNC">
                <wref id="s.1.w.7" />
            </su>
        </su>
    </syntax>
</s>

proycon commented 5 years ago

Glad the debugging is interesting ;) Sorry it took a while again (doing this in the midst of working on lots of other things too).

You raise a very interesting point regarding the empty words; we don't really have a good mechanism yet to explicitly accommodate such ghost words/tokens (if you have a more linguistically sound term I'd be glad to hear it :) ) so we indeed might need to extend FoLiA there. If you really want to do that in the current FoLiA then I'm thinking of a solution using a set definition for words/tokens that contains a class "empty/implicit/ghost" or however you want to call it, as opposed to a class which I'll call "normal".

<s xml:id="s.1">
    <w xml:id="s.1.w.0" class="ghost">
        <t>*exp*</t>
        <pos class="EX" />
    </w>
    <w xml:id="s.1.w.1" class="normal">
        <t>Is@</t>
        <pos class="BEP" />
    </w>
    <w xml:id="s.1.w.2" class="normal">
        <t>@n't</t>
        <pos class="NEG" />
    </w>
    <w xml:id="s.1.w.3" class="normal">
        <t>a</t>
        <pos class="D" />
    </w>
    <w xml:id="s.1.w.4" class="normal">
        <t>whole</t>
        <pos class="ADJ" />
    </w>
    <w xml:id="s.1.w.5" class="normal">
        <t>lot</t>
        <pos class="N" />
    </w>
    <w xml:id="s.1.w.6" class="normal">
        <t>left</t>
        <pos class="VAN" />
    </w>
    <w xml:id="s.1.w.7" class="normal">
        <t>.</t>
        <pos class="PUNC" />
    </w>
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" />
            </su>
            <su xml:id="s.1.su.3" class="VP">
                <su xml:id="s.1.su.4" class="BEP">
                    <wref id="s.1.w.1" />
                </su>
                <su xml:id="s.1.su.5" class="NEG">
                    <wref id="s.1.w.2" />
                </su>
                <su xml:id="s.1.su.6" class="VP">
                    <su xml:id="s.1.su.7" class="NP=LGS">
                        <wref id="s.1.w.3" />
                        <su xml:id="s.1.su.8" class="ADJP">
                            <wref id="s.1.w.4" />
                        </su>
                        <wref id="s.1.w.5" />
                    </su>
                    <wref id="s.1.w.6" />
                </su>
            </su>
            <su class="PUNC">
                <wref id="s.1.w.7" />
            </su>
        </su>
    </syntax>
</s>

But I think you are right that a more explicit notation may be necessary for these kinds of issues, rather than leaving it to the set definition. I like your suggestion, it's simple enough and currently not allowed so it's on the table as an option.

We'd have to consider that:

A word may be referenced multiple times in span annotation (so logically I think the first one could be a <w> as suggested and the subsequent ones normal <wref>)
Span annotation elements are not always strict on order (<su> by definition is though). Having a <w> appear inside span annotation might leave no clues to where it would be in the original text. Whether that is a problem or not I don't know yet, because they are ghost words/tokens that are not in the original anyway and I do prefer the idea of confining them to their span annotation layer because there may be different kinds of ghost words/tokens for different annotation types (you already suggested semantic roles).
The FoLiA libraries will need some mechanism to determine whether a word/token is a ghost/implicit/empty one or not, as it's common to pass Word instances to span annotation elements directly which are resolved to wrefs internally.
For FLAT this raises more challenges as to how to actually allow this in the interface.
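The first consideration (a <w> on first occurrence, <wref> afterwards) could be sketched as follows. This models the proposal only, which FoLiA did not allow at the time of writing; it uses plain `xml.etree.ElementTree` rather than any FoLiA library, and the helper `reference` and the class name "ghost" are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def reference(word_id, text, is_ghost, already_defined):
    """Build the element a span annotation would hold for one word.

    A ghost word is written out as <w> the first time it is referenced and
    as a plain <wref> afterwards; normal words are always a <wref>.
    """
    if is_ghost and word_id not in already_defined:
        already_defined.add(word_id)
        w = ET.Element("w", {XML_ID: word_id, "class": "ghost"})
        ET.SubElement(w, "t").text = text
        return w
    return ET.Element("wref", {"id": word_id})


defined = set()
first = reference("s.1.w.0", "*exp*", True, defined)    # defining occurrence
second = reference("s.1.w.0", "*exp*", True, defined)   # later reference
normal = reference("s.1.w.1", "Is@", False, defined)    # ordinary token
```

The `already_defined` set is the piece of state a library would need to carry per span annotation layer to make this decision.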

proycon commented 5 years ago

Subissue (c), the treeviewer not showing mixed content (su + wref), is caused by FLAT's representation pulling the two apart. The child su gets seen as an annotation and the wrefs get seen as the target. Unfortunately the ordering information is not readily available at that point so I'll have to devise a solution.
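For comparison, when the XML itself is walked in document order, mixed su + wref children are not a problem, because the ordering is still there. A minimal sketch that renders a bracketed tree from an un-namespaced fragment (the `t` attributes are the ones shown in the serialisations in this thread; the function name `bracket` is made up):

```python
import xml.etree.ElementTree as ET

SYNTAX = """
<su class="VP">
    <wref id="w.2" t="brought"/>
    <su class="NP">
        <wref id="w.3" t="the"/>
        <wref id="w.4" t="documents"/>
    </su>
    <wref id="w.5" t="on"/>
    <wref id="w.6" t="Tuesday"/>
</su>
"""


def bracket(node):
    """Render a <su> with mixed <su>/<wref> children as a bracketed string."""
    if node.tag == "wref":
        return node.get("t")
    parts = [bracket(child) for child in node if child.tag in ("su", "wref")]
    return "(" + " ".join([node.get("class")] + parts) + ")"


tree = bracket(ET.fromstring(SYNTAX))
# tree == "(VP brought (NP the documents) on Tuesday)"
```

Any solution in the viewer would need to carry this sibling order along when annotations and targets are separated.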

Subissue (d), wrong order, also still appears in this example, so the test failed:

        <syntax>
          <su annotator="flat" annotatortype="manual" class="S" datetime="2018-10-18T13:08:30" xml:id="issue138.text.su.1">
            <su annotator="flat" annotatortype="manual" class="PRON" datetime="2018-10-18T12:56:21" xml:id="issue138.text.su.2">
              <wref id="issue138.p.1.s.1.w.1" t="They"/>
            </su>
            <su annotator="flat" annotatortype="manual" class="VP" datetime="2018-10-30T15:12:03" xml:id="issue138.text.su.3">
              <wref id="issue138.p.1.s.1.w.2" t="brought"/>
              <wref id="issue138.p.1.s.1.w.5" t="on"/>
              <wref id="issue138.p.1.s.1.w.6" t="Tuesday"/>
              <wref id="issue138.p.1.s.1.w.7" t="."/>
              <su annotator="flat" annotatortype="manual" class="NP" datetime="2018-10-30T15:12:03" xml:id="issue138.text.su.4">
                <wref id="issue138.p.1.s.1.w.3" t="the"/>
                <wref id="issue138.p.1.s.1.w.4" t="documents"/>
              </su>
            </su>
          </su>
        </syntax>

luutuntin commented 5 years ago

Further debugging:

Somehow I did end up in a correct state despite the respan error:

I tried to replicate this by first creating IP-MAT for the whole sentence, then VP for "brought the documents on Tuesday ." (which is actually not accurate as the punctuation should be excluded from VP), and finally NP-SBJ for "They" and ended up having the following:

I guess the difference between your original annotation and my replication is that <wref t='They' ...> appears in both IP-MAT and NP-SBJ levels in FoLiA XML file, which is not desirable and even not legitimate. And this may explain the fact that I cannot open Annotation Editor dialog for further annotation for this particular sentence.

After creating IP-MAT for the whole sentence, if I exclude the punctuation token from VP or create NP-SBJ before VP, I will end up losing VP or NP-SBJ respectively.

In any case, the last annotated span is always inserted into its parent <su> as the last child, which can cause the wrong order issue as we already noticed. In the replication case, this last annotated span does not replace its previous sister but its component token (i.e. <wref t='They' ...>) is not removed from its parent <su> (i.e. <su class='IP-MAT' ...>). In the cases relating to the annotation loss, this last annotated span replaces its previous sister, which causes the loss.

I wonder how you can create PRON and VP (before experimenting with NP creation) in the example of your last post. Is that the case when you hard-code them in FoLiA XML file?

Subissue (c), the treeviewer not showing mixed content (su + wref), is caused by FLAT's representation pulling the two apart. The child su gets seen as an annotation and the wrefs get seen as the target. Unfortunately the ordering information is not readily available at that point so I'll have to devise a solution.

As I understand, currently, when we add a new child <su>, FLAT considers the <wref>s as its targets. How about identifying the parent <su> as the (only) target instead? (In this case, we need a dummy root <su>.) Sorry if my understanding is too naive.

Another note on allowing <w> under span annotation units: in addition to empty categories, we may want to split a word (or join words) in syntax annotation layer, but not in text content layer. For example, in sentence "Lemme go .", we may want to keep the orthographic word "lemme" in text content layer, but split it into "lem@" and "@me" in syntax annotation layer.

Last (currently) but not least, I really like how open-minded you are in terms of FoLiA development. Actually, your bottom-up and practice-driven fashion immediately attracted me when I got to know FoLiA and its rich infrastructure of tools.

proycon commented 5 years ago

Small update: I haven't forgotten about this but since this issue requires changes in FoLiA and its libraries I'm taking it along in the development of FoLiA v2.0 which is currently in full progress (and which depends on a fair amount of other new stuff too so takes some time).

luutuntin commented 5 years ago

I'm sure that you are working hard on certain radical changes. Thank you.

proycon commented 5 years ago

Now that FoLiA v2 is released, I'm working on FLAT again and making progress regarding this issue. I hadn't commented on this yet though:

Another note on allowing <w> under span annotation units: in addition to empty categories, we may want to split a word (or join words) in syntax annotation layer, but not in text content layer. For example, in sentence "Lemme go .", we may want to keep the orthographic word "lemme" in text content layer, but split it into "lem@" and "@me" in syntax annotation layer.

That would be possible by introducing a morphology layer on the "lemme" word, with two morphemes, and then link to the morphemes from the syntax layer. FoLiA supports that but it may require some additional work in FLAT still.

proycon commented 5 years ago

subissue d) (Insertion point of ADD queries should be computed in case of nested span elements (su)) seems okay now.

proycon commented 5 years ago

A new subissue f arose, which may be related to e) (Allow RESPAN NONE on parent when inserting children (without deleting the parent span) ).

[x] (f) Left part of tree gets lost when attaching a new right part, parent respanned to cover only the right part, left sibling gets deleted.

This seems an instance of what @luutuntin already reported here:

After creating IP-MAT for the whole sentence, if I exclude the punctuation token from VP or create NP-SBJ before VP, I will end up losing VP or NP-SBJ respectively.

FQL queries:

USE proycon/issue138 PROCESSOR name "proycon" type manual IN $FLAT_PROCESSOR IN $FOLIADOCSERVE_PROCESSOR EDIT su ID issue138.text.su.1 WITH class "S" datetime now confidence NONE RESPAN NONE FORMAT flat RETURN ancestor-focus

USE proycon/issue138 PROCESSOR name "proycon" type manual IN $FLAT_PROCESSOR IN $FOLIADOCSERVE_PROCESSOR ADD su OF adhoc WITH class "VP" datetime now confidence NONE SPAN  ID issue138.p.1.s.1.w.2 & ID issue138.p.1.s.1.w.3 & ID issue138.p.1.s.1.w.4 & ID issue138.p.1.s.1.w.5 & ID issue138.p.1.s.1.w.6 & ID issue138.p.1.s.1.w.7 FOR ID issue138.text.su.1 FORMAT flat RETURN ancestor-target

This seems caused by the RESPAN NONE (which was also the subject of subissue e), I'll have to make sure that RESPAN NONE does NOT affect any underlying child elements.
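The intended behaviour can be sketched as follows: a respan touches only the direct <wref> children of the unit and leaves nested <su> children alone. This is an illustration on un-namespaced XML with the standard library, not the document server's implementation; for simplicity it appends new references at the end and ignores ordering.

```python
import xml.etree.ElementTree as ET

PARENT = """
<su class="S">
    <su class="PRON"><wref id="w.1"/></su>
    <wref id="w.2"/>
    <wref id="w.3"/>
</su>
"""


def respan(su, token_ids):
    """Replace the direct <wref> children of su; nested <su> children stay."""
    for child in list(su):
        if child.tag == "wref":
            su.remove(child)
    for token_id in token_ids:
        ET.SubElement(su, "wref", {"id": token_id})


parent = ET.fromstring(PARENT)
respan(parent, [])  # the RESPAN NONE case: no direct targets remain
```

After the call the PRON unit and its own reference are still in place; only the parent's direct references are gone.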

proycon commented 5 years ago

Subissue f is fixed now, e is also confirmed solved. Subissue c remains still.

proycon commented 5 years ago

Tree visualisation (subissue c) is fixed now as well

proycon commented 5 years ago

A short summary of dependencies to be implemented for proper syntactic movement support:

[done] ~Add support for hidden words/token and allow them to be targets for references~ #141
[done] ~Add support for appending/prepending hidden words in the annotation editor~ #145
[done] Implement support for relations #84

These should be realisable in the immediate short term (I hope), even though #84 is a fairly big component.

The next list shows what is additionally needed if you want to refer to sub-parts of a token (i.e. morphemes) rather than to a whole token. A workaround is to adapt the tokenisation layer (in a preprocessing step prior to FLAT).

Allowing using morphemes as a target in span annotation #142
Implement support for viewing and editing of morphemes and phonemes #13 (this would be a big component and not realisable in the immediate short term).

proycon commented 5 years ago

@luutuntin Hi! FLAT v0.8.0 was released a bit over a week ago, implementing a lot of syntax annotation fixes stemming from this issue. As I already mentioned in the previous summary post, I plan to implement relations (#84) for v0.9.0 (aiming for the beginning of June as I have some other priorities in other projects first). I just wanted to check if you guys are still planning on using FLAT for your syntactic movement annotation task, and what your timeline is? Can you also let me know if issue proycon/folia#50 has more or less priority for you?

luutuntin commented 5 years ago

We are starting morphological annotation now; therefore, issue proycon/folia#50 has more priority for us. Regarding syntactic annotation, we are developing the guidelines, and will start (rule-based) automatic annotation first, which does not involve using FLAT. I don't think we will start any manual correction of syntactic annotation (using FLAT) before July. Again, thank you so much.

luutuntin commented 5 years ago

In addition, proycon/flat#134 is also relevant to us, because our annotators of morphological analysis may detect some errors in the transcripts and want to make certain corrections in FLAT, for instance, adding quotes.

luutuntin commented 5 years ago

In addition, proycon/flat#134 is also relevant to us, because our annotators of morphological analysis may detect some errors in the transcripts and want to make certain corrections in FLAT, for instance, adding quotes.

Sorry for my ignorance. This is solved by proycon/flat/#145. And I just realized that in FoLiA documentation the introduction and specification of hidden token annotation are the same as those of token annotation, which should not be. You might have forgotten to update them.

proycon commented 5 years ago

Right, good point, something went wrong there indeed. I'll fix it!

proycon commented 5 years ago

I released FLAT v0.9.0 last week, this implements the essentials that should enable syntactic movement annotation, as mentioned in my comment from April 18th. Proper support for alternative annotations is also implemented in the latest release.

I suggest we make a new issue to continue the discussion on additional features needed for syntactic movement (this one is getting rather long and most has been resolved).

luutuntin commented 5 years ago

That sounds great. Thank you.

proycon / flat

Fix syntax annotation and add syntactic movement support (through alignments/relations) (T062) #138