nert-nlp / streusle

STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
Creative Commons Attribution Share Alike 4.0 International

User-friendly concordance format and token update script #54

Closed by nschneid 4 years ago

nschneid commented 5 years ago

For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner), it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing, giving a more human-readable view of the lexical annotation.

Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)

There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?

Is there a natural way to specify MWE edits in the concordance view as well? Currently, adding an MWE or changing the strength of an existing MWE is painful to do by hand in .conllulex.

nschneid commented 5 years ago

Need to determine whether tquery.py can print token IDs (for updateability). Rather than displaying the full sentence as a string with the target token highlighted, maybe display the left and right context in separate columns, and cap their length. Consider showing MWE markup in the context.
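To make this concrete, a row might look something like the following (the column layout and the values are illustrative, not necessarily tquery.py's actual output; _sentid, _tokoffset, ss, and ss2 are the fields named in the spec below):

    _sentid              _tokoffset  left_context    token  ss          ss2         right_context
    reviews-001325-0003  5           We had to wait  for    p.Duration  p.Duration  an hour before ...

Sorting such rows by the ss column would group together all tokens sharing a scene role for batch review.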

nschneid commented 5 years ago

Option to show full tagging in context, like in streusvis.py?

nschneid commented 5 years ago

Spec for tupdate.py:

INPUT: streusle.json edited_tquery_output_tsv

  1. Make sure the 2 header rows are present in edited_tquery_output_tsv
    • The first header row of edited_tquery_output_tsv contains the commit hash. Warn if that does not match the current git commit hash (this could indicate that the data has been modified since tquery.py was run).
    • The second header row of edited_tquery_output_tsv specifies the column headers. Ensure _sentid and _tokoffset are present, and at least one of {ss, ss2, lexcat}.
  2. Check for edits to prohibited fields
    • For each row and field: compare against the JSON (or the original tquery output file?) to see whether changes have been made. If a change is found in any field other than ss, ss2, or lexcat, throw an error (see the sketch after this list).
  3. Implement token edits to ss, ss2, and lexcat by updating the JSON data structure. Do not validate these fields, as other scripts will do that. Regenerate lextag accordingly.
  4. Print an updated JSON (to be converted back to conllulex).
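A minimal sketch of the checks in steps 1-2, assuming the output is tab-separated and that the first header row contains the commit hash as plain text; read_edited_tsv and the exact header layout are hypothetical, not tquery.py's actual format:

    import subprocess, sys

    EDITABLE = {'ss', 'ss2', 'lexcat'}

    def read_edited_tsv(path):
        """Parse the edited tquery output and run the step-1 sanity checks."""
        with open(path, encoding='utf-8') as f:
            commitrow = f.readline().strip()                # header row 1: commit hash
            header = f.readline().rstrip('\n').split('\t')  # header row 2: column names
            head = subprocess.check_output(['git', 'rev-parse', 'HEAD'], encoding='utf-8').strip()
            if head not in commitrow:
                print('warning: data may have been modified since tquery.py was run',
                      file=sys.stderr)
            assert '_sentid' in header and '_tokoffset' in header, 'missing key columns'
            assert EDITABLE & set(header), 'no editable column (ss, ss2, lexcat) present'
            rows = [dict(zip(header, line.rstrip('\n').split('\t'))) for line in f]
        return header, rows

Step 2 would then compare each row's remaining cells against the values recomputed from streusle.json and raise an error on any discrepancy outside the editable columns.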
nschneid commented 5 years ago

For modifying MWEs as well as tags, it would be nice to be able to edit an inline format of the kind produced by streusvis.py, but enhanced to include lexcats.

For that, we need to be able to parse the format. mwerender.py defines render(); we need to add unrender(). It could work as follows:

Signature:

def unrender(rendered, toks):
    """Given the rendered sentence (tokens interleaved with MWE/tag markup)
    and the list of plain tokens, recover the BIO-style MWE tagging."""
    assert not any((not t) or ' ' in t for t in toks)  # tokens must be nonempty and space-free
    ...
    return bio_tagging
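For instance, unrender('put_ it _down', ['put', 'it', 'down']) would be expected to return ['B', 'o', 'I_'] (lowercase for the token inside the gap).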
  1. Given the sentence with MWE and tag markup, construct a regex to identify which characters belong to tokens and which are markup. Since we know the tokens, we can avoid assumptions about their characters (they may themselves contain _, ~, and |). An example of the constructed pattern follows the code below.

    if len(toks)==1:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?)$'
    elif len(toks)==2: # no gaps allowed
        reMarkup = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
                    rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    else:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?( |~ ?)|_ ?)'
        for i in range(1,len(toks)-2):
            reMarkup += rf'(?P<t{i}>{re.escape(toks[i])})((?P<T{i}>\|[^ _~]+)?( |~ ?| [~_])|_ ?)'
        reMarkup += (rf'(?P<t{len(toks)-2}>{re.escape(toks[-2])})'
                     rf'((?P<T{len(toks)-2}>\|[^ _~]+)?( | ?~| _)|_)'
                     rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    matches = re.match(reMarkup, rendered)
    if not matches:
        raise ValueError(f'Invalid markup: {rendered}')
    groups = matches.groupdict()   # regex named groups, not MWE groups
    # Groups t0, t1, ..., tn match the tokens
    # Groups T0, T1, ..., Tn match the supersense/lexcat tags where present
    # Everything else is markup. Note that this does not fully validate the markup; unclosed gaps are allowed, and labels on strong MWEs are optional.
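For example, with toks = ['a', 'lot'], the two-token branch would build the pattern

    ^(?P<t0>a)((?P<T0>\|[^ _~]+)?[ ~]|_)(?P<t1>lot)(?P<T1>\|[^ _~]+)?$

which matches a_lot (strong MWE), a~lot (weak MWE), and plain a lot, with optional |tags wherever the markup allows them.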
  2. For each token as it occurs in the rendered string, look at the characters immediately to its left and right (ignoring the tag, if present) to determine the appropriate BIO tag (a worked trace follows the code):

    bio_tagging = []
    ingap = False   # holds '_' or '~' while inside a gap, else False
    for i in range(len(toks)):
        # l, r = MWE markup/spaces on left and right of the token
        if i==0: l = '^'
        else:
            # use the tag group's span if the tag matched, else the token group's
            prev = f'T{i-1}' if groups[f'T{i-1}'] is not None else f't{i-1}'
            l = rendered[matches.end(prev):matches.start(f't{i}')]

        if i==len(toks)-1: r = '$'
        else:
            cur = f'T{i}' if groups[f'T{i}'] is not None else f't{i}'
            r = rendered[matches.end(cur):matches.start(f't{i+1}')]

        assert l in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '^'}
        assert r in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '$'}
        if i>0 and l=='_': tag = 'i_' if ingap else 'I_'
        elif i>0 and l=='~': tag = 'i~' if ingap else 'I~'
        elif i>0 and l==' _': assert ingap=='_'; ingap = False; tag = 'I_'  # gap just closed
        elif i>0 and l==' ~': assert ingap=='~'; ingap = False; tag = 'I~'
        elif r in {' ', '$', ' _', ' ~'}: tag = 'o' if ingap else 'O'  # ' _'/' ~': last token in a gap
        else: tag = 'b' if ingap else 'B'

        bio_tagging.append(tag)
        if r=='_ ' or r=='~ ': assert not ingap; ingap = r.strip()  # a gap opens here
    assert not ingap   # every gap must have been closed
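Tracing the gappy example put_ it _down (toks = ['put', 'it', 'down']) through this loop:

    token  l      r      tag
    put    '^'    '_ '   B    (a gap opens; ingap becomes '_')
    it     '_ '   ' _'   o    (plain token inside the gap)
    down   ' _'   '$'    I_   (the gap closes; ingap resets)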
  3. Generate the labeled BIO tags and infer the rest of the JSON from that. For any sentence where the lextags have changed (a sketch of this round trip follows the list):

    1. Convert the JSON for the sentence to conllulex. Requires updating json2conllulex.py so the conversion can happen without an actual file.
    2. Strip out the lexical semantic analyses. Requires an update to conllulex2UDlextag.py.
    3. Substitute the modified lextags in the last column.
    4. Convert the modified UDlextag to JSON. Requires an update to UDlextag2json.py.
    5. Re-render the sentence to make sure it matches what the user specified.
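Roughly, the per-sentence round trip might look like this. The helpers json_to_conllulex, strip_lexical_analyses, udlextag_to_json, and render_with_lexcats are placeholders for the refactored in-memory entry points of the scripts named above, none of which exist yet, and complications such as multiword-token ranges and ellipsis nodes are glossed over:

    def apply_lextag_edits(sent_json, new_lextags, expected_rendered):
        # steps 1-2: JSON -> conllulex -> UDlextag, all in memory
        udlextag = strip_lexical_analyses(json_to_conllulex(sent_json))
        out, tok_i = [], 0
        for line in udlextag.splitlines():
            if line and not line.startswith('#'):
                cols = line.split('\t')
                cols[-1] = new_lextags[tok_i]  # step 3: splice in the edited lextag
                tok_i += 1
                line = '\t'.join(cols)
            out.append(line)
        # step 4: parse the modified UDlextag back to JSON
        new_sent_json = udlextag_to_json('\n'.join(out))
        # step 5: re-render; a mismatch means the user's edit did not parse as intended
        assert render_with_lexcats(new_sent_json) == expected_rendered
        return new_sent_json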