Closed nschneid closed 4 years ago
Need to determine whether tquery.py can print token IDs (for updateability). Rather than displaying the full sentence as a string with the target token highlighted, maybe display the left and right context in separate columns, and cap their length. Consider showing MWE markup in the context.
Option to show full tagging in context, like in streusvis.py?
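The split-context layout suggested above could look roughly like the following. This is only a sketch: concordance_row and its width parameter are hypothetical names, not anything tquery.py currently provides, and the context is capped by simple character truncation.

```python
def concordance_row(words, target, width=30):
    """Return (left, word, right) columns for words[target].

    Each context column is capped at `width` characters, keeping the
    part of the context that is nearest to the target token.
    """
    left = ' '.join(words[:target])
    right = ' '.join(words[target + 1:])
    return left[-width:], words[target], right[:width]


# One tab-separated row per target token, ready for sorting in a spreadsheet
row = concordance_row(['I', 'went', 'to', 'the', 'store'], 2, width=6)
print('\t'.join(row))
```

Token IDs (sentence ID plus token offset) would go in additional columns so that edited rows can be mapped back to the data.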
Spec for tupdate.py:
- INPUT: streusle.json edited_tquery_output_tsv
- Check that _sentid and _tokoffset are present, along with at least one of {ss, ss2, lexcat}.
- [...] ss, ss2, or lexcat, throw an error.
- Apply edits to ss, ss2, and lexcat by updating the JSON data structure. Do not validate these fields, as other scripts will do that. Regenerate lextag accordingly.

For modifying MWEs as well as tags, it would be nice to be able to edit an inline format of the kind produced by streusvis.py, but enhanced to include lexcats.
For that, we need to be able to parse the format. mwerender.py defines render(); we need to add unrender(). It could work as follows:
Signature:

    import re

    def unrender(rendered, toks):
        # every token must be nonempty and contain no spaces
        assert not any((not t) or ' ' in t for t in toks)
        ...
        return bio_tagging
Given the sentence with MWE and tag markup, construct a regex to identify which characters belong to tokens and which are markup. As we know the tokens, we can avoid assumptions about their characters (they may contain _, ~, and |).

    if len(toks)==1:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?)$'
    elif len(toks)==2: # no gaps allowed
        reMarkup = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
                    rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    else:
        reMarkup = rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?( |~ ?)|_ ?)'
        for i in range(1,len(toks)-2):
            reMarkup += rf'(?P<t{i}>{re.escape(toks[i])})((?P<T{i}>\|[^ _~]+)?( |~ ?| [~_])|_ ?)'
        reMarkup += (rf'(?P<t{len(toks)-2}>{re.escape(toks[-2])})'
                     rf'((?P<T{len(toks)-2}>\|[^ _~]+)?( | ?~| _)|_)'
                     rf'(?P<t{len(toks)-1}>{re.escape(toks[-1])})(?P<T{len(toks)-1}>\|[^ _~]+)?$')
    matches = re.match(reMarkup, rendered)
    if not matches:
        raise ValueError(f'Invalid markup: {rendered}')
    groups = matches.groupdict() # regex named groups, not MWE groups
    # Groups t0, t1, ..., tn match the tokens
    # Groups T0, T1, ..., Tn match the supersense/lexcat tags where present
    # Everything else is markup. Note that this does not fully validate the markup;
    # unclosed gaps are allowed, and labels on strong MWEs are optional.

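As a concrete check of the idea, here is the two-token pattern applied to a made-up rendering whose first token itself contains an underscore. The tokens and the n.FOOD tag are invented for illustration. One detail of Python's re module matters for the next step: groupdict() contains a key for every named group, with value None for groups that did not participate, and Match.end() returns -1 for such groups, so whether a tag is present has to be tested on the value rather than on the key.

```python
import re

toks = ['a_b', 'c']          # hypothetical tokens; the first contains '_'
rendered = 'a_b|n.FOOD~c'    # weak MWE, with a tag on the first token only

pattern = (rf'^(?P<t0>{re.escape(toks[0])})((?P<T0>\|[^ _~]+)?[ ~]|_)'
           rf'(?P<t1>{re.escape(toks[1])})(?P<T1>\|[^ _~]+)?$')
m = re.match(pattern, rendered)

print(m.group('t0'))         # the token, including its own underscore
print(m.group('T0'))         # the tag, including the leading '|'
print(m.group('T1'))         # None: the second token has no tag
print(m.end('T1'))           # -1, because group T1 did not participate

# The markup between the two tokens is whatever lies between the end of
# the first token's tag and the start of the second token.
print(repr(rendered[m.end('T0'):m.start('t1')]))
```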
For each token as it occurs in the rendered string, look at the characters immediately left and right (ignoring the tag if present) to determine the appropriate BIO tag:

    bio_tagging = []
    ingap = False # False, or the joiner ('_' or '~') of the MWE whose gap is open
    for i in range(len(toks)):
        # l, r = MWE markup/spaces on left and right
        if i==0: l = '^'
        else:
            # end() is -1 for a tag group that did not match, so test the group's value
            l = rendered[matches.end(f'T{i-1}' if groups[f'T{i-1}'] is not None else f't{i-1}'):matches.start(f't{i}')]
        if i==len(toks)-1: r = '$'
        else:
            r = rendered[matches.end(f'T{i}' if groups[f'T{i}'] is not None else f't{i}'):matches.start(f't{i+1}')]
        assert l in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '^'}
        assert r in {' ', '_', '~', '_ ', '~ ', ' _', ' ~', '$'}
        if i>0 and l=='_': tag = 'i_' if ingap else 'I_'
        elif i>0 and l=='~': tag = 'i~' if ingap else 'I~'
        elif i>0 and l==' _': assert ingap=='_'; ingap = False; tag = 'I_'
        elif i>0 and l==' ~': assert ingap=='~'; ingap = False; tag = 'I~'
        # a gap-closing marker on the right belongs to the next token, not this one
        elif r in {' ', ' _', ' ~', '$'}: tag = 'o' if ingap else 'O'
        else: tag = 'b' if ingap else 'B'
        bio_tagging.append(tag)
        if r=='_ ' or r=='~ ': assert not ingap; ingap = r.strip()
    assert not ingap

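For testing unrender(), a round trip through a renderer is a natural check. The following is a deliberately simplified stand-in, not the render() in mwerender.py: toy_render is a hypothetical name, it ignores labels entirely, and it assumes the markup conventions implied by the logic above ('_' and '~' join tokens of strong and weak MWEs, '_ ' or '~ ' opens a gap, ' _' or ' ~' closes it).

```python
def toy_render(toks, bio):
    """Render tokens with MWE markup from unlabeled BIO tags (simplified sketch)."""
    out = [toks[0]]
    for i in range(1, len(toks)):
        prev, tag = bio[i - 1], bio[i]
        if tag[0] == 'I' and prev[0] in 'obi':
            sep = ' ' + tag[1]      # gap closes: ' _' or ' ~'
        elif tag[0] in 'Ii':
            sep = tag[1]            # plain continuation: '_' or '~'
        elif tag in ('o', 'b') and prev[0] in 'BI':
            # gap opens; the joiner is taken from the tag that later closes the gap
            joiner = next(t[1] for t in bio[i + 1:] if t[0] == 'I')
            sep = joiner + ' '
        else:
            sep = ' '
        out.append(sep + toks[i])
    return ''.join(out)


print(toy_render(['make', 'it', 'up'], ['B', 'o', 'I_']))
print(toy_render(['a', 'lot', 'of', 'fun'], ['B', 'I_', 'I_', 'O']))
```

With something like this in place, unrender(toy_render(toks, bio), toks) == bio would be a cheap property to assert over the whole corpus.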
Generate the labeled BIO tags and infer the rest of the JSON from that. For any sentence where the lextags have changed:
For revising certain classes of annotations (e.g., P supersenses where the scene role is Manner) it would be useful to have a concordance view. This would put a token's context on the same line for easy sorting and batch editing. So it would be a more human-readable view of the lexical annotation.
Does tquery.py already do this? Should it be run when building the release to produce a row for every supersense-annotated strong lexical expression, within the train/dev/test subdirectories? This would make diffs in commit history easier to read. (Not having this in the root directory would make it clear that .conllulex is the canonical data file.)
There would need to be a script to apply supersense edits made in the concordance view back to the original. untquery.py? tupdate.py?
Is there a natural way to specify MWE edits in the concordance view, also? Currently, adding an MWE or changing the strength of an existing MWE is painful to implement by hand in .conllulex.
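A minimal sketch of the tag-editing half of tupdate.py, following the spec earlier in this thread. Several things here are assumptions rather than settled design: apply_edits is a hypothetical name; the JSON is assumed to be a list of sentence dicts with a sent_id key and swes/smwes dicts whose entries carry toknums, lexcat, ss, and ss2; _tokoffset is assumed to be the token number of an expression's first token; and an empty TSV cell is assumed to mean "no value". Regenerating lextag is left to the caller.

```python
import csv
import io

EDITABLE = ('ss', 'ss2', 'lexcat')


def apply_edits(sentences, tsv_text):
    """Apply edited TSV rows to the sentence dicts in place.

    Returns the set of sentence IDs whose annotations changed, so that
    lextags can be regenerated for just those sentences.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    cols = reader.fieldnames or []
    if '_sentid' not in cols or '_tokoffset' not in cols:
        raise ValueError('TSV must contain _sentid and _tokoffset columns')
    fields = [c for c in EDITABLE if c in cols]
    if not fields:
        raise ValueError('TSV must contain at least one of ss, ss2, lexcat')

    by_id = {s['sent_id']: s for s in sentences}
    changed = set()
    for row in reader:
        sent = by_id[row['_sentid']]
        offset = int(row['_tokoffset'])
        # assumption: the row targets the expression whose first token is at offset
        lexes = [e for kind in ('swes', 'smwes')
                 for e in sent[kind].values() if e['toknums'][0] == offset]
        if len(lexes) != 1:
            raise ValueError(f'No unique expression at {row["_sentid"]}:{offset}')
        for f in fields:
            new = row[f] or None    # assumption: empty cell means no value
            if lexes[0].get(f) != new:
                lexes[0][f] = new   # no validation here; other scripts do that
                changed.add(sent['sent_id'])
    return changed
```

MWE edits are out of scope for this sketch; those would go through the inline format and unrender().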