talk2dfox / label-alignment

Utilities for transforming sequence annotation between IOB (and variants) and character spans

add support for non-whitespace tokenization #4

Open talk2dfox opened 3 months ago

talk2dfox commented 3 months ago

Currently, tok2spans.iob2spans accepts parallel lists of tokens and IOB-style labels. Since no source text is passed in, it constructs one by joining the tokens with a single space as the delimiter.
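For illustration, here is a minimal sketch of the behaviour described above. It is not the actual implementation of tok2spans.iob2spans: the function name `iob_to_spans_whitespace`, its return shape, and the lenient handling of a stray `I-` label are all assumptions made for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type), end exclusive


def iob_to_spans_whitespace(tokens: List[str], labels: List[str]) -> Tuple[str, List[Span]]:
    """Hypothetical helper: join tokens with one space and derive character spans."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    text = " ".join(tokens)
    spans: List[Span] = []
    current = None  # [start, end, type] of the entity being built, if any
    offset = 0
    for token, label in zip(tokens, labels):
        start, end = offset, offset + len(token)
        offset = end + 1  # step over the single-space delimiter
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current is not None and current[2] == etype:
            current[1] = end  # extend the open entity
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if prefix in ("B", "I"):  # assumption: a stray I- starts a new entity
            current = [start, end, etype]
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return text, spans


text, spans = iob_to_spans_whitespace(
    ["Ada", "Lovelace", "visited", "London"],
    ["B-PER", "I-PER", "O", "B-LOC"],
)
```

The offsets are only valid for the reconstructed text, which is the limitation this issue is about: punctuation or subword tokens end up separated by spaces that the original text never contained.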

It would be nice to support more flexible tokenization. One possibility is to replace the list of tokens with an existing text plus a tokenization of that text, including a mapping from each token to its character span.
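A rough sketch of what that interface could look like, under the assumption that the tokenization is given as a list of (start, end) character offsets into the original text. The names `tokenize_with_offsets` and `iob_to_spans_from_offsets` are hypothetical, and the regex tokenizer is only a stand-in for whatever tokenizer produced the offsets.

```python
import re
from typing import List, Tuple

Offsets = Tuple[int, int]    # (start, end) of one token, end exclusive
Span = Tuple[int, int, str]  # (start, end, entity type), end exclusive


def tokenize_with_offsets(text: str) -> List[Offsets]:
    """Stand-in tokenizer: words and single punctuation marks, with offsets."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def iob_to_spans_from_offsets(token_spans: List[Offsets], labels: List[str]) -> List[Span]:
    """Hypothetical variant that reads offsets instead of rebuilding the text."""
    if len(token_spans) != len(labels):
        raise ValueError("token_spans and labels must have the same length")
    spans: List[Span] = []
    current = None  # [start, end, type] of the entity being built, if any
    for (start, end), label in zip(token_spans, labels):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current is not None and current[2] == etype:
            current[1] = end  # extend the open entity to this token's end
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if prefix in ("B", "I"):  # assumption: a stray I- starts a new entity
            current = [start, end, etype]
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return spans


text = "Ada Lovelace, born in London."
token_spans = tokenize_with_offsets(text)
labels = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
entities = iob_to_spans_from_offsets(token_spans, labels)
```

Because the spans index into the original text, the comma and the final period do not shift any offsets, and `text[start:end]` recovers each entity exactly as written.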

talk2dfox commented 3 months ago

This would also be useful for QC, as a round trip should yield exactly the same spans.
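One way such a check might look, assuming the round trip meant here is spans to IOB labels and back to spans over a fixed tokenization. Everything in this sketch (`spans_to_iob`, `iob_to_spans`, `round_trips`) is a hypothetical name, not part of the current package.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]    # (start, end) of one token, end exclusive
Span = Tuple[int, int, str]  # (start, end, entity type), end exclusive


def spans_to_iob(token_spans: List[Offsets], spans: List[Span]) -> List[str]:
    """Label each token that lies fully inside an entity span."""
    labels = ["O"] * len(token_spans)
    for start, end, etype in spans:
        inside = False
        for i, (tok_start, tok_end) in enumerate(token_spans):
            if tok_start >= start and tok_end <= end:
                labels[i] = ("I-" if inside else "B-") + etype
                inside = True
    return labels


def iob_to_spans(token_spans: List[Offsets], labels: List[str]) -> List[Span]:
    """Merge consecutive B-/I- tokens of one type into character spans."""
    spans: List[Span] = []
    current = None  # [start, end, type] of the entity being built, if any
    for (start, end), label in zip(token_spans, labels):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current is not None and current[2] == etype:
            current[1] = end
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if prefix in ("B", "I"):
            current = [start, end, etype]
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return spans


def round_trips(token_spans: List[Offsets], spans: List[Span]) -> bool:
    """True if spans survive conversion to IOB labels and back unchanged."""
    return iob_to_spans(token_spans, spans_to_iob(token_spans, spans)) == sorted(spans)


# Offsets for "Ada Lovelace, born in London." with punctuation as separate tokens.
token_spans = [(0, 3), (4, 12), (12, 13), (14, 18), (19, 21), (22, 28), (28, 29)]
aligned = [(0, 12, "PER"), (22, 28, "LOC")]
misaligned = [(0, 10, "PER")]  # ends in the middle of the token "Lovelace"
```

A failed round trip, as with `misaligned`, flags annotations whose boundaries do not coincide with token boundaries, which is the kind of problem a QC pass would want to surface.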