@jchill-git and I had a call discussing what changes we'd need to make to support custom tokenizers (e.g. leveraging tools from CAMeL).
Our end goal would be to support additional tokenizers / tokenization schemes on a version-by-version basis.
Initially, @jchill-git will produce a CSV and I will be adding some "customization hooks" to scaife-viewer-atlas to use that CSV rather than the "built-in" tokenizer (which simply splits on whitespace).
I got an initial proof-of-concept done today (see screenshot below) and will keep working on this iteratively to support "subword" tokens and punctuation across the Scaife Viewer stack (backend / frontend).
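To make the shape of the "customization hooks" concrete, here is a minimal sketch of the CSV-override idea: look up pre-tokenized tokens by passage reference, and fall back to the built-in whitespace split when a reference has no override. Everything here is hypothetical — the `ref` / `token` column names, the `load_token_overrides` and `tokenize` helpers, and the CSV layout are all illustrative assumptions, not the actual scaife-viewer-atlas API.

```python
import csv
import io


def load_token_overrides(csv_file):
    """Build a mapping of passage ref -> token list from a CSV with
    (hypothetical) `ref` and `token` columns, one token per row."""
    overrides = {}
    for row in csv.DictReader(csv_file):
        overrides.setdefault(row["ref"], []).append(row["token"])
    return overrides


def tokenize(ref, text, overrides):
    """Prefer CSV-supplied tokens (e.g. from a CAMeL-based pipeline);
    otherwise fall back to the built-in whitespace tokenizer."""
    return overrides.get(ref, text.split())


# A subword-aware CSV can split a single whitespace-delimited word
# into multiple tokens, which plain `str.split()` cannot do.
sample = io.StringIO("ref,token\n1.1,foo\n1.1,bar\n")
overrides = load_token_overrides(sample)
print(tokenize("1.1", "foobar baz", overrides))
print(tokenize("1.2", "plain whitespace", overrides))
```

The same lookup-then-fallback pattern should extend naturally to the later punctuation and subword work, since the CSV fully controls token boundaries for any version it covers.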