scaife-viewer / beyond-translation-site

Site used to iterate on translation alignments within the Scaife Viewer ecosystem

Override tokenizer #163

Open jacobwegner opened 1 year ago

jacobwegner commented 1 year ago

@jchill-git and I had a call discussing what changes we'd need to make to support custom tokenizers (e.g. leveraging tools from CAMeL).

Our end goal would be to support additional tokenizers / tokenization schemes on a version-by-version basis.

Initially, @jchill-git will produce a CSV, and I will add some "customization hooks" to scaife-viewer-atlas so that it uses that CSV instead of the "built-in" tokenizer (which simply splits on whitespace).
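To make the idea concrete, here is a rough sketch of what such a hook could look like. It is not the actual scaife-viewer-atlas code: the function names and the CSV layout (`ref` and `token` columns) are assumptions for illustration only, and the real CSV format is still to be decided.

```python
import csv
import io


def whitespace_tokenize(text):
    """Stand-in for the built-in behaviour described above: split on whitespace."""
    return text.split()


def load_token_overrides(csv_file):
    """Read a hypothetical CSV with `ref` and `token` columns.

    Returns a dict mapping each text part reference to its tokens,
    in the order the rows appear in the file.
    """
    overrides = {}
    for row in csv.DictReader(csv_file):
        overrides.setdefault(row["ref"], []).append(row["token"])
    return overrides


def tokenize_line(ref, text, overrides):
    """Use the CSV-provided tokens when present, else fall back to whitespace."""
    if ref in overrides:
        return overrides[ref]
    return whitespace_tokenize(text)


# Hypothetical sample data: the CSV splits one whitespace-delimited word
# into two tokens, which plain whitespace splitting cannot do.
SAMPLE_CSV = "ref,token\n1.1,wa\n1.1,qala\n"
overrides = load_token_overrides(io.StringIO(SAMPLE_CSV))

print(tokenize_line("1.1", "waqala", overrides))
print(tokenize_line("1.2", "no override here", overrides))
```

Falling back to the whitespace tokenizer for any reference missing from the CSV would let a version adopt custom tokenization incrementally, though whether the real hook should behave that way is an open question.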

I got an initial proof-of-concept working today (see screenshot below) and will keep iterating on it to support "subword" tokens and punctuation across the Scaife Viewer stack (backend and frontend).

image