Currently, the `VTextTokenizer` first computes Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language dependent).
These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by `-`, but only the ASCII hyphen-minus, not other Unicode hyphen variants (see the sketch below).
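To illustrate the issue, here is a minimal sketch (not vtext's actual code) of that kind of post-segmentation merge rule, built on the `unicode-segmentation` crate; the function name `tokenize_with_hyphen_merge` is just for this example. Because the merge step compares against the literal ASCII `-`, a word joined by `U+2010 HYPHEN` stays split.

```rust
// Minimal sketch of a post-segmentation "merge on hyphen" rule, not vtext's
// actual implementation. Requires the `unicode-segmentation` crate.
use unicode_segmentation::UnicodeSegmentation;

/// Merge `word` `-` `word` triples back into one token, but (like the
/// current rule) only for the ASCII hyphen-minus, not e.g. U+2010 HYPHEN.
fn tokenize_with_hyphen_merge(text: &str) -> Vec<String> {
    let bounds: Vec<&str> = text.split_word_bounds().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bounds.len() {
        // Skip whitespace-only segments produced by the word-bounds iterator.
        if bounds[i].trim().is_empty() {
            i += 1;
            continue;
        }
        // ASCII-only merge rule: "word" "-" "word" -> "word-word".
        // A "\u{2010}" (HYPHEN) in place of "-" would not match here.
        if i + 2 < bounds.len()
            && bounds[i + 1] == "-"
            && !bounds[i + 2].trim().is_empty()
        {
            tokens.push(format!("{}-{}", bounds[i], bounds[i + 2]));
            i += 3;
        } else {
            tokens.push(bounds[i].to_string());
            i += 1;
        }
    }
    tokens
}

fn main() {
    // ASCII hyphen-minus: merged into a single token.
    println!("{:?}", tokenize_with_hyphen_merge("anti-virus"));
    // => ["anti-virus"]

    // U+2010 HYPHEN: the merge rule does not fire, the word stays split.
    println!("{:?}", tokenize_with_hyphen_merge("anti\u{2010}virus"));
    // => ["anti", "‐", "virus"]
}
```

One way to generalize this would be to accept any code point with the Unicode Dash/Hyphen property instead of the literal `"-"`.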