vespa-engine / vespa

AI + Data, online. https://vespa.ai

SentencePiece Tokenizer #18881

Closed: bmw-friedrich-mayr closed this issue 2 years ago

bmw-friedrich-mayr commented 3 years ago

It would be great if Vespa could provide support for the more elaborate SentencePiece tokenizer. I opened this feature request on the basis of this Slack thread.

The BertTokenizer is based on word pieces, which have several issues, especially for multilingual tokenization. Hence, most new models use SentencePiece. It would be great if Vespa could provide a SentencePiece tokenizer, so that the community can use recent Transformer models. We found a Java binding, https://github.com/levyfan/sentencepiece-jni, but as our team is new to Vespa and Java, we weren't able to implement it properly.

Thanks for your great support!!

ace-kay-law-neo commented 3 years ago

Thanks for opening this issue @bmw-friedrich-mayr. I also look forward to a SentencePiece tokenizer.

bratseth commented 3 years ago

There is an Apache 2.0 licensed Java implementation of SentencePiece here: http://docs.djl.ai/extensions/sentencepiece/index.html. It looks to me like you can use it in the same way as BertTokenizer - could you try that?
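Something along these lines should work - an untested sketch, assuming DJL's SpTokenizer API, with a placeholder model path:

```java
import ai.djl.sentencepiece.SpTokenizer;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SpTokenizeDemo {

    public static void main(String[] args) throws IOException {
        // Load a pretrained SentencePiece model from disk (placeholder path)
        Path model = Paths.get("model/en.wiki.bpe.vs10000.model");
        SpTokenizer tokenizer = new SpTokenizer(model);

        // Split the input text into SentencePiece tokens
        List<String> tokens = tokenizer.tokenize("hello world");
        System.out.println(tokens);
    }
}
```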

I'm going to look into making integration with this supported out of the box since it's such a common thing, but I don't think you need to wait for that.

janandreschweiger commented 3 years ago

Thanks @bratseth, I will try it and tell you if it worked out.

bmw-friedrich-mayr commented 3 years ago

Hey @bratseth, I heard from some colleagues that there were problems with gcc when building the implementation you suggested. As this feature is somewhat time-critical for us, may I kindly ask when Vespa will support the SentencePiece tokenizer? Thank you!!

bratseth commented 3 years ago

I'll start looking into this tomorrow.

bmw-friedrich-mayr commented 3 years ago

Awesome, thank you @bratseth.

bratseth commented 3 years ago

I've confirmed the problem, looking into options ...

bmw-friedrich-mayr commented 3 years ago

Many thanks for your effort @bratseth. A SentencePiece tokenizer is super important to us.

bratseth commented 3 years ago

Just an update: I implemented the SentencePiece tokenizer in pure Java. It's working fine, but I still have some work to do to package it up.

One question for you: SentencePiece seems to return tokenizations consistent with assigning every token in the model the same negative score, rather than the actual negative scores stored in the protobuf model, which leads it to always prefer the shortest list of tokens covering the text. Now I'm wondering whether I should be bug-compatible, or compatible with the intent of the paper and - as far as I can see - the code *)

As an example, if you use this model: https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model (from https://github.com/bheinzerling/bpemb) to tokenize "hel", do you expect the token(s) [▁hel] or [▁h, el]? The latter has the higher score (-82.0), while the former (-905.0) is what SentencePiece returns.
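To make the difference concrete, here is a sketch that scores both segmentations by summing per-token scores, the way the unigram model does. The individual scores for ▁h and el are made-up values that sum to the -82.0 total; the real scores live in the model protobuf:

```java
import java.util.List;
import java.util.Map;

public class SegmentationScore {

    // Hypothetical per-token scores; real values come from the .model protobuf.
    // Only the -905.0 for ▁hel and the -82.0 total are from the example above.
    static final Map<String, Double> SCORES = Map.of(
            "▁hel", -905.0,
            "▁h", -40.0,
            "el", -42.0);

    // Score of a segmentation = sum of the scores of its tokens
    static double score(List<String> segmentation) {
        return segmentation.stream().mapToDouble(SCORES::get).sum();
    }

    public static void main(String[] args) {
        System.out.println(score(List.of("▁hel")));     // -905.0: fewest tokens wins
        System.out.println(score(List.of("▁h", "el"))); // -82.0: highest score wins
    }
}
```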

I'll provide an option for this, but I want to set the right default.

*) https://github.com/google/sentencepiece/blob/fab966ad218c6d3449f7ebf088c8b891afbabec2/src/unigram_model.cc#L908

bmw-friedrich-mayr commented 3 years ago

Wow, thank you @bratseth, that was fast. To be honest, I've only recently started looking at the SentencePiece tokenizer; maybe someone else here can help you. I will also ask some of my colleagues.

bratseth commented 3 years ago

It will be available as an injectable and configurable component in an upcoming Vespa release but probably not before Thursday.
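Configuration in services.xml will look roughly like this - a sketch, where the component id and model path are placeholders and the exact class and config names may differ from what finally ships:

```xml
<component id="mySentencePiece"
           class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
           bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
        <model>
            <item>
                <language>unknown</language>
                <path>model/en.wiki.bpe.vs10000.model</path>
            </item>
        </model>
    </config>
</component>
```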

bratseth commented 2 years ago

Documentation: https://docs.vespa.ai/en/embedding.html