jobergum opened 1 year ago
This has now been supported for some time via https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder
HuggingFaceTokenizer (and SentencePieceEmbedder) implements Segmenter, so it can (in theory) be used to segment query text. But to be used on the indexing side it also needs to implement Tokenizer (or, alternatively, extend to_string as suggested above).
Also, to use this in practice we need a LinguisticsComponent that returns HuggingFaceTokenizer when asked for linguistics implementations.
I don't think this is usable before those two (simple) things are done, or am I missing something?
Haha, you are obviously right. Sorry, I got confused for a minute there.
For linguistic processing, such as tokenization and stemming, Vespa integrates with Apache OpenNLP. The downside is that only a limited set of languages is supported.
One way to expand language support is to use a language-independent tokenizer such as SentencePiece and index the token ids.
We could:
Option 1 is straightforward; option 2 could re-use the embedding functionality and indexing language converters.
On the document side, it could look like this, using indexing language converters:
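A hypothetical sketch of such a schema (the field and embedder names here are assumptions, not part of the original comment):

```
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    # Hypothetical: run the tokenizer embedder over the text and
    # index its output as token ids in a string field
    field tokens type string {
        indexing: input text | embed tokenizer | index
    }
}
```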
Note that the above fails because the embed indexing-language (IL) function expects the field type to be a tensor.
On the query side it is less clear, but one would want to be able to search both the original string text and the token vocabulary ids.
The missing piece is how to convert the tensor query(tokens) to a string and feed it into the query tree for retrieval.
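As a minimal sketch of that missing piece, a query-side component could flatten the token ids into a string of searchable terms before building the query tree. Everything here (the class name, the space-joined format) is a hypothetical illustration, not an existing Vespa API:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper: turns the values of query(tokens) into a string
// of vocabulary ids that can be matched against the indexed token ids.
public class TokenIdFormatter {

    // Join token ids with spaces so each id becomes a searchable term
    public static String toQueryTerms(long[] tokenIds) {
        return Arrays.stream(tokenIds)
                .mapToObj(Long::toString)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        long[] tokenIds = {101, 2054, 2003, 102};
        System.out.println(toQueryTerms(tokenIds)); // prints "101 2054 2003 102"
    }
}
```

In a real deployment this logic would live in a Searcher that reads the tensor from the query and rewrites the query tree, but that wiring is beyond this sketch.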