vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Ability to use sentencepiece tokenization as linguistic implementation #27039

Open jobergum opened 1 year ago

jobergum commented 1 year ago

For linguistic processing, such as tokenization and stemming, Vespa integrates with Apache OpenNLP. The downside is that only a limited set of languages is supported.

One way to expand language support is to use a language-independent tokenizer such as SentencePiece and index the token ids.
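As a rough illustration (not verified against the current API; the Context destination string and the injected embedder are assumptions), producing token ids through the com.yahoo.language.process.Embedder interface could look something like this:

import com.yahoo.language.process.Embedder;
import java.util.List;

public class TokenIdExample {

    // 'embedder' would be the SentencePieceEmbedder component configured
    // in services.xml, injected by the container; constructing one by hand
    // requires its config, which is omitted here.
    static List<Integer> tokenIds(Embedder embedder, String text) {
        // Returns the ids of the tokens in the SentencePiece vocabulary
        return embedder.embed(text, new Embedder.Context("doc.title_tokens"));
    }
}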

We could:

Option 1 is straightforward; option 2 could re-use the embedder functionality and the indexing language converters.

<container version="1.0">
  <component id="spiece"
             class="com.yahoo.language.sentencepiece.SentencePieceEmbedder"
             bundle="linguistics-components">
    <config name="language.sentencepiece.sentence-piece">
      <model>
        <item>
          <language>unknown</language>
          <path>model/en.wiki.bpe.vs10000.model</path>
        </item>
      </model>
    </config>
  </component>
</container>

On the document side, it could look like this, using indexing language converters:

field title_tokens type string {
    indexing: (input title || "") | embed spiece | to_string | summary
}

Note that the above fails, as the embed IL function expects the field type to be a tensor.
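For contrast, embedding into a tensor field does work today; a minimal sketch, assuming a fixed-size indexed tensor of token ids (the field name and the size 128 are arbitrary):

field title_tokens type tensor<float>(x[128]) {
    indexing: (input title || "") | embed spiece | attribute | summary
}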

On the query side it is less clear, but one would want to be able to search both the original string text and the token vocabulary ids.

{
    "query": "foo bar",
    "yql": "select * from doc where userQuery()",
    "input.query(tokens)": "embed(spiece, foo bar)"
}
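For the query tensor to be accepted, query(tokens) would also need to be declared as an input in the rank profile; a sketch, assuming a mapped tensor of token ids:

rank-profile default {
    inputs {
        query(tokens) tensor<float>(x{})
    }
}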

The missing piece is how to express the conversion of the tensor query(tokens) back to strings and into the query tree for retrieval.

jobergum commented 4 months ago

This has been supported for some time now via the huggingface-tokenizer embedder: https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder
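For reference, that embedder is set up as a component in services.xml; a sketch along the lines of the reference page (the model path is a placeholder):

<container version="1.0">
  <component id="tokenizer" type="hf-tokenizer">
    <model path="models/tokenizer.json"/>
  </component>
</container>

Fields and queries can then use embed tokenizer to produce the token ids as a tensor.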

bratseth commented 4 months ago

HuggingFaceTokenizer (and SentencePieceEmbedder) implements Segmenter, so it can (in theory) be used to segment query text. But to be used on the indexing side it also needs to implement Tokenizer (or, alternatively, extend to_string as suggested above).

Also, to use this in practice we need a LinguisticsComponent that returns HuggingFaceTokenizer when asked for linguistics implementations.
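A rough sketch of what such a component could look like (the class is hypothetical, and extending SimpleLinguistics is just for illustration):

import com.yahoo.language.process.Tokenizer;
import com.yahoo.language.simple.SimpleLinguistics;

// Hypothetical: a Linguistics component that returns a HuggingFace-based
// tokenizer. This assumes HuggingFaceTokenizer would also implement
// com.yahoo.language.process.Tokenizer, which is exactly the missing
// piece discussed above.
public class HuggingFaceLinguistics extends SimpleLinguistics {

    private final Tokenizer tokenizer;

    public HuggingFaceLinguistics(Tokenizer huggingFaceTokenizer) {
        this.tokenizer = huggingFaceTokenizer;
    }

    @Override
    public Tokenizer getTokenizer() { return tokenizer; }
}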

I don't think this is usable before those two (simple) things are done, or am I missing something?

jobergum commented 4 months ago

Haha, you are obviously right. Sorry, I got confused for a minute there.