opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[RFC] Model-based Tokenizer for Text Chunking #794

Open yuye-aws opened 2 weeks ago

yuye-aws commented 2 weeks ago

Since OpenSearch 2.13, the fixed token length algorithm has been available in the text chunking processor. With the fixed token length algorithm, users can specify a token limit for each chunked passage. A common use case for the text chunking processor is to chain it with a text embedding processor. With the text chunking processor, users can circumvent the information loss caused by truncation in downstream text embedding models.

As of OpenSearch 2.15, the fixed token length algorithm only supports word tokenizers. Text embedding models truncate long texts that exceed the token limit measured by their own tokenizers. Given the disparity between word tokenizers and model-based tokenizers, it is hard for users to assign an appropriate value to the token_limit parameter. We are initiating this RFC to solicit feedback from the community on whether and how to implement a model-based tokenizer for the fixed token length algorithm.

Introduction

Tokenization is the process of segmenting a string into a list of individual tokens. Prior to text embedding, language models perform tokenization on the input texts. Each language model has its own model-based tokenizer.

Tokenization results vary across tokenizers. We showcase the difference between word tokenizers and model-based tokenizers with a simple example. The same input string is tokenized with the standard tokenizer and with the tokenizer from the model sentence-transformers/msmarco-distilbert-base-tas-b.

// input 
"It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

// standard tokenizer
['It’s', 'fun', 'to', 'contribute', 'a', 'brand', 'new', 'PR', 'or', '2', 'to', 'OpenSearch']

// sentence-transformers/msmarco-distilbert-base-tas-b
['[CLS]', 'it', '’', 's', 'fun', 'to', 'contribute', 'a', 'brand', '-', 'new', 'pr', 'or', '2', 'to', 'opens', '##ear', '##ch', '!', '[SEP]']

where [CLS] marks the beginning of a sentence and [SEP] separates sentences. As the example above shows, the tokens returned by the standard tokenizer and the model-based tokenizer are quite different: the standard tokenizer returns 12 tokens, while the model-based tokenizer returns 20 tokens.
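
For readers who want to reproduce this comparison locally, here is a minimal Python sketch. It assumes the Hugging Face transformers library is installed and approximates the OpenSearch standard tokenizer with a simple regular expression; it is for illustration only and is not part of the plugin.

import re
from transformers import AutoTokenizer

text = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

# Rough approximation of the OpenSearch standard tokenizer (word-level).
standard_tokens = re.findall(r"[\w'’]+", text)
print(len(standard_tokens), standard_tokens)   # 12 tokens

# Model-based (WordPiece) tokenizer of msmarco-distilbert-base-tas-b.
tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)
ids = tokenizer.encode(text)                   # adds [CLS] and [SEP] by default
model_tokens = tokenizer.convert_ids_to_tokens(ids)
print(len(model_tokens), model_tokens)         # 20 tokens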

In our first release, we can start with the tokenizers for OpenSearch-provided pretrained models. These models usually do not share the same vocabulary and tokenizer, so we need to support a dedicated tokenizer for each of the following models.

Sentence transformers

  1. huggingface/sentence-transformers/all-distilroberta-v1
  2. huggingface/sentence-transformers/all-MiniLM-L6-v2
  3. huggingface/sentence-transformers/all-MiniLM-L12-v2
  4. huggingface/sentence-transformers/all-mpnet-base-v2
  5. huggingface/sentence-transformers/msmarco-distilbert-base-tas-b
  6. huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
  7. huggingface/sentence-transformers/multi-qa-mpnet-base-dot-v1
  8. huggingface/sentence-transformers/paraphrase-MiniLM-L3-v2
  9. huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  10. huggingface/sentence-transformers/paraphrase-mpnet-base-v2
  11. huggingface/sentence-transformers/distiluse-base-multilingual-cased-v1

Sparse encoding models

  1. amazon/neural-sparse/opensearch-neural-sparse-encoding-v1
  2. amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1
  3. amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1

Cross-encoder models

  1. huggingface/cross-encoders/ms-marco-MiniLM-L-6-v2
  2. huggingface/cross-encoders/ms-marco-MiniLM-L-12-v2

Pros and cons

Here are the pros and cons for model-based tokenizers:

Pros

  1. Enables users to chunk their documents precisely according to the truncation limit of the downstream text embedding model.
  2. Model-based tokenizers are free from the maximum token count limit of the word tokenizer, which defaults to 10000.

Cons

  1. Unlike the word tokenizer, which returns the start and end offsets of every token, a model-based tokenizer only returns a list of tokens. As the example above shows, a model-based tokenizer can modify the original input, and users may be confused by the content change.
  2. A model-based tokenizer may generate new characters. For example, the word OpenSearch is tokenized into ['opens', '##ear', '##ch']. It is unclear how to reformat these tokens into human-readable text (see the sketch after this list).
  3. May confuse users about how to specify a word tokenizer versus a model-based tokenizer in the API.
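
The following Python sketch, which assumes the Hugging Face transformers library, illustrates cons 1 and 2 concretely; it is for illustration only and is not part of the plugin. Joining the subword tokens back together does not restore the original text.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

tokens = tokenizer.tokenize("OpenSearch")
print(tokens)                                      # ['opens', '##ear', '##ch']

# Naively detokenizing a chunk changes the content: case is lost and the
# subword markers are merged, so the chunk no longer matches the source text.
print(tokenizer.convert_tokens_to_string(tokens))  # 'opensearch'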

API

There are a few options for using a model-based tokenizer in the fixed token length algorithm. Please note that we can support more than one of them; for example, we can implement one option in the first release and support another option later.

Option 1

Specify the tokenizer with the pretrained model name.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}
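
For reference, here is a minimal Python sketch of how the fixed token length algorithm could count tokens with the model-based tokenizer specified above (token_limit of 10, overlap_rate of 0.2). It assumes the Hugging Face transformers library and that the overlap semantics mirror the existing word-tokenizer behavior; it is not the plugin's actual implementation, which is written in Java.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

def chunk_by_model_tokens(text, token_limit=10, overlap_rate=0.2):
    tokens = tokenizer.tokenize(text)              # subword tokens, no [CLS]/[SEP]
    step = max(1, int(token_limit * (1 - overlap_rate)))
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + token_limit]
        chunks.append(tokenizer.convert_tokens_to_string(window))
        if start + token_limit >= len(tokens):
            break
    return chunks

print(chunk_by_model_tokens("It’s fun to contribute a brand-new PR or 2 to OpenSearch!"))

Note that each chunk here is reconstructed text rather than an offset into the original document, which is exactly the concern raised in the Pros and cons section above.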

Pros

  1. Users can use the model-based tokenizer without deploying the text embedding model.

Cons

  1. Does not support tokenizers from user-uploaded models.
  2. Hard for users to remember the full model name.
  3. The tokenizer is not reusable across pipelines.

Option 2

After deploying a text embedding model, users can assign its model ID as the tokenizer.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <model_id for pretrained models>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Supports tokenization for any deployed model.

Cons

  1. Users need to deploy the text embedding model.
  2. Introduces a new parameter named tokenizer_model_id. We need to consider how it interacts with the existing tokenizer parameter.
  3. Need to handle invalid models, such as text-image embedding models.

Option 3

Unlike text embedding models, a tokenizer only needs files such as tokenizer.json, tokenizer_config.json, and vocab.txt. Following the behavior of model registration in the ml-commons plugin, users could register a tokenizer without the model weights.

POST /_plugins/_ml/tokenizers/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
}

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <tokenizer_id for pretrained tokenizers>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Users can register a tokenizer on its own, which saves the time and space required for full model deployment.
  2. After deploying the tokenizer, users can reuse it across different ingestion pipelines.

Cons

  1. Introduces a new API for tokenizer registration, which may carry some security risk.
  2. If a user needs both the tokenizer and the embedding model, there would be some duplication.

Open questions

  1. What are other options to make a model-based tokenizer available without deploying the text embedding model?