opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[FEATURE] Fail documents whose embedding field is larger than the token limit of an embedding model #2466

Open dtaivpp opened 2 months ago

dtaivpp commented 2 months ago

Is your feature request related to a problem? Yes! I just had a discussion with @ylwu-amzn where we were discussing how documents are embedded. I (and many others I have talked to) was under the impression that when you send a document larger than the token input limit of a model, there is something like pooling going on under the hood. This turns out not to be the case, however: documents larger than the token limit are simply truncated.

What solution would you like? There should be a flag to enable/disable document truncation. Transparently truncating data as it's being embedded has catastrophic consequences: documents that are over the limit may simply never be returned, depending on where the document was truncated.

This should probably be configurable via the ML Commons settings. We may also want to offer pooling as an alternative. For example:

PUT _cluster/settings
{
  "persistent": {
    "plugins": {
      "ml_commons": {
        "embedding_auto_truncation": "true",
        "embedding_pooling": "false"
      }
    }
  }
}

What alternatives have you considered? I am not sure how else we can protect people who don't realize that embedding models have a maximum input token length.


ylwu-amzn commented 2 months ago

@dtaivpp Thanks David, we will do some research and enhance this part.

ylwu-amzn commented 1 month ago

@dtaivpp I think our documentation already shows whether each pre-trained model does auto-truncation or not: https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/ . Do you suggest disabling auto-truncation for all pre-trained models by default, and then asking customers to enable it manually through a setting? That seems to add more steps for customers who already know this and want auto-truncation.

@dylan-tong-aws , Dylan, do you have any suggestions?

dtaivpp commented 1 month ago

In talking to several customers who are new to OpenSearch, I found they never came across this documentation. They simply attempted to ingest documents into OpenSearch, not realizing that their text was being truncated. This caused them huge relevance issues (they were ingesting documents several thousand tokens long).

I've come across this already with at least 10 people in the last few months (myself included). I feel it would be better to disable auto-truncation and raise an error that points to the setting for enabling it. This way users know the setting is there and can understand that their documents need to be chunked.
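Purely as an illustration of the proposed behavior (nothing like this exists today; the field name, token limit, and error shape are made up), the rejected ingest could return something along these lines, pointing the user at the proposed setting:

{
  "error": {
    "type": "illegal_argument_exception",
    "reason": "Field [passage_text] exceeds the embedding model token limit (512 tokens). Chunk the document or set plugins.ml_commons.embedding_auto_truncation to true."
  },
  "status": 400
}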

ylwu-amzn commented 1 month ago

@dtaivpp Do you know which document people read to learn how to use the model? The easiest fix would be to add a big warning to the top of that doc.

Disabling auto-truncation will need a code change. Actually, I have some concern that it may make other users unhappy, since they would now need an extra step to enable auto-truncation. How about we add a warning to the documentation first, since that's easy? I'd prefer to check other people's opinions before disabling auto-truncation.

dtaivpp commented 1 month ago

I think the key here is that people experienced with vector search will expect there is an auto-truncation setting they may need to enable. People inexperienced with vector search will not know there is an auto-truncation setting they need to disable.

We need to help the people who have no experience learn. The problem with updating documentation is that we would need to ensure all documentation, for the rest of time, carries this warning. Any new documentation around vector search would need it. New tutorials too. Then we would need to be concerned that every blog post showing vector search also includes the warning. That is unsustainable.

I agree that this will cause some irritation for users already relying on the current behavior, but that will be a one-time pain. We can get ahead of it by publishing some content alongside the new error/warning that explains why we made the change.

dtaivpp commented 1 month ago

Actually, now that I think about it, there are 2 other alternatives to auto-truncation, but they would both create a fair amount of work and communication.

  1. Auto-chunking - we could automatically create/chunk documents into nested fields.
  2. Pooling embeddings - this is my least favorite, as it could be just as deceiving as auto-truncation, but it would at least ensure the entire document was represented.

When is the next ML triage meeting? I'd like to hear some other opinions on this as well.

ylwu-amzn commented 1 month ago

Agreed that it's better for users to have auto-chunking. We have released text chunking: https://opensearch.org/docs/latest/search-plugins/text-chunking/
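For reference, the chunking step from that documentation is configured as an ingest pipeline along these lines (the pipeline name, field names, and token limit below are placeholders):

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}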

We have a bi-weekly ML triage meeting. We just had one this Tuesday, so the next one will be the week after next.

yuye-aws commented 1 month ago

Hi @dtaivpp @ylwu-amzn ! Chunking is a good solution if the user's text is mostly around 2x the model token limit. We currently support text chunking as an ingest processor. With a configured pipeline, text chunking algorithms are applied to the input field upon ingestion. Can you elaborate more on auto-chunking?
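To show what this looks like end to end, chunking can be chained with embedding in a single pipeline, roughly as described in the text chunking documentation (the model_id and field names here are placeholders):

PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<your text embedding model ID>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}

Each chunk then gets its own embedding, so the whole document is represented rather than just the first N tokens.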

dhrubo-os commented 1 month ago

@dtaivpp if you want to use a model that does not perform any auto-truncation, you can use huggingface/sentence-transformers/msmarco-distilbert-base-tas-b at version 1.0.1. We previously released version 1.0.1, and customers wanted auto-truncation on that model too, so we bumped it to version 1.0.2, which now has auto-truncation.

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.1",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT"
}

Please let us know if customers want any specific model with the truncation feature disabled.

ylwu-amzn commented 1 month ago

I remember we had a PoC to allow customers to specify auto_truncate in the model config. Will that help, @dtaivpp? Customers could then control whether they want auto-truncation or not.
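Purely illustrative (this API does not exist today; the auto_truncate flag and its placement in model_config are hypothetical), the PoC idea might look something like:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "auto_truncate": false
  }
}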

dtaivpp commented 1 month ago

@ylwu-amzn and @dhrubo-os the trouble here is not whether it's configurable, it's about user understanding. If someone is new to vector search, this is a really easy way to shoot yourself in the foot and never know what went wrong. I've talked to several people already who have completely abandoned vector search because they thought the relevancy was bad. It turned out their documents were just being truncated because they didn't understand how vectorization worked.

ylwu-amzn commented 1 month ago

The easiest way for now is to clarify this in the documentation. Created an issue: https://github.com/opensearch-project/documentation-website/issues/7365