Open dtaivpp opened 2 months ago
@dtaivpp Thanks David, we will do some research and enhance this part.
@dtaivpp I think our documentation already shows whether each pre-trained model does auto-truncation or not: https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/. Do you suggest disabling auto-truncation for all pre-trained models by default, then asking customers to enable it manually through a setting? That seems to add an extra step for customers who already know this and want auto-truncation.
@dylan-tong-aws , Dylan, do you have any suggestion?
In talking to several customers new to OpenSearch they never came across this documentation. They simply attempted to ingest documents into OpenSearch not realizing that it was truncating their text. This caused them huge relevance issues (they were ingesting documents several thousand tokens long).
I've come across this already with at least 10 people in the last few months (myself included). I feel it would be better to disable auto-truncation and error with a reference to the setting to enable auto-truncation. This way users know the setting is there and can understand their documents need to be chunked.
@dtaivpp Do you know which document people read to learn how to use the model? The easiest fix is to add a big warning to the top of that doc.
Disabling auto-truncation will need a code change. I also have some concern that it may make other users unhappy, since they would now need an extra step to enable auto-truncation. How about we add a warning to the documentation first, since that's easy? I'd prefer to check other people's opinions before disabling auto-truncation.
I think the key here is that people experienced with vector search will expect there is an auto-truncation setting they may need to enable. People inexperienced with vector search will not know there is an auto-truncation setting they need to disable.
We need to help the people who have no experience learn. The problem with only updating the documentation is that we would need to ensure every piece of documentation, from now on, carries this warning: any new documentation around vector search, every new tutorial. Then we would also need every user who writes a blog showing vector search to include the warning. That is unsustainable.
I agree that this will cause some irritation for users already relying on auto-truncation, but that is a one-time pain. We can get in front of it by writing some content, referenced from the warning that will be thrown, to explain why we made the change.
Actually now that I think about it there are 2 other alternatives to auto-truncation but they both would create a fair amount of work and communication.
When is the next ML triage meeting? I'd like to hear some other opinions on this as well.
Agreed that it's better for users to use automatic chunking; we have released text chunking: https://opensearch.org/docs/latest/search-plugins/text-chunking/
We have bi-weekly ML triage meeting. We just had one this Tuesday. So next one will be the week after next.
Hi @dtaivpp @ylwu-amzn! Chunking is a good solution when the user's text is mostly around 2x the model's token limit. We currently support text chunking as an ingest processor: with a configured pipeline, the text chunking algorithm is applied to the input field upon ingestion. Can you elaborate more on auto chunking?
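For reference, a minimal text-chunking ingest pipeline looks roughly like this (the field names and the 384-token limit are placeholder values; see the text-chunking docs linked above for the full set of options):

```json
PUT /_ingest/pipeline/text-chunking-pipeline
{
  "description": "Split long text into chunks that fit the embedding model's token limit",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunks"
        }
      }
    }
  ]
}
```

Documents ingested through this pipeline get `passage_text` split into an array of chunks in `passage_chunks`, which can then be embedded chunk by chunk instead of being silently truncated.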
@dtaivpp if you want to use a model that does not perform any auto-truncation, you can use `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b` at version `1.0.1`. We previously released the `1.0.1` version, and customers wanted auto-truncation on that model too, so we bumped it to version `1.0.2`, which now has auto-truncation.
```json
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.1",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT"
}
```
Please let us know if customers want any specific model with the truncation feature disabled.
I remember we had a PoC to allow customers to specify `auto_truncate` in the model config. Will that help, @dtaivpp? Customers could control whether they want auto-truncation or not.
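If that PoC landed, registration might look something like the sketch below. The `auto_truncate` field is hypothetical, taken only from the PoC mentioned above; it is not a released ML Commons parameter:

```json
POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "auto_truncate": false
  }
}
```

With `auto_truncate` set to `false`, an embedding request exceeding the model's token limit would presumably fail instead of silently dropping text.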
@ylwu-amzn and @dhrubo-os the trouble here is not whether it's configurable, it's about user understanding. If someone is new to vector search, this is a really easy way to shoot yourself in the foot and never know what went wrong. I've talked to several people already who completely abandoned vector search because they thought the relevancy was bad. It turns out their documents were just truncated because they didn't understand how vectorization worked.
The easiest way for now is clarifying this in the documentation. Created an issue: https://github.com/opensearch-project/documentation-website/issues/7365
Is your feature request related to a problem? Yes! I just had a discussion with @ylwu-amzn where we were discussing how documents are embedded. I (and many others I have talked to) were under the impression that when you send a document larger than the token input of a model there was something like pooling going on under the hood. This seems to not be the case however and documents larger than the token limit are simply truncated.
What solution would you like? There should be a flag to enable/disable document truncation. Transparently truncating data as it's being embedded has catastrophic consequences. Documents that are over the limit may simply never be returned depending on where the document was truncated.
This should probably be configurable via the ML Commons settings. We may also want to enable pooling as an alternative.
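As a sketch only: such a flag might live in cluster settings, something like the following. The setting name `plugins.ml_commons.auto_truncate_enabled` is invented here for illustration; it does not exist in ML Commons today:

```json
PUT /_cluster/settings
{
  "persistent": {
    "plugins.ml_commons.auto_truncate_enabled": false
  }
}
```

With the flag off, an embedding request over the model's token limit would fail with an error pointing at this setting, so users learn about truncation instead of silently losing text.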
What alternatives have you considered? I am not sure how else we can protect people from misunderstanding what happens when input exceeds an embedding model's maximum token length.