Closed b4sjoo closed 1 year ago
CCI
so that interns can pick this task.
This happens because the value of `truncation` is null in the `tokenizer.json` of the `sentence-transformers/msmarco-distilbert-base-tas-b` model. To solve this bug, we need to update this method, specifically here: when we save the `tokenizer.json` in our file system, we need to check the value of the `truncation` field. If it is null, then we need to inject a value into the `truncation` field.
```json
"truncation": {
  "direction": "Right",
  "max_length": 128,
  "strategy": "LongestFirst",
  "stride": 0
}
```
`max_length` will be dynamic depending on the model, but the other values can be static. We inject this JSON structure into the `truncation` field of the `tokenizer.json` file and then save the file.
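A minimal Python sketch of that check, just to illustrate the idea (the helper name `add_truncation_if_missing` is hypothetical, not the actual opensearch-py-ml code):

```python
import json

def add_truncation_if_missing(tokenizer_json: str, max_length: int) -> dict:
    """Parse a tokenizer.json payload and inject a default truncation
    block only when the existing value is null."""
    config = json.loads(tokenizer_json)
    if config.get("truncation") is None:
        config["truncation"] = {
            "direction": "Right",
            "max_length": max_length,  # model-dependent
            "strategy": "LongestFirst",
            "stride": 0,
        }
    return config

# Example: a tokenizer.json whose truncation field is null gets patched,
# while an existing truncation block is left untouched.
patched = add_truncation_if_missing('{"truncation": null}', 128)
```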
This PR has been merged to update the documentation. Does that mean this issue is already resolved?
No, for now we only updated the documentation. But we want to solve the bug on our end as well, so this issue will be closed when we solve the bug.
@dhrubo-os Thank you for assigning. Working on it!
Hey, I think this issue is no longer reproducible. Text longer than the threshold is processed as expected. Steps to reproduce:
POST /_plugins/_ml/models/_upload
{
"name": "sentence-transformers/msmarco-distilbert-base-tas-b",
"version": "1.0.1",
"description": "This is a port of the DistilBert TAS-B Model to sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and is optimized for the task of semantic search.",
"model_task_type": "TEXT_EMBEDDING",
"model_format": "TORCH_SCRIPT",
"model_content_size_in_bytes": 266352827,
"model_content_hash_value": "acdc81b652b83121f914c5912ae27c0fca8fabf270e6f191ace6979a19830413",
"model_config": {
"model_type": "distilbert",
"embedding_dimension": 768,
"framework_type": "sentence_transformers",
"all_config": "{\"_name_or_path\":\"old_models/msmarco-distilbert-base-tas-b/0_Transformer\",\"activation\":\"gelu\",\"architectures\":[\"DistilBertModel\"],\"attention_dropout\":0.1,\"dim\":768,\"dropout\":0.1,\"hidden_dim\":3072,\"initializer_range\":0.02,\"max_position_embeddings\":512,\"model_type\":\"distilbert\",\"n_heads\":12,\"n_layers\":6,\"pad_token_id\":0,\"qa_dropout\":0.1,\"seq_classif_dropout\":0.2,\"sinusoidal_pos_embds\":false,\"tie_weights_\":true,\"transformers_version\":\"4.7.0\",\"vocab_size\":30522}"
},
"created_time": 1676073973126,
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
GET /_plugins/_ml/tasks/<task_id>
POST /_plugins/_ml/models/<model_id>/_load
POST /_plugins/_ml/_predict/text_embedding/<model_id>
{
"text_docs":[ "AsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsa
dsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadadAsdsadsadsadasdsadad"],
"return_number": true,
"target_response": ["sentence_embedding"]
}
Hi, thanks for looking at it. In your provided example, the whole string is treated as one token because there's no separator (space) between words. I would suggest trying with more than 500 words and letting me know what you see.
Hi, @dhrubo-os .
After the call with you, I don't have any problems with importing. However, the Jupyter kernel dies when the following line runs:

```python
model_path = pre_trained_model.save_as_pt(model_id = "sentence-transformers/msmarco-distilbert-base-tas-b", sentences=["for example providing a small sentence", "we can add multiple sentences"])
```
I just want to download the model and upload it to the server, but it's not possible to execute the `save_as_pt` function. What could be the cause? I was planning to restore my machine. Could that help?
What problem were you facing? Can you please share the error log with me?
The kernel dies when the last line runs.
Do you also have the same file path, /Volumes/workplace/upload_content/? You might need to change this based on your file path.
Are you saying when you run this code block your kernel starts showing this message?
Exactly. I also created the folder /Volumes/workplace/upload_content/, but it doesn't work.
Is there another way of uploading model to the server?
`save_as_pt` doesn't upload the model to the server. `save_as_pt` only traces the model into TorchScript format. Then we use the `register_model` function to upload the model to the OpenSearch cluster.
So we need to check why `save_as_pt` is not working, since without this function we can't upload the newly traced model (with your truncation parameter added to the tokenizer.json file). I'm wondering if you are facing an out-of-memory issue on your end. Nobody has ever faced a kernel-restart issue running this function, so I'm wondering whether your computer has enough memory. Did you check whether you can run other functions without any issue?
Hi, the problem was indeed memory. Thank you @dhrubo-os!
When I upload the saved TorchScript model to OpenSearch, I see the following error. What could be the reason?
Can you please enable these settings? From the Dashboards Dev Tools you can run this:

```
PUT _cluster/settings
{
  "persistent" : {
    "plugins.ml_commons.only_run_on_ml_node" : false,
    "plugins.ml_commons.native_memory_threshold" : 100,
    "plugins.ml_commons.max_model_on_node": 20,
    "plugins.ml_commons.enable_inhouse_python_model": true
  }
}
```
@Yerzhaisang, were you able to solve this issue?
Hi @dhrubo-os,
I was able to trace the model with TorchScript (see screenshot) and test the model using its `model-id`.
Question: to check my fix in the `save_as_pt` function, should I run the Jupyter notebook in the `opensearch-py-ml` folder? Should I remove the `opensearch-py` and `opensearch-py-ml` Python modules?
Awesome. You can run the notebook in the root folder of opensearch-py-ml. What do you mean by removing the `opensearch-py` and `opensearch-py-ml` Python modules? You need to import those modules.
Hey @dhrubo-os ,
I noticed one thing: when I run the Jupyter notebook outside the root folder, the model is traced successfully. However, when I run it inside the root folder, I see the following error:
I registered a model group and added its id to the config.json. Is this error familiar to you?
Please check which OpenSearch version you are using. Let's use 2.7 to make this implementation easier. In 2.8 we introduced model groups for access control, which can be a bit confusing. So please make sure your OpenSearch version is 2.7 so that you don't face any model-group-related issues.
Hey @dhrubo-os, where can I get a docker-compose.yml for version 2.7?
In your current docker-compose.yml, you can replace `latest` with `2.7.0`. That should work. Docker might complain about downgrading the version; in that case you might need to delete the Docker containers and volumes. Try a fresh installation after removing all the Docker containers.
@dhrubo-os, we made sure this issue is fixed: https://github.com/Yerzhaisang/opensearch-py-ml/commit/7c6b1b3581135e097204c22b8ffee5b57b5d6d86
However, can we discuss how to make the `truncation` parameter dynamic for all models?
When we have the `model` object from line 749 (`model = SentenceTransformer(model_id)`), we can use it to get the max sequence length like this: `print("Max Seq length:", model.tokenizer.model_max_length)`. Please let me know if you have any questions.
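As a runnable sketch of that attribute lookup (the SentenceTransformer lines are commented out because they download the model; the stand-in classes below are hypothetical and only illustrate the access pattern):

```python
# The real lookup (assumes sentence-transformers is installed and will
# download the model on first use):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
#   print("Max Seq length:", model.tokenizer.model_max_length)

# The same attribute access shown with lightweight stand-in objects,
# so the pattern runs without downloading anything:
class FakeTokenizer:
    # 512 matches max_position_embeddings in the DistilBERT config above
    model_max_length = 512

class FakeModel:
    tokenizer = FakeTokenizer()

model = FakeModel()
max_length = model.tokenizer.model_max_length
print("Max Seq length:", max_length)
```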
OK, let's say we made `max_length` dynamic. But what about the other parameters like `direction`, `strategy`, and `stride`?
Keep those values static:
`"direction": "Right", "strategy": "LongestFirst", "stride": 0`
Remember, if `truncation` already exists in the tokenizer.json file then we won't update the value. We will update the value only if `truncation` is null.
Got it, I will check my fix on all models to make sure `truncation` is unchanged when it's not null.
Hey @dhrubo-os ,
Can you look at https://docs.google.com/spreadsheets/d/1pCK0GJLvQfcmQ31475_IQTJzxueVUDJ0W3kkMxLbGkM/edit?usp=sharing? Many models have the same truncation parameters. Is it ok?
Can you look at https://github.com/Yerzhaisang/opensearch-py-ml/commit/03b281ff2b8e47b4fc6c8ce2b6f06ddc3c8f4c70? I added a null check for the `truncation` parameter on line 771. I also tested my implementation; tas-b model inference works well with length > 1000. The issue is fixed.
Cool, looks good. Please raise a PR with unit test. Thanks.
@dhrubo-os , I added unit test in https://github.com/Yerzhaisang/opensearch-py-ml/commit/8a55e7b42d05e849c90bdd569998ba63f5ff59fd.
Can you look at that please? Do you understand why one test failed? Thanks.
Why not raise a PR to our repo? codecov/patch is showing that your code has less test coverage. Looking at the test code, I think we should assert the max-length value of the tokenization to make sure we are updating it accordingly. Currently you are just asserting that the truncation value is null, which doesn't quite reflect the effect of your code change.
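For example, a test along these lines would assert the injected value rather than only null-ness (the `_inject` helper here is a stand-in for the fix under review, not the actual opensearch-py-ml code):

```python
import unittest

class TestTruncationInjection(unittest.TestCase):
    """Hypothetical sketch: assert the injected max_length, not just
    that the original truncation field was null."""

    def _inject(self, config, max_length):
        # Stand-in for the fix under review.
        if config.get("truncation") is None:
            config["truncation"] = {
                "direction": "Right",
                "max_length": max_length,
                "strategy": "LongestFirst",
                "stride": 0,
            }
        return config

    def test_max_length_is_set(self):
        patched = self._inject({"truncation": None}, 512)
        self.assertEqual(patched["truncation"]["max_length"], 512)

    def test_existing_truncation_untouched(self):
        patched = self._inject({"truncation": {"max_length": 256}}, 512)
        self.assertEqual(patched["truncation"]["max_length"], 256)
```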
What is the bug?
Our pre-trained tas-b model won't accept a doc with a token length exceeding 512.
How can one reproduce the bug?
Simply upload our tas-b model (in ONNX or torch_script form) into an OpenSearch cluster and use it to embed a long doc; it will return an error.
What is the expected behavior?
Ideally the model is supposed to auto-truncate the doc exceeding a certain length.