Jon-AtAWS opened this issue 2 months ago
@Jon-AtAWS The issue seems to be with the ML Commons plugin. I cannot transfer the issue to ML Commons; could you please create an issue in the ML Commons repo?
Actually, I may have figured it out. I got confused by the "512", which is the model's dimension count; I believe the error is actually about the size of the chunk. I added:
```python
doc['chunk'] = ' '.join(doc['chunk'].split()[:500])
```
And I'm not seeing the problem any more
We can make this an enhancement request instead, for a better error message.
@Jon-AtAWS Hi, could you provide the request bodies you used to register the model and create the pipeline, so we can try to reproduce? The pretrained model truncates its input itself, so the size should not be the problem.
Hi @xinyual,
I think the model was "huggingface/sentence-transformers/distiluse-base-multilingual-cased-v1". It might have been "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b" (in some runs, it was).
```python
def _make_register_model_call(self, model_name):
    data = {
        "name": model_name,
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
        "model_group_id": self.model_group_id
    }
    resp = requests.post('https://localhost:9200/_plugins/_ml/models/_register',
                         data=json.dumps(data),
                         auth=(self.admin_user, self.admin_password),
                         verify=False,
                         headers={"Content-Type": "application/json"})
    if resp.status_code >= 300:
        logging.error(f'_register call for {model_name} returned bad status {resp.status_code}\n{resp.reason}')
        raise Exception(resp.text)
    return json.loads(resp.text)['task_id']
```
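Since `_register` only returns a `task_id`, a follow-up step is to poll the ML Commons tasks API (`GET /_plugins/_ml/tasks/<task_id>`) until the register task completes and yields the `model_id`. A minimal sketch, assuming a `get_task` callable that fetches and parses that endpoint (the helper name `wait_for_model_id` is hypothetical, not from my code):

```python
import time

def wait_for_model_id(get_task, task_id, timeout=60, interval=2):
    """Poll the ML Commons task until the register task finishes.

    get_task(task_id) is assumed to return the parsed JSON of
    GET /_plugins/_ml/tasks/<task_id> as a dict.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        task = get_task(task_id)
        if task.get('state') == 'COMPLETED':
            return task['model_id']   # needed for the pipeline's text_embedding processor
        if task.get('state') == 'FAILED':
            raise Exception(f"register task failed: {task.get('error')}")
        time.sleep(interval)
    raise TimeoutError(f'task {task_id} did not complete in {timeout}s')
```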
The pipeline config is like this:
```python
def _pipeline_config(self, pipeline_field_map=None):
    if not pipeline_field_map:
        pipeline_field_map = {'chunk': 'chunk_vector'}
    config = {
        "description": "Pipeline for processing chunks",
        "processors": [
            {
                "text_embedding": {
                    "model_id": f'{self.model_id()}',
                    "field_map": pipeline_field_map
                }
            }
        ]
    }
    logging.info(config)
    return config

def _add_neural_pipeline(self, pipeline_name='', pipeline_field_map=None):
    if not pipeline_name:
        raise Exception('add_neural_pipeline: pipeline name must be specified')
    pipeline_config = self._pipeline_config(pipeline_field_map=pipeline_field_map)
    logging.info('Adding neural pipeline...')
    self.os_client.ingest.put_pipeline(pipeline_name, body=pipeline_config)
```
And the full kNN setup was this:
```python
def setup_for_kNN(self, index_name='', index_settings='', pipeline_name=None, pipeline_field_map=None,
                  model_name='', model_dimensions=0):
    logging.info(f'Setup for KNN; ml model: {self.ml_model}; ml model group: {self.ml_model_group}')
    self.index_name = index_name
    self.pipeline_name = pipeline_name
    self.os_client.cluster.put_settings(body={"persistent": {"plugins.ml_commons.only_run_on_ml_node": "true"}})
    self.os_client.cluster.put_settings(body={"persistent": {"plugins.ml_commons.model_access_control_enabled": "false"}})
    self.os_client.cluster.put_settings(body={"persistent": {"plugins.ml_commons.allow_registering_model_via_url": "true"}})
    self.ml_model_group = MLModelGroup(self.os_client, self.ml_commons_client, admin_user=self._admin_user,
                                       admin_password=self._admin_password)
    time.sleep(1)
    self.ml_model = MLModel(self.os_client, self.ml_commons_client, self.ml_model_group.model_group_id(),
                            model_name=model_name, model_dimensions=model_dimensions,
                            admin_user=self._admin_user, admin_password=self._admin_password)
    self.clean_create_index(index_name=index_name, settings=index_settings)
    self._add_neural_pipeline(pipeline_name=pipeline_name, pipeline_field_map=pipeline_field_map)
```
MLModel and MLModelGroup are classes that wrap the calls to OpenSearch.
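For completeness, a sketch of the kind of index body `clean_create_index` would need for this setup: `index.knn` enabled, the ingest pipeline attached as `default_pipeline`, and a `knn_vector` field matching the pipeline's `field_map` (`chunk` → `chunk_vector`). The pipeline name here is a placeholder, and 512 is the dimension of `distiluse-base-multilingual-cased-v1`:

```python
# Hypothetical index body; pipeline name and analyzer details are assumptions.
knn_index_body = {
    "settings": {
        "index.knn": True,                         # enable k-NN search on this index
        "default_pipeline": "chunk-embedding-pipeline",  # placeholder pipeline name
    },
    "mappings": {
        "properties": {
            "chunk": {"type": "text"},
            "chunk_vector": {
                "type": "knn_vector",
                "dimension": 512,  # distiluse-base-multilingual-cased-v1 output size
            },
        }
    },
}
```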
And could you provide the model and version when there is an error? If it is `huggingface/sentence-transformers/msmarco-distilbert-base-tas-b` version `1.0.1`, it is because that version doesn't have truncation; we fixed that in version `1.0.2`. I notice you set the version to `1.0.1`, so please check whether you are using the correct version. Is that the source of the error?
Ah! That must be the error. What I send to the `_ml` API is:

```python
data = {
    "name": model_name,
    "version": "1.0.1",
    "model_format": "TORCH_SCRIPT",
    "model_group_id": self.model_group_id
}
```
The versioning is confusing... can we support a "$latest" parameter or similar? That way, I can hard-wire a version for production but always test with the latest. It's a good thing most models have version 1.0.1... It never errored, so I never paid attention to the version parameter.
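Until something like "$latest" exists server-side, a client could resolve it by comparing the version strings it knows about numerically (plain string comparison would mis-order e.g. "1.0.10" vs "1.0.9"). A small hypothetical helper, not part of any OpenSearch API:

```python
def latest_version(versions):
    """Pick the highest version from a list of dotted version strings,
    comparing numerically component by component."""
    return max(versions, key=lambda v: tuple(int(p) for p in v.split('.')))

print(latest_version(["1.0.1", "1.0.2"]))   # -> 1.0.2
print(latest_version(["1.0.9", "1.0.10"]))  # -> 1.0.10, not 1.0.9
```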
By the way, my above fix is not right; I needed to do this:

```python
def create_max_500_character_string(s):
    ret = ""
    words = s.split(' ')
    for word in words:
        # +1 accounts for the joining space, so we never exceed 500 characters
        if len(ret) + len(word) + 1 > 500:
            break
        ret = f'{ret} {word}'
    return ret.strip()
```
Which is working. So it looks like the truncation has to be by characters, not tokens.
What is the bug? Frequent Torch errors in the OpenSearch log during bulk upload while using `distiluse-base-multilingual-cased-v1`. The weird thing is that I'm seeing different dimensions in the error.
How can one reproduce the bug? Steps to reproduce the behavior: I'm running data from the Amazon Product Q&A data set through the `_bulk` API. Other models work fine, so there's an existence proof that the code is not at fault. Copy-pasting a bunch of code here.
What is your host/environment?
Sample error: