opensearch-project / opensearch-py-ml

[FEATURE] Request for new pre-trained multi-lingual models #182

Open adacop-os opened 1 year ago

adacop-os commented 1 year ago

Is your feature request related to a problem? I would like to be able to easily upload the sentence-transformers/distiluse-base-multilingual-cased-v1 and sentence-transformers/distiluse-base-multilingual-cased-v2 models to my OpenSearch cluster.

What solution would you like? The OpenSearch Project can trace these models and upload them to its artifacts server.

What alternatives have you considered? Tracing the models myself and self-hosting them.
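
For reference, the self-hosting alternative is already workable with this library. Below is a minimal sketch of tracing and registering the v1 model yourself; the host, folder path, and sample sentence are placeholders, and the helper names follow the opensearch-py-ml tracing APIs (SentenceTransformerModel, MLCommonClient):

from opensearchpy import OpenSearch
from opensearch_py_ml.ml_commons import MLCommonClient
from opensearch_py_ml.ml_models import SentenceTransformerModel

# Connect to the target cluster (adjust host/auth for your setup).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
ml_client = MLCommonClient(client)

model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"
model = SentenceTransformerModel(model_id=model_id, folder_path="/tmp/distiluse-v1", overwrite=True)

# Trace the model to TorchScript and generate the config file ML Commons expects.
model_path = model.save_as_pt(model_id=model_id, sentences=["a sample sentence for tracing"])
model_config_path = model.make_model_config_json()

# Upload (register) the traced model zip to the cluster.
ml_client.register_model(model_path, model_config_path, isVerbose=True)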

Do you have any additional context? N/A

juntezhang commented 1 year ago

I am also interested in this, for both models.

dhrubo-os commented 11 months ago

We released the sentence-transformers/distiluse-base-multilingual-cased-v1 model in TorchScript format.

You can register the model now:

POST _plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/distiluse-base-multilingual-cased-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
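
Note that _register is asynchronous: the call returns a task_id, and the model must be deployed before it can serve inference. A typical follow-up, with the IDs below being placeholders taken from the task response:

GET _plugins/_ml/tasks/<task_id>

POST _plugins/_ml/models/<model_id>/_deploy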

reuschling commented 4 months ago

> We released the sentence-transformers/distiluse-base-multilingual-cased-v1 model in TorchScript format.

Is it true that it has an input limit of 128 tokens? My tests suggest that anything beyond that is truncated when my passage chunks are bigger. On Hugging Face they write the following (is it the 'max_seq_length': 128?):

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)

The same applies to the other pre-configured multilingual model.
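
If 'max_seq_length': 128 really is the input limit, then two passages that share their first 128 tokens but differ afterwards should produce identical embeddings. A quick local check with sentence-transformers (a sketch; the repeated filler sentence is just an easy way to exceed 128 tokens):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")
print(model.max_seq_length)  # reports 128 for this model

# Two passages identical well past 128 tokens, differing only at the end.
base = "This is a filler sentence for the truncation test. " * 30
a, b = base + "apples", base + "oranges"

# If inputs are truncated at 128 tokens, the differing tails are never seen,
# so the embeddings come out identical.
print(np.allclose(model.encode(a), model.encode(b)))  # True -> tail truncated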

Maybe it would be worth adding an 'input token length' column to the table at https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#supported-pretrained-models. The text chunking documentation suggests that all "OpenSearch-supported pretrained models" have an input token limit of 512.
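
In the meantime, the configuration a model was registered with can be inspected on the cluster itself; the response includes the model_config section that was uploaded with the model (the exact fields depend on how the model was traced):

GET _plugins/_ml/models/<model_id>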