michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

Reranker model fails to load (maidalun1020/bce-reranker-base_v1) - no max token length is set #127

Closed · Matheus-Garbelini closed 4 months ago

Matheus-Garbelini commented 4 months ago

Hello, when trying to load the model maidalun1020/bce-reranker-base_v1, infinity_emb outputs the error below. Is something missing in this model's config?

infinity_emb --model-name-or-path maidalun1020/bce-reranker-base_v1 --batch-size 16 --log-level info

INFO     2024-03-06 16:39:18,588 datasets INFO: PyTorch version 2.2.0 available.                                                                                        config.py:58
INFO:     Started server process [130475]
INFO:     Waiting for application startup.
INFO     2024-03-06 16:39:19,255 infinity_emb INFO: model=`maidalun1020/bce-reranker-base_v1` selected, using engine=`torch` and device=`None`                    select_model.py:54
INFO     2024-03-06 16:39:21,842 sentence_transformers.cross_encoder.CrossEncoder INFO: Use pytorch device: cuda                                                  CrossEncoder.py:82
INFO     2024-03-06 16:39:22,181 infinity_emb INFO: No optimizations via Huggingface optimum, it is disabled via env INFINITY_DISABLE_OPTIMUM                     acceleration.py:29
INFO     2024-03-06 16:39:22,182 infinity_emb INFO: Switching to half() precision (cuda: fp16). Disable by the setting the env var `INFINITY_DISABLE_HALF`               torch.py:60
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
INFO     2024-03-06 16:39:22,789 infinity_emb INFO: Getting timings for batch_size=16 and avg tokens per sentence=3                                               select_model.py:81
                 0.00     ms tokenization                                                                                                                                           
                 7.75     ms inference                                                                                                                                              
                 0.00     ms post-processing                                                                                                                                        
                 7.75     ms total                                                                                                                                                  
         embeddings/sec: 2064.93                                                                                                                                                    
ERROR:    Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
    await handler()
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/infinity_server.py", line 57, in _startup
    app.model = AsyncEmbeddingEngine.from_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/engine.py", line 45, in from_args
    engine = cls(**asdict(engine_args), _show_deprecation_warning=False)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/engine.py", line 36, in __init__
    self._model, self._min_inference_t, self._max_inference_t = select_model(
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/inference/select_model.py", line 83, in select_model
    loaded_engine.warmup(batch_size=engine_args.batch_size, n_tokens=512)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 97, in warmup
    return run_warmup(self, inp)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/abstract.py", line 105, in run_warmup
    embed = model.encode_core(feat)
  File "/opt/conda/lib/python3.10/site-packages/infinity_emb/transformer/crossencoder/torch.py", line 75, in encode_core
    out_features = self.predict(
  File "/opt/conda/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 332, in predict
    model_predictions = self.model(**features, return_dict=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1208, in forward
    outputs = self.roberta(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 803, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1028) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [16, 1028].  Tensor sizes: [1, 514]

ERROR:    Application startup failed. Exiting.
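
For context, a minimal sketch that reproduces the same failure outside infinity (assuming the sentence_transformers CrossEncoder API shown in the traceback; the oversized dummy pair is purely illustrative):

from sentence_transformers import CrossEncoder

# without max_length the tokenizer prints the "no truncation" warning above
# and does not clip inputs, so long pairs overflow the 514 position embeddings
model = CrossEncoder('maidalun1020/bce-reranker-base_v1')
long_pair = [('hello ' * 600, 'world ' * 600)]  # tokenizes well past 514 tokens
model.predict(long_pair)  # raises the same RuntimeError from modeling_xlm_roberta.py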
michaelfeil commented 4 months ago

Hey @Matheus-Garbelini, thanks for opening the issue.

Looks like there is no max_length attribute in the config.json:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  1. How does it work if you send a rerank request using `from sentence_transformers import CrossEncoder` directly? Is the max length respected for this model? (I am pretty sure the issue is upstream.)
  2. Given that max_length is not in the config, how would you expect the 514-token limit to be handled? (The model supports only 514 positions; see the check sketch below.)
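
To see where the 514 comes from, here is a small check sketch (assuming the Hugging Face transformers API; the model id is the one from this issue):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('maidalun1020/bce-reranker-base_v1')
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')

# XLM-RoBERTa reserves two extra position slots, hence 514 rather than 512
print(config.max_position_embeddings)  # 514 for this architecture
print(tokenizer.model_max_length)      # a huge sentinel value when tokenizer_config.json sets no limit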
shenlei1020 commented 4 months ago

Please check the usage of "maidalun1020/bce-reranker-base_v1" in: https://github.com/netease-youdao/BCEmbedding?tab=readme-ov-file#3-based-on-sentence_transformers

from sentence_transformers import CrossEncoder

# init reranker model with an explicit max token length
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# sentence pairs to score: (query, passage) tuples (example pairs)
sentence_pairs = [('what is the capital of France?', 'Paris is the capital of France.')]

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)

max_length should be 512.

michaelfeil commented 4 months ago

@shenlei1020 @Matheus-Garbelini Thanks for your comments - excited to see your responses here.

I would avoid overwriting the defaults of the author's model code in infinity - that is up to the person publishing the model. In this case, a wrong value was shipped upstream, and the engineers behind the model are addressing it in https://huggingface.co/maidalun1020/bce-reranker-base_v1/discussions/4. I encourage you to fix things like this directly in the upstream repos in the future - infinity just optimizes the inference.

https://huggingface.co/maidalun1020/bce-reranker-base_v1/discussions/4/files will solve it.
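
Until that upstream change is merged, one possible local workaround is a sketch like the following (it assumes model_max_length in tokenizer_config.json is where the tokenizer picks up the limit, which is what the truncation warning points at; the local directory name is illustrative):

import json
from pathlib import Path
from huggingface_hub import snapshot_download

# fetch an editable local copy of the model repo
local = Path(snapshot_download('maidalun1020/bce-reranker-base_v1', local_dir='./bce-reranker-base_v1'))

# add the missing limit so the tokenizer truncates to the usable 512 tokens
cfg_file = local / 'tokenizer_config.json'
cfg = json.loads(cfg_file.read_text())
cfg['model_max_length'] = 512
cfg_file.write_text(json.dumps(cfg, indent=2))

Pointing infinity_emb --model-name-or-path at the patched directory should then start up without the truncation warning.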

Matheus-Garbelini commented 4 months ago

haha, thanks a lot @michaelfeil. This model is indeed not mine, so I could only assume it was some upstream config issue; you confirmed that this was the case.

Currently I'm running infinity with embedding + reranking models and it works flawlessly. Regards.