Closed MichaelMcCulloch closed 4 months ago
Makes sense - thanks for the input and for pointing this out.
Thank you. Can you explain why there is no speedup beyond a batch size of 128?
You could check out e.g. https://github.com/michaelfeil/infinity/tree/main/docs/benchmarks - for long enough requests, a batch_size of 32 already saturates the GPU to a large extent, so adding more items along the batch dimension brings little further benefit from vectorization. On CPU, even with AVX-512 instructions, you are unlikely to see a meaningful speedup beyond a batch size of 4 or 8.
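You can get a feel for this saturation effect with a quick sketch. The snippet below is not infinity's benchmark code - it uses a numpy matmul as a hypothetical stand-in for a model forward pass and reports per-item latency at a few batch sizes; on most hardware the per-item cost drops sharply at first and then flattens out.

```python
import time
import numpy as np

def per_item_latency(batch_size: int, dim: int = 1024, iters: int = 20) -> float:
    """Time a dummy 'forward pass' (a matmul) and return seconds per item.

    Illustrative stand-in for an embedding model, not infinity's real benchmark.
    """
    x = np.random.rand(batch_size, dim).astype(np.float32)
    w = np.random.rand(dim, dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        _ = x @ w  # the 'model' call whose cost we amortize over the batch
    elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size)

if __name__ == "__main__":
    for bs in (1, 8, 32, 128, 256):
        print(f"batch_size={bs:4d}  per-item latency: {per_item_latency(bs):.2e} s")
```

The exact knee of the curve depends on your hardware; the real benchmark scripts in the repo measure the actual model, which is what matters here.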
Feel free to benchmark it (with the included benchmark scripts) and report the results here - I would love to learn from them as well!
@MichaelMcCulloch Do you want to PR a fix?
Reproduction:
Expected: a list of model dictionaries:
Actual: a single dictionary:
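To make the mismatch concrete, here is a sketch of the two shapes. The expected shape follows the OpenAI-compatible `GET /v1/models` response (an `{"object": "list", "data": [...]}` wrapper); the model id and `owned_by` value below are hypothetical placeholders, not taken from the issue.

```python
# Expected: OpenAI-compatible list-of-models response.
# The model id and owned_by values are illustrative placeholders.
expected = {
    "object": "list",
    "data": [
        {"id": "BAAI/bge-small-en-v1.5", "object": "model", "owned_by": "infinity"},
    ],
}

# Actual: a bare model dictionary, not wrapped in a list,
# which breaks clients that iterate over response["data"].
actual = {"id": "BAAI/bge-small-en-v1.5", "object": "model", "owned_by": "infinity"}

assert isinstance(expected["data"], list)
assert "data" not in actual
```

Clients written against the OpenAI spec loop over `data`, so the bare dictionary fails even when only one model is served.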
I have worked around this here, but for this commit to be accepted upstream it would need to adhere to the expected list-of-models format.