stargate / data-api

JSON document API for Apache Cassandra (formerly known as JSON API)
https://stargate.io
Apache License 2.0

Tables: status of "vectorize" with multiple vectorize columns #1726

Open hemidactylus opened 1 week ago

hemidactylus commented 1 week ago

This is mostly a survey and a note to kick off future work. I tried several tricky vectorize scenarios with 1.0.20 on a dev DB, using tables with more than one vectorize column. (Note: this cannot be tested completely on a local Data API, because the two-provider case requires KMS: there is no way to send multiple embedding API keys via header.)

Two columns with same provider, model and dimension

Inserting a row with {v1: "blabla", v2: [...]} (i.e. one string and one vector) works

Both vectors: works

Both strings: works (two different embedding vectors got stored for two different texts, as expected)

Two columns with same provider and model, but different dimension

Passing both as strings does not work: the API mistakenly assumes both vectors have the same dimension:

The Embedding Provider returned an unexpected response: The Embedding Provider
returned an unexpected response: Embedding provider 'openai' did not return expected
embedding length. Expect: '333'. Actual: '123'
(EMBEDDING_PROVIDER_UNEXPECTED_RESPONSE)
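
The error suggests that the expected length is taken from a single value for the whole request instead of from each column. As an illustration only, here is a minimal Python sketch of a per-column check. The names `VectorizeColumn` and `check_embedding_lengths` are hypothetical and not taken from the Data API code base, and `some-model` is a placeholder; the dimensions 333 and 123 are the ones from the error above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorizeColumn:
    """Hypothetical description of one vectorize-enabled column."""
    name: str
    provider: str
    model: str
    dimension: int


def check_embedding_lengths(columns, embeddings):
    """Compare each returned embedding with its own column's dimension.

    `columns` maps column name -> VectorizeColumn and `embeddings` maps
    column name -> list of floats. Returns a list of mismatch messages,
    empty when every embedding has the length its column expects.
    """
    problems = []
    for name, vector in embeddings.items():
        expected = columns[name].dimension
        if len(vector) != expected:
            problems.append(
                f"column '{name}': expected {expected}, got {len(vector)}"
            )
    return problems


# Two columns, same provider and model, different dimensions.
columns = {
    "v1": VectorizeColumn("v1", "openai", "some-model", 333),
    "v2": VectorizeColumn("v2", "openai", "some-model", 123),
}
embeddings = {"v1": [0.0] * 333, "v2": [0.0] * 123}
print(check_embedding_lengths(columns, embeddings))
```

With a per-column lookup, a 123-long embedding is only an error if it lands in the 333-dimension column.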

Two columns with same provider, different models, different dimensions

Same error as above (and at this point I suspect that, even if the dimensions matched, it would either error or work incorrectly)

Two columns with different providers

This time, inserting both as strings leads to a 500 Internal Server Error:

Server failed: root cause: (java.lang.IllegalArgumentException) Must be single
embedding provider name, got [openai, jinaAI]. Server error '500 Internal Server
Error' for url 'https://[...]-us-west-2.apps.astra-dev.datastax.com/[...]/<TABLE>

If instead one of the two is passed as a vector and the other as a string, one of two things happens, depending on which column gets the string:

  1. it works
  2. the dimension-mismatch error seen above is triggered (probably because the dimension is still a table-level setting somewhere and is used for the wrong model)

Yuqi-Du commented 2 days ago

Thanks, Stefano.

For this error, we certainly need a better error message, and not a 500:

Server failed: root cause: (java.lang.IllegalArgumentException) Must be single
embedding provider name, got [openai, jinaAI]. Server error '500 Internal Server
Error' for url 'https://[...]-us-west-2.apps.astra-dev.datastax.com/[...]/<TABLE>

Another problem is that the Data API only supports vectorizing multiple fields when they share the same provider and dimension, and it fails when users do otherwise. Meanwhile, we allow users to create tables whose columns have different vectorize settings. This one needs discussion.
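
As input for that discussion, one option would be to detect mixed settings up front and report them as a readable client error rather than failing at insert time. The following Python sketch only illustrates the idea: `find_vectorize_conflicts`, the dict layout, and the model names `model-a` / `model-b` are assumptions made for this example, not the Data API's schema or code.

```python
def find_vectorize_conflicts(columns, keys=("provider", "model", "dimension")):
    """List the settings that differ between vectorize columns.

    `columns` maps column name -> dict with the entries named in `keys`.
    Every column is compared against the alphabetically first one, and
    each difference becomes one human-readable message.
    """
    names = sorted(columns)
    conflicts = []
    for other in names[1:]:
        for key in keys:
            ref_value = columns[names[0]][key]
            value = columns[other][key]
            if value != ref_value:
                conflicts.append(
                    f"{other}.{key}={value!r} differs from "
                    f"{names[0]}.{key}={ref_value!r}"
                )
    return conflicts


# Hypothetical table with two vectorize columns on different providers.
table_columns = {
    "v1": {"provider": "openai", "model": "model-a", "dimension": 333},
    "v2": {"provider": "jinaAI", "model": "model-b", "dimension": 123},
}
conflicts = find_vectorize_conflicts(table_columns)
for message in conflicts:
    print(message)
```

An empty result means the columns share all checked settings; passing `keys=("provider", "dimension")` would check only the two settings named in this comment.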

amorton commented 2 days ago

December hotfix: better errors, including for the unsupported different-dimensions case. Split this ticket when we get to work on it. January fix: handle different providers, models, and dimensions in the same table.
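
For the January part, one possible shape (a sketch under assumptions, not a design decision) is to group the columns to be vectorized by their full settings, so that each group gets its own embedding call and its own expected dimension. `group_by_embedding_settings`, the dict layout, and the model names are hypothetical.

```python
from collections import defaultdict


def group_by_embedding_settings(columns):
    """Group column names by (provider, model, dimension).

    `columns` maps column name -> dict with 'provider', 'model' and
    'dimension'. Each resulting group could then be sent to its
    provider in a separate embedding request.
    """
    groups = defaultdict(list)
    for name, cfg in columns.items():
        key = (cfg["provider"], cfg["model"], cfg["dimension"])
        groups[key].append(name)
    return dict(groups)


# Hypothetical table: v1 and v3 share settings, v2 uses another provider.
columns = {
    "v1": {"provider": "openai", "model": "model-a", "dimension": 333},
    "v2": {"provider": "jinaAI", "model": "model-b", "dimension": 123},
    "v3": {"provider": "openai", "model": "model-a", "dimension": 333},
}
groups = group_by_embedding_settings(columns)
for settings, names in groups.items():
    print(settings, names)
```

Keying on the whole tuple rather than on the provider alone keeps two columns that use the same provider with different models or dimensions in separate groups.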