Hi @tavallaie, I think we could add a parameter like vectorize.openai_embedding_model that changes the default model, but I am not sure we would want it to apply globally.
When we call vectorize.table(), embeddings are generated for all rows in the source table using the model specified in the transformer parameter. Then, whenever new records are inserted or existing rows are updated, embeddings are computed using that same model (the model is stored in the vectorize.job table). When vectorize.search() is called, the model is looked up from vectorize.job and used, so that embeddings from the search query are in the same embedding space as embeddings from the source table.
Given that, if we changed a GUC like vectorize.openai_embedding_model, would you want it to retroactively update all existing embeddings so that calls to vectorize.search() use the value from vectorize.openai_embedding_model? There can be multiple vectorize jobs at any time, so a single configuration value gets tricky when it shouldn't apply to every project.
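For reference, since the model is stored per job, you can inspect what each job is using by querying that table directly (the exact columns vary by version):
-- list all vectorize jobs and their stored configuration, including the transformer
SELECT * FROM vectorize.job;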
My problem is that if I want to use any other model through an OpenAI-like REST API, I need to be able to define its embedding models too.
I saw that the Ollama provider uses its default embedding model, and OpenAI has three different embedding models with different pricing. So users who want to use OpenAI should be able to change the embedding model, and if they do, all the table's embeddings should be regenerated.
So I think we need a more generic approach to choosing the embedding model, either by adding variables or by adding more generic providers from #152.
BTW, can we have a different config per table (like a different embedding model or Ollama provider) in the same database?
I'm definitely open to a better way of changing or supporting multiple embedding models on the same data. Below is how you'd set up two OpenAI embedding models on the same table.
SELECT vectorize.table(
job_name => 'product_search_openai_3_large',
"table" => 'products',
primary_key => 'product_id',
columns => ARRAY['product_name', 'description'],
transformer => 'openai/text-embedding-3-large',
schedule => 'realtime'
);
SELECT vectorize.table(
job_name => 'product_search_openai_ada_002',
"table" => 'products',
primary_key => 'product_id',
columns => ARRAY['product_name', 'description'],
transformer => 'openai/text-embedding-ada-002',
schedule => 'realtime'
);
Then you can search the same table by changing the job_name parameter.
SELECT * FROM vectorize.search(
job_name => 'product_search_openai_3_large',
query => 'accessories for mobile devices',
return_columns => ARRAY['product_id', 'product_name'],
num_results => 3
);
SELECT * FROM vectorize.search(
job_name => 'product_search_openai_ada_002',
query => 'accessories for mobile devices',
return_columns => ARRAY['product_id', 'product_name'],
num_results => 3
);
What we don't have, though, is the ability to change the embedding model used by the product_search_openai_ada_002 job, for example.
So if I have an OpenAI-compatible API, I change the OpenAI base URL and then define a transformer like in the vectorize.table example you provided?
What about RAG, is it the same?
So if I have an OpenAI-compatible API, I change the OpenAI base URL and then define a transformer like in the vectorize.table example you provided?
That's correct. The limitation is that when you change vectorize.embedding_service_url, it changes for all projects. So I'm not sure it's currently possible to use BOTH OpenAI and an OpenAI-compatible API simultaneously.
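For example, switching everything over to a compatible endpoint looks roughly like this (the URL is a placeholder, and again, this applies globally):
-- point the extension at an OpenAI-compatible server; affects all jobs
ALTER SYSTEM SET vectorize.embedding_service_url = 'https://my-openai-compatible-host/v1';
SELECT pg_reload_conf();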
What about RAG, is it the same?
Same rules apply for RAG (for the vector search part of RAG). RAG uses the vectorize.table() API during vectorize.init_rag and vectorize.search() during the vectorize.rag() call.
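If it helps, the shape of those calls is roughly the following (parameter names are from my reading of the pg_vectorize docs and may differ by version; the agent name and models are just examples):
-- set up a RAG agent; internally this goes through vectorize.table()
SELECT vectorize.init_rag(
    agent_name => 'product_chat',
    table_name => 'products',
    "column" => 'description',
    unique_record_id => 'product_id',
    transformer => 'sentence-transformers/all-MiniLM-L6-v2'
);
-- ask a question; internally this goes through vectorize.search()
SELECT vectorize.rag(
    agent_name => 'product_chat',
    query => 'What accessories work with mobile devices?',
    chat_model => 'openai/gpt-3.5-turbo'
);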
Hmm, can we add something like an OpenAI-compatible API provider table, and for each service make a new record, like a job? That way we only call that specific job for RAG.
Yes, I think we can do that with a dedicated table. Or we could put the URL and the API format (e.g. openai) right in the project data already on the vectorize.job table; that way each job can have a unique URL.
But if we put it on a dedicated "model providers" table like you suggested, then we can share providers across projects.
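As a rough sketch of the dedicated-table idea (entirely hypothetical, all names are placeholders):
-- one row per embedding provider, shareable across projects
CREATE TABLE vectorize.model_providers (
    provider_id   SERIAL PRIMARY KEY,
    provider_name TEXT NOT NULL UNIQUE,          -- e.g. 'openai', 'my-compatible-api'
    base_url      TEXT NOT NULL,                 -- e.g. 'https://api.openai.com/v1'
    api_format    TEXT NOT NULL DEFAULT 'openai' -- request/response schema to use
);
-- each job could then reference its provider, e.g.:
-- ALTER TABLE vectorize.job ADD COLUMN provider_id INT REFERENCES vectorize.model_providers (provider_id);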
Can you make a new issue with your decision and a detailed scope for the enhancement and its implementation, so we can close this one?
@tavallaie - great idea. Here's the issue. It is WIP, but please take a look and leave comments :) I will need help with design and implementation if you are interested and available!
So let's close this issue; we can continue in that one.
I found a problem in the current setup. There is no easy way to change the embedding models for both Ollama and OpenAI. Right now:
- Ollama Embedding Model: There is no option to change the embedding model used for Ollama.
- OpenAI Embedding Model: The OPENAI_BASE_URL is for OpenAI's API, but it is hardcoded. There is no option to change the embedding model, e.g. from text-embedding-ada-002 to another one.
Suggestion:
Add new variables or settings that allow users to change the embedding model for Ollama and OpenAI.
For example:
- vectorize.ollama_embedding_model to change the Ollama model.
- vectorize.openai_embedding_model to switch OpenAI models, like text-embedding-ada-002.
This will give users more control to choose different models without changing the code directly.
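If those settings existed, usage might look like this (these GUCs are the proposed interface from the suggestion above and do not exist today; the model names are just examples):
-- hypothetical: pick the embedding model per provider via a setting
SET vectorize.openai_embedding_model = 'text-embedding-3-small';
SET vectorize.ollama_embedding_model = 'nomic-embed-text';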
Extra Information:
Right now, the embedding models are fixed in the code, and users cannot change them easily. Adding these options will make the extension more flexible, like the other settings (API keys, URLs) that are already there.