SQL generated by LLM be limited to the selected training data

tzh5477 commented 3 months ago

Due to data security control requirements, can the SQL generated by LLM be limited to the selected training data instead of all the trained data?

andreped commented 3 months ago

Im not sure I understand what you are asking. In terms of sql-training, you feed question-sql pairs and store embeddings for each in a vector store. Then the LLM will use a subset of these question-sql pairs to build the dynamic context. For this retrieval, all pairs in the vector store can be used.

I guess what you want is for some applications, only to make a smaller set of the vector store available to the application? Why would you do this? Why don't you just make two separate vector stores, and then for application A you can use one vector store and application B use another?

Anyways, if you don't want to make certain data available in application A or B, you can just setup a different DB user with access to different tables. That way, if the LLM were to try to make an SQL to retrieve data from parts of the DB that it cannot access, it will not be able to.

What are you trying to do?

tzh5477 commented 3 months ago

You understanding are correct and the solution is good .

I'm a newbie, how can I create and use different vector stores? the relevant code?

Im not sure I understand what you are asking. In terms of sql-training, you feed question-sql pairs and store embeddings for each in a vector store. Then the LLM will use a subset of these question-sql pairs to build the dynamic context. For this retrieval, all pairs in the vector store can be used.

I guess what you want is for some applications, only to make a smaller set of the vector store available to the application? Why would you do this? Why don't you just make two separate vector stores, and then for application A you can use one vector store and application B use another?

Anyways, if you don't want to make certain data available in application A or B, you can just setup a different DB user with access to different tables. That way, if the LLM were to try to make an SQL to retrieve data from parts of the DB that it cannot access, it will not be able to.

What are you trying to do?

andreped commented 3 months ago

I'm a newbie, how can I create and use different vector stores? the relevant code?

But do you want multiple vector stores or multiple DB users with different rights using the same vector store? I am not sure what is most optimal for your use case. But in general, if data privacy is important, you likely want multiple DB users with different rights. And then when communicating with one part of the DB, you use connection A, and then for the other you use another connection B.

Easiest way to solve this is likely to just setup two separate vanna instances (with different DB connectors): https://github.com/vanna-ai/vanna?tab=readme-ov-file#import

And then in your application, you handle yourself which user can access which vanna instance (and in turn which part of the DB). Does that make sense?

Could also be that instead of having completely separate Vanna instances, you can just override the run_sql method like so:

vanna_instance.run_sql = execute_sql_query_to_dataframe_method_application_A

But this might be problematic if you have multiple users using the same vanna_instance. You dont want to accidentally give user B access to data in connection A.

tzh5477 commented 3 months ago

I'm a newbie, how can I create and use different vector stores? the relevant code?

But do you want multiple vector stores or multiple DB users with different rights using the same vector store? I am not sure what is most optimal for your use case. But in general, if data privacy is important, you likely want multiple DB users with different rights. And then when communicating with one part of the DB, you use connection A, and then for the other you use another connection B.

Easiest way to solve this is likely to just setup two separate vanna instances (with different DB connectors): https://github.com/vanna-ai/vanna?tab=readme-ov-file#import

And then in your application, you handle yourself which user can access which vanna instance (and in turn which part of the DB). Does that make sense?

Could also be that instead of having completely separate Vanna instances, you can just override the run_sql method like so:
vanna_instance.run_sql = execute_sql_query_to_dataframe_method_application_A
But this might be problematic if you have multiple users using the same vanna_instance. You dont want to accidentally give user B access to data in connection A.

multiple database users with different rights using the same vector store.

I don't think overriding the run_sql is a good solution.

andreped commented 3 months ago

multiple database users with different rights using the same vector store.

I don't think overriding the run_sql is a good solution.

Then you need one vanna instance per DB user. That should be rather straight forward to implement. Let me know how it goes :]

tzh5477 commented 3 months ago

I came up with a new solution .

Add a new column to the vector store to distinguish different user groups , based on this field, we can limit the scope of vector searches when users ask questions.

vanna-ai / vanna

SQL generated by LLM be limited to the selected training data #317