vanna-ai / vanna

🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using RAG 🔄.
https://vanna.ai/docs/
MIT License
12.14k stars 977 forks

(Feature Request) Get the LLM to use specific SQL from the question-SQL pair in the Chroma database #331

Open jankariwo opened 8 months ago

jankariwo commented 8 months ago

I've been setting up the Chroma database with DDL, SQL, and business documentation. vn.generate_sql is not doing very well at generating SQL when the question contains company-specific terms (which have been added in the documentation). When I used similar1 = vn.get_similar_question_sql(question), I got 10 results, with the first one matching what I have in Chroma. So I appended this to the prompt, vn.generate_sql(question + similar1[0]), and the results were better.

Feature request: if I ask the LLM the same question that is in a question-SQL pair I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

andreped commented 8 months ago

If I ask the LLM the same question that is in a question-SQL pair I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

Hello, @jankariwo! :]

I'm not sure I follow. If you train on a question-SQL pair, it will be stored in the Chroma vector DB. If you then ask the same question again, that question-SQL pair should be one of the top 10 most similar pairs, and it will be part of the prompt sent for the new SQL completion. It is very likely that the LLM will realize the question is already part of the context and thus knows which SQL query to produce. So I would expect it, more often than not, to yield the same SQL query as output.

Are you not seeing this? Or are you seeing that it works most of the time, but not always?

zainhoda commented 8 months ago

@andreped I think the feature request is something like:

If the question is exactly the same as one of the trained question-SQL pairs, then instead of using the LLM to generate the SQL, it should bypass the LLM and just return the SQL that was associated with that question.
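In code, that bypass could look something like this (a minimal sketch only; `generate_sql_or_cached` and the shape of `question_sql_list` are assumptions for illustration, not the actual Vanna API):

```python
def generate_sql_or_cached(question, question_sql_list, llm_generate):
    """Return stored SQL on an exact question match; otherwise fall back to the LLM.

    question_sql_list is assumed to be a list of dicts like
    {"question": ..., "sql": ...}, as returned by the similarity search.
    """
    for item in question_sql_list:
        if item["question"] == question:
            # Exact match found: bypass the LLM entirely.
            return item["sql"]
    # No exact match: let the LLM generate the SQL as usual.
    return llm_generate(question)
```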

jankariwo commented 8 months ago

@andreped I think the feature request is something like:

If the question is exactly the same as one of the trained question-SQL pairs, then instead of using the LLM to generate the SQL, it should bypass the LLM and just return the SQL that was associated with that question.

That is the correct interpretation.

jankariwo commented 8 months ago

If I ask the LLM the same question that is in a question-SQL pair I loaded into ChromaDB, is there a way to get the LLM to use the corresponding SQL?

Hello, @jankariwo! :]

I'm not sure I follow. If you train on a question-SQL pair, it will be stored in the Chroma vector DB. If you then ask the same question again, that question-SQL pair should be one of the top 10 most similar pairs, and it will be part of the prompt sent for the new SQL completion. It is very likely that the LLM will realize the question is already part of the context and thus knows which SQL query to produce. So I would expect it, more often than not, to yield the same SQL query as output.

Are you not seeing this? Or are you seeing that it works most of the time, but not always?

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result... hence the feature request...

andreped commented 8 months ago

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result... hence the feature request...

I believe I have seen the same, sadly. So great that you are requesting it!

I like the idea of simply checking whether the question is already in the vector store and returning the stored SQL on a perfect match. That should resolve the issue. The lookup can be performed extremely fast and would also improve performance in these cases.

jiangsiYang commented 7 months ago

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result... hence the feature request...

I believe I have seen the same, sadly. So great that you are requesting it!

I like the idea of simply checking whether the question is already in the vector store and returning the stored SQL on a perfect match. That should resolve the issue. The lookup can be performed extremely fast and would also improve performance in these cases.

Thank you! So, can I understand it this way: the 'correct query' store holds questions; each time a search is conducted, it retrieves the question vectors from that store, and when the similarity exceeds a certain threshold, it directly returns the SQL corresponding to the matched 'correct query'?

jankariwo commented 7 months ago

Yes, you are right. You described the EXPECTED behavior, but the LLM still returns the wrong SQL. I've done several iterations with the same result... hence the feature request...

I believe I have seen the same, sadly. So great that you are requesting it! I like the idea of simply checking whether the question is already in the vector store and returning the stored SQL on a perfect match. That should resolve the issue. The lookup can be performed extremely fast and would also improve performance in these cases.

Thank you! So, can I understand it this way: the 'correct query' store holds questions; each time a search is conducted, it retrieves the question vectors from that store, and when the similarity exceeds a certain threshold, it directly returns the SQL corresponding to the matched 'correct query'?

Yup. It returns the stored SQL for the matched question instead of sending a prompt to the LLM to generate it.
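The threshold variant described above could be sketched like this (illustrative only; Vanna's actual retrieval API and embedding model are not shown, the shape of `stored` and the 0.95 threshold are assumptions):

```python
import math

def cosine_similarity(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup_sql(question_vec, stored, threshold=0.95):
    """Return the SQL of the most similar stored question if the similarity
    exceeds the threshold; otherwise None, so the caller falls back to the LLM.

    `stored` is assumed to be a list of (question_vector, sql) tuples.
    """
    best_sql, best_sim = None, 0.0
    for vec, sql in stored:
        sim = cosine_similarity(question_vec, vec)
        if sim > best_sim:
            best_sql, best_sim = sql, sim
    return best_sql if best_sim >= threshold else None
```

With a threshold of 1.0 (and identical embeddings for identical strings) this degenerates to the exact-match bypass; lower thresholds trade safety for more cache hits.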

andreped commented 7 months ago

@zainhoda I assume this feature was not part of the last release? It would be of great interest for our applications as well :]

zainhoda commented 7 months ago

@andreped if you (or anyone else) would like to take a quick stab at this, I think it is a relatively simple change.

Add something like this:

    # Iterate through each item in the list
    for item in question_sql_list:
        # Check if the current item's question matches the input question
        if item['question'] == question:
            # If a match is found, return the corresponding sql value
            return item['sql']

right after this line:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/base.py#L110

and then maybe add a quick test for this
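That quick test could look something like this (a sketch assuming the suggested loop is wrapped in a standalone helper, here called `exact_match_sql`; this is not the actual Vanna test suite):

```python
def exact_match_sql(question, question_sql_list):
    # Mirror of the suggested loop, as a standalone helper for testing.
    for item in question_sql_list:
        if item['question'] == question:
            return item['sql']
    return None

def test_exact_match_returns_stored_sql():
    pairs = [
        {"question": "How many customers are there?",
         "sql": "SELECT COUNT(*) FROM customers"},
    ]
    # An exact question match should return the stored SQL verbatim.
    assert exact_match_sql("How many customers are there?", pairs) == \
        "SELECT COUNT(*) FROM customers"
    # A different question should not match, so the LLM path would be used.
    assert exact_match_sql("How many orders are there?", pairs) is None
```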

jankariwo commented 7 months ago

@andreped if you (or anyone else) would like to take a quick stab at this, I think it is a relatively simple change.

Add something like this:

    # Iterate through each item in the list
    for item in question_sql_list:
        # Check if the current item's question matches the input question
        if item['question'] == question:
            # If a match is found, return the corresponding sql value
            return item['sql']

right after this line:

https://github.com/vanna-ai/vanna/blob/main/src/vanna/base/base.py#L110

and then maybe add a quick test for this

Thanks @zainhoda. I've implemented it: added it to my base.py and tested it. I also created my own function in the Vanna class to test the outputs. It works perfectly. You just have to look out for trailing whitespace in the question, which can prevent a match, but this is easily fixed with extra string cleaning before checking for a match. Thanks again.
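The whitespace pitfall mentioned above can be handled by normalizing both sides before comparing (a sketch; `normalize` is an illustrative helper, and the lowercasing step is an extra assumption that also makes the match case-insensitive, which you may not want):

```python
import re

def normalize(question):
    # Collapse runs of internal whitespace to single spaces, trim the ends,
    # and lowercase so that cosmetic differences do not break the match.
    return re.sub(r"\s+", " ", question).strip().lower()

def exact_match_sql(question, question_sql_list):
    target = normalize(question)
    for item in question_sql_list:
        if normalize(item['question']) == target:
            return item['sql']
    return None
```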

andreped commented 7 months ago

@jankariwo Can you make a PR?