vanna-ai / vanna

🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using RAG 🔄.
https://vanna.ai/docs/
MIT License
11.01k stars 858 forks source link

Fix regex in `extract_sql` to correctly handle SQL containing CTEs (without matching WITHIN) #512

Closed mahendrasinghbora closed 3 months ago

mahendrasinghbora commented 3 months ago

Pull Request Description:

This pull request addresses an issue in the extract_sql function where the regex used to extract SQL queries from LLM responses incorrectly matches the keyword "WITHIN". This results in incomplete and invalid SQL queries being extracted. The regex has been updated to use word boundaries to ensure accurate matching of the "WITH" keyword at the start of CTEs. This fix resolves the syntax errors encountered when running the extracted SQL queries.

Related Previous Pull Requests:

Issue: When a response contains a SQL query with the keyword "WITHIN", the regex mistakenly captures partial SQL statements, resulting in invalid SQL. An example of such an SQL query is provided below:

Example SQL:

SELECT
    CASE
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) < 20 THEN '< 20'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 20 AND 29 THEN '20-29'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 30 AND 39 THEN '30-39'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 40 AND 49 THEN '40-49'
        ELSE '50+'
    END AS age_group,
    COUNT(DISTINCT customer.customer_id) AS customer_count,
    SUM(invoice.invoice_amount) AS total_spent,
    COUNT(invoice.invoice_id) AS total transactions,
    AVG(invoice.invoice_amount) AS average spent per transaction,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY invoice.invoice_amount) AS median_spent
FROM
    customer
    LEFT JOIN invoice ON customer.customer_id = invoice.customer_id
GROUP BY
    age_group;

Extracted SQL before fix:

WITHIN GROUP (ORDER BY invoice.invoice_amount) AS median_spent
FROM
    customer
    LEFT JOIN invoice ON customer.customer_id = invoice.customer_id
GROUP BY
    age_group;

Solution: The regex was updated to use word boundaries to correctly match the keyword "WITH" at the start of a CTE, ensuring accurate extraction of SQL statements.

Updated regex:

sqls = re.findall(r"\bWITH\b .*?;", llm_response, re.DOTALL)

The current fix ensures that SQL extraction handles CTEs properly without misinterpreting "WITHIN" as a CTE starter.