timescale / pgai

Bring AI models closer to your PostgreSQL data
PostgreSQL License
315 stars 11 forks source link

Text-to-SQL #24

Open jgpruitt opened 3 weeks ago

jgpruitt commented 3 weeks ago

Experiment with text-to-sql from within the database. Can we embed the pg_catalog and use it to power text-to-sql?

Wang-Yong2018 commented 2 weeks ago

Hope Embedding pg_catalo can improve text to sql be more nice that traditional zeroshot or one-shot mod.

I based on zero shot idea develop a shiny llm app. It has 5 steps from convert catalog to final get data.

  1. convert pg_catalog into create table syntax
  2. get dbms name
  3. add user input.
  4. ask AI(for example) generate slq
  5. get data via AI based SQL

For example, I want to know "find the top 5 sold music name and total revenue and quantity as well as average unit price", the ai will generate sql for me basedon schema and question, I extract data using the ai sql.

image

Below is the detail steps.

  1. convert the pg_catalog into sql-ddl sentence, we can

    "select
                        'CREATE TABLE ' || nspname || '.' || relname || ' (' || chr(10)||
                         array_to_string(
                            array_agg(attname || 
                                      ' '     || 
                                      atttypid::regtype::text||
                                      CASE attnotnull WHEN true THEN ' NOT NULL' ELSE ' NULL' end ||
                                      coalesce('  -- COMMENT ' || quote_literal(description), '')|| 
                                      ','  || chr(10)
                                      ), '  '  
                                      )  || chr(10)|| ');' as definition
                      from 
                          pg_attribute
                          join
                              pg_class on
                            pg_class.oid = pg_attribute.attrelid
                          join
                              pg_namespace on
                            pg_namespace.oid = pg_class.relnamespace
                          left join
                              pg_description on
                            pg_description.objoid = pg_class.oid
                            and pg_description.objsubid = pg_attribute.attnum
                      where
                            nspname not in ('pg_catalog', 'information_schema')
                            and relkind in ('r', 'v')
                      group by
                        nspname, relname;"
  2. to get dbms name, as this is only postgres. we can say it is constranst

  3. system prompt for text to sql


/* 1. Given the following database schema: */
{sql_ddl}

/* 2. Answer user question followed :
{user_question}

/* 3. Follow below SQL code standard
 - 3.1. To begin with, check if the question can be answered by SQL, return explanation about how to change question.
 - 3.1.1 explanation should be comments with SQL single line comments format!!!
 - 3.2. If the question could be answer by SQL, use following format:
 - 3.2.1 Prefer to use common table expression.
 - 3.2.2 Only use the tables and fields defined in the database schema
 - 3.2.3 Generate single table query when possible.
 - 3.2.4 Only left join tables which have same field name and nor more than 3 tables join 
 - 3.2.5 If necessary,explaination code geneation  with SQL single line comments format
 - 3.2.6 the DBMS is {dbms_name}. if DBMS is sqlite, pls only use SQL-92 standard.
 - 3.2.7 the SQL code should be ending with a semicolon.
*/
  1. one-shot or few shot it is get data sample for each of the table records to enable large lanage model to understandard the data. as LLM has context input limit, so, ignore this step, except llm suport 16K or more input prompt.

=======================================================

Attache my shiny_llm_map git link and demo line for your reference

  1. demo link https://[r7prz5-yong-wang.shinyapps.io/shiny_llm_map/](https://r7prz5-yong-wang.shinyapps.io/shiny_llm_map/)
  2. git link https://github.com/Wang-Yong2018/shiny_llm_map