tobymao / sqlglot

Python SQL Parser and Transpiler
https://sqlglot.com/
MIT License
6.78k stars 714 forks source link

Issue with Extracting All Tables and Subqueries from Top-Level FROM Clause #4409

Closed nipilii closed 1 week ago

nipilii commented 1 week ago

Thank you for creating and maintaining sqlglot. It's been incredibly helpful for SQL parsing tasks. I encountered an issue when trying to extract all tables and subqueries from the top-level FROM clause of a query and was hoping you could provide some guidance.

Example Query SELECT t1.col1, t2.col2 FROM table1 t1, table2 t2, (SELECT col1, col2 FROM table2 WHERE col3 = 'A') t3 WHERE t1.col1 = t2.col1;

Desired Output For the above query, I would like to extract all tables and subqueries from the top-level FROM clause:

table1 t1 table2 t2 Subquery t3 with its SQL.

Current Behavior Here’s what I am doing:

parsed = parse_one(query) from_clause = parsed.args.get("from") print("Extracted FROM Clause:", from_clause)

def explore_from_clause(from_clause): if not from_clause: return if isinstance(from_clause, exp.From): items = from_clause.args.get("expressions", []) or [from_clause.args.get("this")] for item in items: if isinstance(item, exp.Table): # Table handling print(f"Table: {item.this}, Alias: {item.alias}") elif isinstance(item, exp.Subquery): # Subquery handling print(f"Subquery Alias: {item.alias}") print(f"Subquery SQL: {item.this.sql()}") else: print(f"Unhandled Expression: {item}") else: print("No valid FROM clause found.")

explore_from_clause(from_clause)

Result: The first table, table1 t1, gets extracted correctly. The subquery t3 can be manually extracted. However, table2 t2 is omitted entirely from the from_clause.

Additional Example Without a Subquery For a simpler query like: SELECT t1.col1, t2.col2, t3.col3 FROM table1 t1, table2 t2, table3 t3;

I only get the first table (table1 t1) when analyzing the from_clause using sqlglot. The other tables (table2 t2 and table3 t3) are omitted.

My Question

  1. How can I reliably extract all tables and subqueries from the top-level FROM clause, including all tables like table2 t2 that currently get skipped?
  2. Is there a better approach or method in sqlglot to handle this scenario, especially for queries with multiple tables or subqueries in the FROM clause?

Any help or guidance would be greatly appreciated!

VaggelisD commented 1 week ago

Hey @nipilii,

Thanks for your kind words! I'd suggest that you:

  1. Go through our AST Primer, which is our main tutorial on how to traverse & extract nodes
  2. When parsing a query, printing the repr of the top-level node to see what's going on under the hood:
>>> import sqlglot
>>> ast = sqlglot.parse_one("SELECT * FROM table1 t1, table2 t2")
>>> print(repr(ast))
Select(
  expressions=[
    Star()],
  from=From(
    this=Table(
      this=Identifier(this=table1, quoted=False),
      alias=TableAlias(
        this=Identifier(this=t1, quoted=False)))),
  joins=[
    Join(
      this=Table(
        this=Identifier(this=table2, quoted=False),
        alias=TableAlias(
          this=Identifier(this=t2, quoted=False))))])

As you can see, the comma separated tables in this case are parsed as JOINs so they'd be stored in the joins arg of the exp.Select, which is on the same level as from on the parsed tree.

I'll go ahead and close this, for any similar/follow up questions you can also join our Slack channel.