API method to expose the list of referenced tables?

imathews commented 1 year ago

It would be really great if the API exposed a mechanism to get a list of all tables referenced by the query. Right now, I'm using a table visitor to do this, keeping track of any aliases / CTEs to differentiate between table identifiers declared within the query vs those that are externally referenced. However, I'm not sure that my approach is as robust as it could be, and it would be quite helpful if the library exposed a helper to get this information directly.

node-sql-parser provides something similar via the tableList method, as well as the ability to get a list of visited columns (the latter I imagine is also useful, though not immediately relevant to my use case).

I also appreciate that this may be considered out-of-scope for this library and better implemented in userspace.

nene commented 1 year ago

That's definitely out of scope for this library.

Right now, I'm using a table visitor to do this, keeping track of any aliases / CTEs to differentiate between table identifiers declared within the query vs those that are externally referenced.

You're clearly on the right track with this approach. You're most likely already doing a better job at it than the node-sql-parser does.

Anyway... there's no magic info about the tables inside the parser that I could easily expose. I'd have to implement it on top of the parser just as you are currently doing.

If you end up implementing such an utility library, I could add a reference to it into README. (Always nice to have some actual users of the parser besides myself :) ).

imathews commented 1 year ago

That makes total sense. I'll carry forward with my approach, which I think should be reasonable. One of the trickier edge cases will be when a subquery alias exists in one part of the query that uses the same name as a table referenced in another part of the query. But I should definitely have all the information I need within the CST, and that edge case might be uncommon enough to ignore for now for my purposes...

Attaching below what I've hacked together so far, in case it's helpful to others. I definitely need to test and validate more, but it's "working" for a handful of complex SQL statements I threw at it. Though I certainly wouldn't rely on this logic for anything security / table access control related.

Thanks again for your work on this library. It's certainly not lost on me how much effort went into putting it together, and so far it's been very robust in my testing.

const possibleTables = new Set();
const aliasedTables = new Set();

function handlePossibleTableNode(node) {
    if (node.type === 'identifier') {
        possibleTables.add(node.name.toLowerCase());
    } else if (node.type === 'alias') {
        aliasedTables.add(node.alias.name.toLowerCase());
        if (node.expr.type === 'identifier') {
            possibleTables.add(node.expr.name.toLowerCase());
        } else if (node.expr.type === 'bigquery_quoted_member_expr') {
            node = node.expr;
        }
    }
    // TODO: confirm how consistent the shape of this is, e.g. for queries scoped to a default GCP project
    if (node.type === 'bigquery_quoted_member_expr') {
        const arr = [
            node.expr.object?.object?.name,
            node.expr.object?.property?.name,
            node.expr.property.name,
        ].filter((str) => str);
        possibleTables.add(arr.join('.').toLowerCase());
    }
}

const tableVisitor = cstVisitor({
    common_table_expression: (node) => {
        aliasedTables.add(node.table.name.toLowerCase());
    },
    from_clause: (node) => {
        handlePossibleTableNode(node.expr);
    },
    join_expr: (node) => {
        handlePossibleTableNode(node.left);
        handlePossibleTableNode(node.right);
    },
});

tableVisitor(cst);

for (const tableName of aliasedTables) {
    possibleTables.delete(tableName);
}
return [...possibleTables]

nene commented 1 year ago

This brings to my mind a few things:

It would likely be better to implement this functionality on top of an AST, rather than CST. I have also started with a sql-parser-ast library, but that work is still in very early stages. Probably wouldn't make a huge difference for you, just the syntax tree would be a bit less complex.
There's currently no way to control the iteration over nodes within cstVisitor. Like, when you've found a subquery, you might want to say, hey, don't iterate over all child nodes, let me handle it. I'm thinking of adding something similar to what estraverse has, where one can just return a special value from the visitor function, so one can write something like this:

// Count select statements (but ignore subqueries)
let topLevelSelects = 0;
cstVisitor({
  select_stmt: (node) => {
    topLevelSelects++;
    return cstVisitor.SKIP;
  }
});

Finally I notice that I should rename common_table_expression to common_table_expr, so it would be consistent with all other _expr node types.

imathews commented 1 year ago

Thanks, yea I saw the AST library and am keeping an eye on it. Though in my case, I think it'll make sense to preserve the CST, because in some places we need to change identifiers in the query while otherwise preserving whitespace.

Something that could be used to short-circuit cstVisitor would be nice, although not currently a big problem for my approach. Saw the other updates in the last release — thanks, as always!

nene / sql-parser-cst

API method to expose the list of referenced tables? #36