Expose node position in input string

Jille commented 1 year ago

I'd like to know the position in the input string each node is at. I'm working on sqlc support for inserting multiple rows and want to find the location of the VALUES clause in the input string to repeat it for each row.

(I know COPY FROM is better, faster and stronger - but it lacks ON CONFLICT)

lfittl commented 1 year ago

Hi @Jille,

Generally speaking, node positions are already tracked - but unfortunately only at the level of individual A_Const nodes. Here is an example with the Ruby library (same approach applies to Go, its just easier to prototype for me):

q = "INSERT INTO x VALUES (1,2), (3, 4)"
PgQuery.parse(q).tree.stmts[0].stmt.insert_stmt.select_stmt.select_stmt.values_lists.map {|i| i.list.items.map {|j| j.a_const.location } }
=> [[22, 24], [29, 32]]

For reference the relevant portion of the parse tree (i.e. one item in the VALUES list), looks like this here:

<PgQuery::Node: list: <
  PgQuery::List: items: [
    <PgQuery::Node: a_const: <PgQuery::A_Const: val: <PgQuery::Node: integer: <PgQuery::Integer: ival: 1>>, location: 22>>,
    <PgQuery::Node: a_const: <PgQuery::A_Const: val: <PgQuery::Node: integer: <PgQuery::Integer: ival: 2>>, location: 24>>
  ]
>>

Now I assume it would be helpful in your case if the List node (i.e. above the A_Const nodes), had a location attached - since the individual constant locations will not match the exact start of each row (e.g. in this example position 22 is the first value "1", instead of the opening parenthesis "(").

However, short of a patch to the Postgres portion of the parser, this can't be easily added. What's more, unfortunately the Node element in question (List), is very commonly used - its also special in the sense that whilst its a valid parse node, the node definition is kept separate (directly in pg_list.h), and its often used in contexts outside the parse tree - so its very unlikely that an upstream patch would be accepted that adds a location field to that node.

Now we could potentially carry an additional patch to the Postgres source (we already have a small set), that adds location to the List node - my main worry there is that we're making all parser outputs a lot bigger (though we could mitigate that by gating this via an extra flag), and that we have one more patch to rebase for each new version. So this would be possible, but time intensive with a small amount of long-term maintenance overhead.

Thus, overall, I would focus on the first element on each list, and then go back one token to find the actual starting point of the row. You can use the scanner to find the token positions, here again a Ruby example for reference:

PgQuery.scan("INSERT INTO x VALUES (1,2), (3, 4)")[0].tokens
=> [<PgQuery::ScanToken: start: 0, end: 6, token: :INSERT, keyword_kind: :UNRESERVED_KEYWORD>,
 <PgQuery::ScanToken: start: 7, end: 11, token: :INTO, keyword_kind: :RESERVED_KEYWORD>,
 <PgQuery::ScanToken: start: 12, end: 13, token: :IDENT, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 14, end: 20, token: :VALUES, keyword_kind: :COL_NAME_KEYWORD>,
 <PgQuery::ScanToken: start: 21, end: 22, token: :ASCII_40, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 22, end: 23, token: :ICONST, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 23, end: 24, token: :ASCII_44, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 24, end: 25, token: :ICONST, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 25, end: 26, token: :ASCII_41, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 26, end: 27, token: :ASCII_44, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 28, end: 29, token: :ASCII_40, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 29, end: 30, token: :ICONST, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 30, end: 31, token: :ASCII_44, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 32, end: 33, token: :ICONST, keyword_kind: :NO_KEYWORD>,
 <PgQuery::ScanToken: start: 33, end: 34, token: :ASCII_41, keyword_kind: :NO_KEYWORD>]

start: 21, end: 22, token: :ASCII_40 is the parenthesis that starts the first row, start: 28, end: 29, token: :ASCII_40 is the parenthesis that starts the second row, etc.

In Go that data is accessible via the Scan function.

Hope that helps!

Jille commented 1 year ago

Nice, I can make that work. Thanks for the detailed reply Lukas!

pganalyze / pg_query_go

Expose node position in input string #90