Plugging in a lex/yyac layer into pg_hint_plan

michaelpq commented 1 year ago

Hi all,

I was looking at the proposed patch to allow the parsing of hints in multiple locations: https://github.com/ossc-db/pg_hint_plan/pull/101. FWIW, my opinion about it is that it is not acceptable as it stands.

First, I am pretty sure that there are some dark patterns where that would break depending on the query parsed. Second, this is a duplication of a logic that already exists in Postgres core: src/backend/parser/, mostly around scan.l that has the flex rules. And the core code has been studied and maintained for many years. An issue with the core code is that it is not possible to rely on its rules because anything related to comments is ignored. See this part in scan.l:

<xc>{
{xcstart}       {
                    (yyextra->xcdepth)++;
                    /* Put back any characters past slash-star; see above */
                    yyless(2);
                }

{xcstop}        {
                    if (yyextra->xcdepth <= 0)
                        BEGIN(INITIAL);
                    else
                        (yyextra->xcdepth)--;
                }

{xcinside}      {
                    /* ignore */
                }

So the contents of the comments are simply skipped, and something like core_yylex(), that would retrieve the tokens from the grammar, does not give the possibility to get access to that, either. There are two things that I think can be done here:

Submit a patch to upstream and make the core parser more pluggable for comment contents, which could be done by saving the contents from the comments in a dedicated area, for example. That would be grotty, and I suspect rejection until we have an actual proposal for query hints that could target upstream. If this were to happen, this would be only part of 17.
Introduce a copy of upstream's scan.l for the sake of parsing the hints, including all the rules for escapes, quotes, and such on the way. That's obviously heavy, though scan.l could be made lighter so as we ignore all the contents we don't want, except the hints. Then we'd need to plus in a yacc layer that generates tokens for the hints, based on the contents retrieved from the comments.

I'd like to think that 2. would be the best way to move forward if we were to allow hints in various places. Note that having a proper lex/yacc parser would also simplify a lot pg_hint_plan.c for:

The generation of the hint nodes, switching a list handling, without the need of having all the current code parsing the hint strings.
The code retrieving the hints from the query strings.

At this point, I think that I'd like to hear opinions from @MasahikoSawada @rjuju and @horiguti. Perhaps you have better ideas, I don't know.

rjuju commented 1 year ago

Why would you want to base such an effort on the backend's parser rather than psql / psqlscan.l?

Also, it's unclear to me what would be the comment's scope if provided after the first keyword. If that's supposed to be something smaller than the whole query, shouldn't you actually have to parse the query and maintain some state to know which keyword a comment should be attached to, and eventually which (join|subquery|cte|whatever) scope it belongs to?

michaelpq commented 1 year ago

Why would you want to base such an effort on the backend's parser rather than psql / psqlscan.l?

psqlscan.l would be fine as well, they have the same basics to grab the values. I was wondering whether it would make sense to generate tokens for the hints, FWIW. Perhaps that does not have to be something in the lexer that extracts the hints in the query, though.

Also, it's unclear to me what would be the comment's scope if provided after the first keyword.

I am not sure to understand what you mean here, TBH.

If that's supposed to be something smaller than the wholhttps://github.com/ossc-db/pg_hint_plan/pull/101e query, shouldn't you actually have to parse the query and maintain some state to know which keyword a comment should be attached to, and eventually which (join|subquery|cte|whatever) scope it belongs to?

Some context could make sense if the same aliases are used across the board, but I am not much enthusiastic about adding this much extra facility if there is no direct ask for it. #101 seems to refer also to queries generated by ORMs, with hits in it, causing pg_hint_plan to grab all the hints as a single list, for simplicity.

rjuju commented 1 year ago

psqlscan.l is still simpler way than scan.l IIRC.

I am not sure to understand what you mean here, TBH.

For instance if you have something like that (not using real hint syntax):

WITH s AS (
    /* ForceNestedLoop */
    SELECT * FROM a JOIN b
)
/* ForceHashJoin /*
SELECT * FROM src JOIN c

What are you supposed to do? Is the nestedloop supposed to be local to the CTE or anywhere in the query? If anywhere, are you supposed to use HashJoin everywhere because that's the last hint found, or nested loops everywhere because it's the first hint found?

And FWIW the issue you mention looks like the perfect example of why hints are a bad idea in the first place :(

michaelpq commented 1 year ago

psqlscan.l is still simpler way than scan.l IIRC.

Yes.

What are you supposed to do? Is the nestedloop supposed to be local to the CTE or anywhere in the query? If anywhere, are you supposed to use HashJoin everywhere because that's the last hint found, or nested loops everywhere because it's the first hint found?

Do you think that there is an argument where we'd support such hints in the future? Digging around the Oracle docs, they indeed have cases for such hints.

Anyway, I don't see quite how one would be able to retain some context from the query for each hint given except if the hints are authorized in some pre-defined locations of the query: like just after the keywords SELECT/INSERT/UPDATE/DELETE. There are as well cases like UNION ALL, for example, where we could not really pull a hint from the top of a CTE as you say. This means that we'd need the full gram.y yyac and attach the hints into the parsed state itself. Using your previous example, this would be also weird:

WITH s AS (
    /* ForceNestedLoop */
    SELECT * FROM a1 JOIN b1
    UNION ALL SELECT * FROM a2 JOIN b2
)
/* ForceHashJoin /*
SELECT * FROM src JOIN c;

In the first case, should the CTE hint apply to the first part of the UNION or the second? On the other hand, if we are more restrictive with the location, we could have:

WITH s AS (
    SELECT /* ForceNestedLoop */ * FROM a1 JOIN b1
    UNION ALL SELECT * FROM a2 JOIN b2
)
SELECT /* ForceHashJoin */ * FROM src JOIN c;

At least what applies where would be clear.

rjuju commented 1 year ago

Do you think that there is an argument where we'd support such hints in the future?

I don't know, I still don't understand what exact semantics the user that wrote the original PR wanted.

In any case I don't think it would be wise to implement anything before those questions are answered, which should also probably be "whatever oracle is doing".

This means that we'd need the full gram.y

Yes, that's what I meant when I said it would require to actually parse the query.

michaelpq commented 1 year ago

In any case I don't think it would be wise to implement anything before those questions are answered, which should also probably be "whatever oracle is doing".

I'd guess that this is exactly the correct point: pg_hint_plan should allow semantics as close to Oracle as possible to make migrations to Postgres easier. At least that makes decisions easier to make :)

Yes, that's what I meant when I said it would require to actually parse the query.

Meaning that the maintenance cost would be too high anyway :(

On top of #132, there are two more things that may be worth doing even if we only support for single hint areas for now:

Plug in psqlscan.l or an equivalent with custom rules for (or a new that accepts only a strict "/*+") to save the contents of what's inside in a StringInfo. This has two benefits: easier control about nested hints and comments, removal of the existing parsing logic in get_hints_from_comment() (this code is kind of hard to grasp even now). There's still a cost, still it would be rather minimal. Actually, HEAD would consider as valid hints quoted string values in WHERE quals or such?
Have a second lex/yyac to replace the existing HintParse() callbacks, where we'd gain more readability?

2 does not necessarily need 1, and vice-versa.

I have also noticed that all the *HintParse() routines have a Query as argument, but none of them make use of it. This can be an extra cleanup.

rjuju commented 1 year ago

Fix the bug that makes a hint span over multiple queries if a single query string containing multiple statements is used, a. For instance:
```
LOAD 'pg_hint_plan';
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(id integer, val text);
CREATE INDEX ON t1(id);
VACUUM ANALYZE t1;
```

/+ IndexScan(t1) / EXPLAIN SELECT FROM t1 WHERE id = 1\; EXPLAIN SELECT FROM t1 WHERE id = 2; EXPLAIN SELECT * FROM t1 WHERE id = 3;

which incorrectly yields: [...] QUERY PLAN

Index Scan using t1_id_idx on t1 (cost=0.12..8.14 rows=1 width=36) Index Cond: (id = 1) (2 rows) QUERY PLAN

Index Scan using t1_id_idx on t1 (cost=0.12..8.14 rows=1 width=36) Index Cond: (id = 2) (2 rows) QUERY PLAN

Seq Scan on t1 (cost=0.00..0.00 rows=1 width=36) Filter: (id = 3) (2 rows)

michaelpq commented 1 year ago

Fix the bug that makes a hint span over multiple queries if a single query string containing multiple statements is used, a. For instance:

This one has made me smile, nice. So yes, it looks like we'd better have a light copy of psqlscan.l or equivalent and grab the hints from the lexer.

(Apologies, I have somewhat messed up with your previous comment while replying back on the thread). This is now fixed.

michaelpq commented 1 year ago

I have been looking at psqlscan.l, and it should be OK to plug in that directly to pg_hint_plan, with a few simplifications:

No need to handle backslash commands.
No need to handle variables.
No need for the prompt handling, simplifying a lot psql_scan and its State, so we should
There should be no need for psql_scan() at all, if I'm correct.
Most of the states are things we don't really care about, though it looks like we should care about the xc comments to check for nesting patterns.

@rjuju Would you agree to bring in a copy of psqlscan.l with some parts grabbed from its headers? There are no need for callbacks, for example. And I think that we could just stop scanning the lexer after we find the first hint with some specific rule that match "/*+", superseeding the ones of . Then using yylex() should be enough to do the work, saving the hint string from a xcinside.

This creates an extra long-term cost, but it really looks like the benefits we'd gain from a proper parsing of the hints from the query are much larger than that.

michaelpq commented 1 year ago

It took me some time to get back to it, but I have been able to implement this stuff with the following patch: https://github.com/ossc-db/pg_hint_plan/pull/138.

michaelpq commented 10 months ago

This stuff is finally done on HEAD. It took me a few months of coming back-and-forth to it, but that's now done.

ossc-db / pg_hint_plan