universal-ctags / ctags

A maintained ctags implementation
https://ctags.io
GNU General Public License v2.0
6.56k stars 625 forks source link

SQL parser - Oracle PL/SQL - no symbols generated after inquiry directive ($$name) #3006

Closed bagl closed 3 years ago

bagl commented 3 years ago

SQL parser produces no tags after encounter with PL/SQL directive ($$name)

Here's the failing test case

diff --git a/Units/parser-sql.r/bug-plsql-directive.sql.d/args.ctags b/Units/parser-sql.r/bug-plsql-directive.sql.d/args.ctags
new file mode 100644
index 00000000..3eebe642
--- /dev/null
+++ b/Units/parser-sql.r/bug-plsql-directive.sql.d/args.ctags
@@ -0,0 +1,3 @@
+--sort=no
+--extras=+q
+--kinds-SQL=+d
diff --git a/Units/parser-sql.r/bug-plsql-directive.sql.d/expected.tags b/Units/parser-sql.r/bug-plsql-directive.sql.d/expected.tags
new file mode 100644
index 00000000..4ed48eb9
--- /dev/null
+++ b/Units/parser-sql.r/bug-plsql-directive.sql.d/expected.tags
@@ -0,0 +1,5 @@
+demo_pkg       input.sql       /^create or replace package body demo_pkg is$/;"        P
+demo_pkg.test_var      input.sql       /^test_var varchar2(64) := $$PLSQL_UNIT;$/;"    v       package:demo_pkg
+test_var       input.sql       /^test_var varchar2(64) := $$PLSQL_UNIT;$/;"    v       package:demo_pkg
+demo_pkg.test_func     input.sql       /^function test_func return varchar2$/;"        f       package:demo_pkg
+test_func      input.sql       /^function test_func return varchar2$/;"        f       package:demo_pkg
diff --git a/Units/parser-sql.r/bug-plsql-directive.sql.d/input.sql b/Units/parser-sql.r/bug-plsql-directive.sql.d/input.sql
new file mode 100644
index 00000000..85ce5565
--- /dev/null
+++ b/Units/parser-sql.r/bug-plsql-directive.sql.d/input.sql
@@ -0,0 +1,17 @@
+// Fails to create tags after PLSQL inquiry directive is used
+// https://docs.oracle.com/en/database/oracle/oracle-database/18/lnpls/plsql-language-fundamentals.html#GUID-E918087C-D5A8-4CEE-841B-5333DE6D4C15
+//
+// 'parseDollarQuote' causing troubles?
+//
+create or replace package body demo_pkg is
+
+test_var varchar2(64) := $$PLSQL_UNIT;
+
+function test_func return varchar2
+as
+begin
+    return test_var;
+end;
+
+end demo_pkg;

Test case result showing the missing tags after directive

--- ./Units/parser-sql.r/bug-plsql-directive.sql.d/expected.tags        2021-05-05 15:59:16.700000000 +0200
+++ /home/petr/devel/ctags/Units/parser-sql.r/bug-plsql-directive.sql.d/FILTERED.tmp    2021-05-05 16:54:58.160000000 +0200
@@ -4,2 +3,0 @@
-demo_pkg.test_func     input.sql       /^function test_func return varchar2$/;"        f       package:demo_pkg
-test_func      input.sql       /^function test_func return varchar2$/;"        f       package:demo_pkg

The PL/SQL directive is a token of two dollar signs followed by alpha and possibly more alphanum or #$ characters (regex: `\$\$[A-Za-z][A-Za-z0-9$#]*). This definition will probably collide with PostgreSQL [dollar quoted strings](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-DOLLAR-QUOTING) such as$$Dianne's horse$$or$SomeTag$Dianne's horse$SomeTag$` implemented here.

The version of ctags:

$ ctags --version
Universal Ctags 5.9.0(45cd9088), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: May  5 2021, 09:15:34
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +iconv, +option-directory, +packcc, +optscript

How do you get ctags binary: Building it locally

masatake commented 3 years ago

Thank you for reporting and the analysis. I agree with your analysis. I cannot find the way to accept both dialects in this case.

masatake commented 3 years ago

The names of predefined inquiry directives are given: PLSQL. We can use it as a heuristic for deciding how to handle $$.

However, it seems that a user can define inquiry directives.

I wonder how to define inquiry directives.

I found an example in https://www.morganslibrary.org/reference/plsql/inquiry_directives.html.


ALTER SESSION SET PLSQL_CCFlags = 'UW_Flag:1, Some_Flag:2, PLSQL_CCFlags:42';

set serveroutput on

BEGIN
  dbms_output.put_line($$UW_Flag);
  dbms_output.put_line($$Some_Flag);
  dbms_output.put_line($$PLSQL_CCFlags);
END;
/

In the example, $$UW_Flag and $$Some_Flag. For them, the heuristic doesn't work.

My question is about the line:

ALTER SESSION SET PLSQL_CCFlags = 'UW_Flag:1, Some_Flag:2, PLSQL_CCFlags:42';

It seems that UW_Flag and Some_Flag are defined here.

I got some questions:

  1. Should ctags extrac tags for UW_Flag and Some_Flag in ALTER SESSION SET PLSQL_CCFlags = 'UW_Flag:1, Some_Flag:2, PLSQL_CCFlags:42'; ? If the answer is yes, what "kind" should we use for tagging?
  2. Is there another way to define inquiry directives?

I'm thinking about making a context-dependent table recording the names of "inquiries directives". Known predefined ones are stored when building the table.

masatake commented 3 years ago

I found hints in "2.9.1.4.2 Assigning Values to Inquiry Directives" of https://docs.oracle.com/en/database/oracle/oracle-database/18/lnpls/plsql-language-fundamentals.html#GUID-623713EB-4FD1-4C8B-955C-4135342F5F68.

bagl commented 3 years ago

@masatake thank you for your interest and swift reaction.

I like your idea to special case the predefined directives, as the names are rather specific and I cannot image using them frequently in postgresql quoted strings.

1) The alter session statements would most likely not reside in your source code repository, so I don't think these tags must be extracted. 2) I haven't found any other way, but it's possible there will be some standard DBMS_xyz package that would do the same assignment as the alter session above (that's usually the case, but I've not found such package)

masatake commented 3 years ago

@bagl, thank you.

It seems that I found another bug. Could you help me to confirm the bug?

create or replace package body demo_pkg2 is

test_var2 varchar2(64) := $xyz$ABC$xyz$;

function test_func2 return varchar2
as
begin
    return test_var2;
end;

end demo_pkg2;

Do you think this is valid input for postgresql? I think this is valid. In that case ctags should capture test_func2. However, it doesn't because the SQL parser doesn't handle ; after $$.

diff --git a/parsers/sql.c b/parsers/sql.c
index 8e93f2881..c1794f604 100644
--- a/parsers/sql.c
+++ b/parsers/sql.c
@@ -710,13 +710,13 @@ static tokenType parseDollarQuote (vString *const string, const int delimiter)
                                end_p++;
                        }

+                       if (c != EOF)
+                               ungetcToInputFile (c);
+
                        if (! *end_p) /* full tag match */
                                break;
                        else
-                       {
-                               ungetcToInputFile (c);
                                vStringNCatS (string, tag, (size_t) (end_p - tag));
-                       }
                }
        }

The way to fix is simple but I cannot write a test case. So I need your help.


I found postgresql doesn't have the concept 'package'. So the above code is invalid. I'm looking for a pattern like:

...
$$QUOTED STRING$$;something input including a language object to be tagged

The SQL parser doesn't handle $$; well. So the language object may not be tagged.

masatake commented 3 years ago

Preprocessor tokens can collide with the doller quotes, too.

masatake commented 3 years ago

I update the pull request. The update code can handle "$$USER_DEFINED_CCFLAG" as far as it is defined in a same source code.

bagl commented 3 years ago

To make it a bit worse, it looks like in PostgreSQL one defines functions using a string literal... Below are 2 valid PostgreSQL functions containing $$; character sequence:

# input1.sql
CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars() RETURNS VARCHAR AS '
DECLARE
    test_var_no_tag VARCHAR(64) := $$ABC$$;
    test_var_tagged VARCHAR(64) := $xyz$XYZ$xyz$;
    test_var        INTEGER := 1;
BEGIN
    RETURN  TO_CHAR(test_var, ''000'') || test_var_no_tag || test_var_tagged;
END;
' LANGUAGE plpgsql;

# input2.sql
CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars2() RETURNS VARCHAR LANGUAGE plpgsql AS $$
DECLARE
    test_var_no_tag VARCHAR(64) := 'ABC2';
    test_var_tagged VARCHAR(64) := 'XYZ2';
    test_var        INTEGER := 2;
BEGIN
    RETURN  TO_CHAR(test_var, '000') || test_var_no_tag || test_var_tagged;
END;
$$;

The first function body is defined using a string quoted with ' and uses dollar quoted literals for the test_var_*.

The second function body is defined using dollar quoted literal but it should not generated any tags by itself (this is tested here and here but without any tags inside the function body).

I expect the following ctags to be generated

expected12.tags

test_dollar_quoted_string_vars  input.sql   /^CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars() RETURNS VARCHAR AS '$/;"  f
test_var_no_tag input.sql   /^test_var_no_tag VARCHAR2(64) := $$ABC$$;$/;"  v   function:test_dollar_quoted_string_vars
test_var_tagged input.sql   /^test_var_tagged VARCHAR2(64) := $xyz$XYZ$xyz;$/;" v   function:test_dollar_quoted_string_vars
test_var    input.sql   /^test_var INTEGER := 1;$/;"    v   function:test_dollar_quoted_string_vars
test_dollar_quoted_string_vars2 input.sql   /^CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars() RETURNS VARCHAR LANGUAGE plpgsql AS \$\$$/;"  f
test_var_no_tag input.sql   /^test_var_no_tag VARCHAR2(64) := 'ABC2;$/;"    v   function:test_dollar_quoted_string_vars2
test_var_tagged input.sql   /^test_var_tagged VARCHAR2(64) := 'XYZ2;$/;"    v   function:test_dollar_quoted_string_vars2
test_var    input.sql   /^test_var INTEGER := 2;$/;"    v   function:test_dollar_quoted_string_vars2

Good point regarding the Oracle preprocessor tokens. They have the form $identifier and they state in the docs

The character $ can also appear inside plsql_identifier, but it has no special meaning there.

However, I've not found a way to create user defined preprocessor tokens (and cannot imagine what they would actually do). The predefined ones are:

hence all contain just 1 dollar sign and do not collide with Oracle directives or PostgreSQL dollar quoted strings.

They do, however, create other problems I think...

#input3.sql
CREATE FUNCTION test RETURN NUMBER
AS
    a $IF DBMS_DB_VERSION.VERSION < 10 $THEN NUMBER; $ELSE BINARY_DOUBLE; $END := 1;
BEGIN
    RETURN a;
END;

should generate following tags expected3.tags

test    input.sql   /^CREATE FUNCTION test RETURN NUMBER$/;"    f
a   input.sql   /^a $IF DBMS_DB_VERSION.VERSION < 10 $THEN NUMBER; $ELSE BINARY_DOUBLE; $END := 1;$/;"  v   function:test

(the file names are added by @masatake.)

bagl commented 3 years ago

The PostgreSQL quoted strings can even be nested...

CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars3() RETURNS VARCHAR LANGUAGE plpgsql AS $xx$
DECLARE
    test_var_no_tag VARCHAR(64) := $aa$ABC3$aa$;
    test_var_tagged VARCHAR(64) := $$XYZ3$$;
    test_var        INTEGER := 3;
BEGIN
    RETURN  TO_CHAR(test_var, '000') || test_var_no_tag || test_var_tagged;
END;
$xx$;

Here $xx$ is outer string containing $aa$ and $$ inner strings...

masatake commented 3 years ago

Thank you for the valuable input.

Do you know u-ctags has an ability to run a parser from another parser? See https://docs.ctags.io/en/latest/internal.html?highlight=guest#guest-parser-promise-api.

The question is how to know a string literal (or a quote string) is a SQL code or not. LANGUAGE plpgsql is an important hint.

In input1.sql, LANGUAGE plpgsql comes after the string literal. In input2.sql, LANGUAGE plpgsql comes before the quoted string. Could you give me more hints where LANGUAGE plpgsql is appeared on the code.

[jet@living]~/var/ctags-github% cat /tmp/input2.sql                                           
CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars2() RETURNS VARCHAR LANGUAGE plpgsql AS $$
DECLARE
    test_var_no_tag VARCHAR(64) := 'ABC2';
    test_var_tagged VARCHAR(64) := 'XYZ2';
    test_var        INTEGER := 2;
BEGIN
    RETURN  TO_CHAR(test_var, '000') || test_var_no_tag || test_var_tagged;
END;
$$;
[jet@living]~/var/ctags-github% ./ctags --quiet --options=NONE --sort=no -o - --kinds-SQL='*' --extras=+'{guest}' /tmp/input2.sql
test_dollar_quoted_string_vars2 /tmp/input2.sql /^CREATE OR REPLACE FUNCTION test_dollar_quoted_string_vars2() RETURNS VARCHAR LANGUAGE plpgsql AS/;"   f
test_var_no_tag /tmp/input2.sql /^    test_var_no_tag VARCHAR(64) := 'ABC2';$/;"    v
test_var_tagged /tmp/input2.sql /^    test_var_tagged VARCHAR(64) := 'XYZ2';$/;"    v
test_var    /tmp/input2.sql /^    test_var        INTEGER := 2;$/;" v
[jet@living]~/var/ctags-github% git diff --stat
 parsers/sql.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

It works. We can improve the accuracy of parsing if we know more about LANGUAGE plpgsql .

About input3.sql, ctags works as you expected. However, the kind of a is `l'.

[jet@living]~/var/ctags-github% ./ctags --kinds-SQL='*' -o - /tmp/bar.sql
a   /tmp/bar.sql    /^    a $IF DBMS_DB_VERSION.VERSION < 10 $THEN NUMBER; $ELSE BINARY_DOUBLE; $END := 1;$/;"  l   function:test
test    /tmp/bar.sql    /^CREATE FUNCTION test RETURN NUMBER$/;"    f

Is l (local variable) acceptable?

bagl commented 3 years ago

I had no idea you can "nest" parsers, that's useful :)

Based on the official PostgreSQL docs one defines functions as

CREATE [ OR REPLACE ] FUNCTION
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
    [ RETURNS rettype
      | RETURNS TABLE ( column_name column_type [, ...] ) ]
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | WINDOW
    | IMMUTABLE | STABLE | VOLATILE | [ NOT ] LEAKPROOF
    | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | PARALLEL { UNSAFE | RESTRICTED | SAFE }
    | COST execution_cost
    | ROWS result_rows
    | SUPPORT support_function
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } ...

Procedures work the same way

CREATE [ OR REPLACE ] PROCEDURE
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } ...

The conventions used read:

brackets ([ and ]) indicate optional parts Braces ({ and }) and vertical lines (|) indicate that you must choose one alternative.

Hence, for functions, we have funciton name, optional arguments and optional specification of return value followed by options such as LANGUAGE lang_name and AS 'definition'. These can come up in any order.

The built in languages are:

The pl prefix stands for Procedural Language and optional u suffix for unsafe (of no importance with respect to tagging). These could be used to select the proper parser for the body of the stored procedure/function in PostgreSQL.

New languages can be added and some of them are mentioned even in the docs:


I've missed that about input3.sql, local variable is perfectly fine, sorry.

masatake commented 3 years ago

Thank you very much. I think the original issue you reported is mostly fixed in #3012. I would like to merge #3012 first. I will open a new issue for parsing nested code.

bagl commented 3 years ago

Perfect. If you ever need help with SQL, I'll be here :)

Thank you for your effort!