universal-ctags / ctags

A maintained ctags implementation
https://ctags.io
GNU General Public License v2.0
6.52k stars 621 forks source link

[systemverilog] parser mistakes typedef for a module's port #2413

Closed antoinemadec closed 4 years ago

antoinemadec commented 4 years ago

The name of the parser: verilog.c

The command line you used to run ctags:

$ ctags --options=NONE foo.sv

The content of input file: foo.sv

typedef bit[31:0] int32_t;
module mod(
  input bit clk,
  input int32_t a
);
endmodule

The tags output you are not satisfied with:

clk     foo.sv  /^  input bit clk,$/;"  p       module:mod
int32_t foo.sv  /^  input int32_t a$/;" p       module:mod
int32_t foo.sv  /^typedef bit[31:0] int32_t;$/;"        T
mod     foo.sv  /^module mod($/;"       m
mod.clk foo.sv  /^  input bit clk,$/;"  p       module:mod
mod.int32_t     foo.sv  /^  input int32_t a$/;" p       module:mod

The tags output you expect:

clk     foo.sv  /^  input bit clk,$/;"  p       module:mod
a       foo.sv  /^  input int32_t a$/;" p       module:mod
int32_t foo.sv  /^typedef bit[31:0] int32_t;$/;"        T
mod     foo.sv  /^module mod($/;"       m
mod.clk foo.sv  /^  input bit clk,$/;"  p       module:mod
mod.a   foo.sv  /^  input int32_t a$/;" p       module:mod

The version of ctags:

$ ctags --version
Universal Ctags 0.0.0(81fa5b1a), Copyright (C) 2015 Universal Ctags Team                    
  Universal Ctags is derived from Exuberant Ctags.                                            
  Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert                                 
    Compiled: Feb  4 2020, 14:36:50                                                           
    URL: https://ctags.io/                                                                    
    Optional compiled features: +wildcards, +regex, +iconv, +option-directory, +xpath, +json, 
  +interactive, +sandbox, +yaml, +packcc
masatake commented 4 years ago

Thank you for reporting. Could you try #2492 ?

antoinemadec commented 4 years ago

Thanks a lot for your response!

2492 fixes my original file. Fix is definitely improving things.

However, it still fails if I do: ctags --options=NONE a.sv b.sv

Where: a.sv

`include "b.sv"

module mod(
  input bit clk,
  input int32_t a
);
endmodule

b.sv

typedef bit[31:0] int32_t;
masatake commented 4 years ago

Currently, ctags cannot track cross-input file dependencies like a.sv and b.sv. I'm trying to add features handling the cross-input file dependencies: https://github.com/universal-ctags/ctags/issues/2428

However, it will take more than a year to implement it. So I will merge #2492 first.

antoinemadec commented 4 years ago

@masatake got it, makes sense!

I think there is a way to fix this which does not require cross-input file dependencies. It rather simple: the name of the input/output/inout port is always the last word before a comma or the last parenthesis.

module mod(
  input dont_need_to_know_what_this_is port_name[64]
);

What do your think, would this be implementable?

Thanks, Antoine

masatake commented 4 years ago

Obviously Implementable. The question is who implements that algorithm.

The original maintainer of the systemverilog parser didn't want to take such algorithm.

https://github.com/universal-ctags/ctags/issues/1488 https://github.com/universal-ctags/ctags/issues/80 https://github.com/universal-ctags/ctags/pull/1495

I'm looking for a new maintainer who implements the algorithm. If we cannot find such person, we have to wait till I complete the "multi-pass/multi-file" infrastructure. It will take longer time.

So as usual, I have to say "pull requests are well come."

masatake commented 4 years ago

BTW, could you help me about writing a commit log for fixing the SystemVerilog parser? You know SystemVerilog well. So I would like to see https://github.com/universal-ctags/ctags/issues/2489 .

antoinemadec commented 4 years ago

@masatake , got it. I already looked at the code and tried to fix it before opening this ticket. The C code was a bit out of my league TBH. Will take another shot at it without any guarantee of results :smiley:

I just responded to your question on 2489. Don't hesitate to ask me if you have more question regarding SV, I would be happy to help :wink:

Thanks, Antoine

masatake commented 4 years ago

@antoinemadec, thank you.

One of the difficulties for changing the logic in verilog.c is that the parsers (systemverilog and verilog) parsers use "kind" (verilogKind) as token type like:


typedef struct sTokenInfo {
    verilogKind         kind;

kind is a thing assigned to a tag entry in the output. In other hand, "token" is a parser private thing.

See python.c. It defines types for token like:

typedef enum eTokenType {
    /* 0..255 are the byte's value */
    TOKEN_EOF = 256,
    TOKEN_UNDEFINED,
    TOKEN_INDENT,
    TOKEN_KEYWORD,
    TOKEN_OPERATOR,
    TOKEN_IDENTIFIER,
    TOKEN_STRING,
    TOKEN_WHITESPACE,
} tokenType;

typedef struct {
    int             type;

Defining types for token brings great flexibility to the parser code. If you cannot find easy way to implement what you want, "defining types for token and using the kinds only for emitting tag entries" may be a hint.

Thank you for trying improving the parser. Feel free to ask me the question. I don't know SystemVerilog. However, I know ctags, especially language common part.

I'm not good at English. So the ways to express my "thanks" are very limited. "Thank you very much", "very much" suffixed version of "thank you", is kept for the time when I receive a pull request for the issue from you:-)

Good luck.

p.s. #2489, thank you.

I'm thinking about organizing , "language consultants". Can I allow me to contact you when I get a question about SystemVerilog in the future? I would like to see #2494.

masatake commented 4 years ago

@vhda, could you look at this one and give us comments if you have. I also would like you to see #2494. BTW, fnally I implemented an infrastructure for multi-pass parsing on an input file is implemented. I'm working on multi-pass parsing on multiple input files.

vhda commented 4 years ago

An issue with Verilog is that you can have both

input name,

and

input type name,

That is, when you parse the word after type you don't know if it's a type or a name. But it should be possible to store each string and only determine the token name when , or ; are found. In fact, I vaguely remember doing something like this at the time to solve some other problem.

Let's do as following: let me pick up on a work in progress I have here to extend support to structs. This should help me remember how everything is coded and refresh my C, which has not been used since. I'll then rebase this change over master branch to learn what changed. Finally I'll look into this problem.

Cheers, Vitor

masatake commented 4 years ago

Feel free to revert https://github.com/universal-ctags/ctags/pull/2492/commits/23297f617c9b402d4fc8715cee6aad4a497c00d4 .

vhda commented 4 years ago

Far from that! Preferably we would even talk a bit about cork, to see if I finally understand it.

masatake commented 4 years ago

See http://docs.ctags.io/en/latest/internal.html#output-tag-stream about cork. And feel free to ask me question about ctags itself.

masatake commented 4 years ago

SystemVerilog parser and Ada parser, both have the same limitation in their design level. They reuse their kind indexes for their token types. They must be separated. The kind indexes are for tag emission. The token types are for tokenizing. They are different. This separation brings us great flexibility to improve and change the parsers without exposing parser internal changes to users' eyes.

masatake commented 4 years ago
diff --git a/parsers/verilog.c b/parsers/verilog.c
index 737fc815..a3f4ce6e 100644
--- a/parsers/verilog.c
+++ b/parsers/verilog.c
@@ -607,8 +607,8 @@ static void dropEndContext (tokenInfo *const token)
    vStringDelete(endTokenName);
 }

-
-static void createTag (tokenInfo *const token)
+static int createTag (tokenInfo *const token);
+static int createTagFull (tokenInfo *const token, int *corkIndexFQ)
 {
    tagEntryInfo tag;
    verilogKind kind;
@@ -627,7 +627,7 @@ static void createTag (tokenInfo *const token)
    if (vStringLength (token->name) == 0 || ! kindEnabled (kind))
    {
        verbose ("Unexpected empty token or kind disabled\n");
-       return;
+       return CORK_NIL;
    }

    /* Create tag */
@@ -668,7 +668,10 @@ static void createTag (tokenInfo *const token)
        tag.name = vStringValue (scopedName);

        markTagExtraBit (&tag, XTAG_QUALIFIED_TAGS);
-       makeTagEntry (&tag);
+       if (corkIndexFQ)
+           *corkIndexFQ = makeTagEntry (&tag);
+       else
+           makeTagEntry (&tag);

        vStringDelete (scopedName);
    }
@@ -705,6 +708,12 @@ static void createTag (tokenInfo *const token)

    /* Clear no longer required inheritance information */
    vStringClear (token->inheritance);
+   return corkIndex;
+}
+
+static int createTag (tokenInfo *const token)
+{
+   return createTagFull (token, NULL);
 }

 static bool findBlockName (tokenInfo *const token)
@@ -1162,7 +1171,7 @@ static void tagNameList (tokenInfo* token, int c)
 {
    verilogKind localKind;
    bool repeat;
-
+   int cork = CORK_NIL, fqcork = CORK_NIL;
    /* Many keywords can have bit width.
    *   reg [3:0] net_name;
    *   inout [(`DBUSWIDTH-1):0] databus;
@@ -1206,7 +1215,7 @@ static void tagNameList (tokenInfo* token, int c)
            else
            /* Create tag in case name is not a known kind ... */
            {
-               createTag (token);
+               cork = createTagFull (token, &fqcork);
            }
        }
        else
@@ -1237,9 +1246,22 @@ static void tagNameList (tokenInfo* token, int c)
        }
        if (c == ',')
        {
+           cork = CORK_NIL;
+           fqcork = CORK_NIL;
            c = skipWhite (vGetc ());
            repeat = true;
        }
+       else if (isIdentifierCharacter(c))
+       {
+           tagEntryInfo *e;
+           e = getEntryInCorkQueue (cork);
+           if (e)
+               e->placeholder = 1;
+           e = getEntryInCorkQueue (fqcork);
+           if (e)
+               e->placeholder = 1;
+           repeat = true;
+       }
    } while (repeat);
    vUngetc (c);
 }
[jet@living]~/var/ctags% cat /tmp/foo.sv 
cat /tmp/foo.sv 
pack.sv:
package PACK;
    localparam N = 100;
    typedef logic [N-1:0] PORT_TYPE_T;
endpackage // PACK

port.sv:
import PACK::*;

module port (
    input  PORT_TYPE_T  port,
    input  PORT_TYPE_T  port2);
endmodule // port
[jet@living]~/var/ctags% ./ctags  --sort=no -o - /tmp/foo.sv
./ctags  --sort=no -o - /tmp/foo.sv
PACK    /tmp/foo.sv /^package PACK;$/;" K
N   /tmp/foo.sv /^    localparam N = 100;$/;"   c   package:PACK
PORT_TYPE_T /tmp/foo.sv /^    typedef logic [N-1:0] PORT_TYPE_T;$/;"    T   package:PACK
port    /tmp/foo.sv /^module port ($/;" m
port    /tmp/foo.sv /^    input  PORT_TYPE_T  port,$/;" p   module:port
port2   /tmp/foo.sv /^    input  PORT_TYPE_T  port2);$/;"   p   module:port
[jet@living]~/var/ctags% cat /tmp/bar.sv 
cat /tmp/bar.sv 
`include "b.sv"

module mod(
  input bit clk,
  input int32_t a
);
endmodule // mod
[jet@living]~/var/ctags% ./ctags  --sort=no -o - /tmp/bar.sv
./ctags  --sort=no -o - /tmp/bar.sv
mod /tmp/bar.sv /^module mod($/;"   m
clk /tmp/bar.sv /^  input bit clk,$/;"  p   module:mod
a   /tmp/bar.sv /^  input int32_t a$/;" p   module:mod
masatake commented 4 years ago

input t a,

The algorithm:

  1. create a tag for t as we did before.
  2. if you find a character that can be part of an identifier as 1, cancel the emission of the tag for t.
  3. create a tag for the identifier a, goto 2.

The step 2. is based on what @vhda wrote.

I guess this doesn't work well because createTag() is called recursively. So the cancellation works partially.

hirooih commented 4 years ago

Hi, Are anyone working on this issue? If not, I'd like to try implement @antoinemadec 's algorithm (https://github.com/universal-ctags/ctags/issues/2413#issuecomment-609486797).


SystemVerilog BNF is very complex, so I squashed it. https://docs.google.com/document/d/1Sb0-iO9HofB7BR52XUarhGYPph6mFMx5oIFcU8eeFj0/edit?usp=sharing

Here is my basic idea.

do {
  skip light-gray tokens
  If the first token is identifier:
     the identifier is a type_identifier or an identifier with implicit_data_type (https://github.com/universal-ctags/ctags/issues/2413#issuecomment-609580989)
  else
     the token is a keyword: (already implemented)
}

@masatake san,

https://github.com/universal-ctags/ctags/issues/2413#issuecomment-609580989

One of the difficulties for changing the logic in verilog.c is that the parsers (systemverilog and verilog) parsers use "kind" (verilogKind) as token type like: ... kind is a thing assigned to a tag entry in the output. In other hand, "token" is a parser private thing.

I think the current implementation of verilog.c is straight-forward by reading as follows;

Any comments or feedback are welcome. Thank you.

masatake commented 4 years ago

@hirooih, I would like you to know the typical good parser implementation.

In ctags, generally, "kind" is for tags, not for tokens. A user can know what kinds are available in a parser with "ctags --list-kinds-full=".

"kinds" defined in a parser is part of APIs. We should keep their compatibility as much as possible. We can implement a parser for a language in various ways. However, changes in the implementation should not be visible as changes of "kinds" definition.

When writing a token oriented parser, you may want to assign a type to a token like NUMBER, SYMBOL, SEPARATOR, STRING-LITERAL, ... You can define them as you want. They should be used in parser internal. If you change the type definition and/or assignment, the "kinds" definition of parser should not be changed. The types of token and kinds of tags are different things. Users should not know the types of tokens. Actually we don't provide --list-token-types option.

A good parser like Python parser uses token types and token keywords.

typedef enum eTokenType {
    /* 0..255 are the byte's value */
    TOKEN_EOF = 256,
    TOKEN_UNDEFINED,
    TOKEN_INDENT,
    TOKEN_KEYWORD,
    TOKEN_OPERATOR,
    TOKEN_IDENTIFIER,
    TOKEN_STRING,
    TOKEN_ARROW,                /* -> */
    TOKEN_WHITESPACE,
} tokenType;

If a token has TOKEN_KEYWORD type, it can have a subtype called keyword:

enum {
    KEYWORD_as,
    KEYWORD_async,
    KEYWORD_cdef,
    KEYWORD_class,
    KEYWORD_cpdef,
    KEYWORD_def,
    KEYWORD_extern,
    KEYWORD_from,
    KEYWORD_import,
    KEYWORD_inline,
    KEYWORD_lambda,
    KEYWORD_pass,
    KEYWORD_return,
};

These constants are used in the parser internally. The kinds of python are defined separately:

typedef enum {
    K_CLASS,
    K_FUNCTION,
    K_METHOD,
    K_VARIABLE,
    K_NAMESPACE,
    K_MODULE,
    K_UNKNOWN,
    K_PARAMETER,
    K_LOCAL_VARIABLE,
    COUNT_KIND
} pythonKind;

SystemVerilog is one of the parsers that use keywords and kinds mixed-way. I think this mixture makes improving the parser harder (without breaking the APIs).

I wrote much. HOWEVER, YOU CAN DO AS YOU WANT for fixing the issue.

What I would like to know is "kinds" is part of the APIs of the parser. We should try to keep compatibility. SystemVerilog parser has enough test cases. They will detect incompatibility unexpectedly introduced.


enum verilogKind and KeywordTable[] are for kinds of tokens

I don't think this is good. Using kind, a user-visible thing in tokens for typing.

VerilogKinds[] and SystemVerilogKinds[] are for kind of identifier-tokens (assigned to a tag entry in the output)

Your code reading is correct. However, using kind for tokens is not a good idea. kinds are for tags. tags and tokens are different things. tokens represent the result of reading the input source file. tags represent the output of ctags.

masatake commented 4 years ago

BTW, if you need explanations for registerEntry() and foreachEntriesInScope() for fixing this issue, please let me know.

hirooih commented 4 years ago

@masatake san

I believe I understood what you wrote. I read your similar explanations in #2576, #2618, and this issue. I just wrote that we could use "kind" as a general English word.

I also think enum verilogKind is a little bit tricky. It bridges KeywordTable[] and VerilogKinds[]/SystemVerilogKinds[]. If verilogKindis greater than or equal to 0, it can be used as an index of VerilogKinds[]/SystemVerilogKinds[].

And APIs ( initTagEntry, makeTagEntry) refers only VerilogKinds[]/SystemVerilogKinds[] and enum verilogKind (>=0). The verilog.c does not break APIs.

BTW, if you need explanations for registerEntry() and foreachEntriesInScope() for fixing this issue, please let me know.

Thank you. I don't think I have to change this part, but I will ask your help if I need.


Before I start working, I would like to propose changing Units/parser-verilog.r/*/args.ctags as follows to output all supported information. We can detect any change of output not to break compatibility. And --sort=no helps debugging.

--extras=+q
--fields=+i
--kinds-systemverilog=+Q
--sort=no

(This is the output of cat */args.ctags | sort | uniq at Units/parser-verilog.r/.)

Off course I also update */expected.tags.

May I do that?

masatake commented 4 years ago

--sort=no is acceptable. However, I don't think unifying args.ctags is a good idea. Each test is developed for its own purpose. And each args.ctags is designed for its purpose. Unifying args.ctags hides the original intention of the test cases.

You are the only person who improves the SystemVerilog parser on the earth. So If you really need unified args.ctags, you can do that.

hirooih commented 4 years ago

I don't think unifying args.ctags is a good idea.

I tried and found you were right. They added only noises. I added only "--sort=no". I sent pull-request #2650


--extras=+q

    parser-verilog.r/verilog-sf_bug73_3
    parser-verilog.r/systemverilog-assignment
    parser-verilog.r/systemverilog-2d-array
    parser-verilog.r/systemverilog-github2635
    parser-verilog.r/verilog-sf_bug108_2
    parser-verilog.r/verilog-memleak

emits only non-useful messages.

--fields=+i

no change

--kinds-systemverilog=+Q

no change

hirooih commented 4 years ago

@antoinemadec, I've commit the fix for this issue on #2653. Please try it.

hirooih commented 4 years ago

Let me close this issue.