Regex doesn't work correctly?

Cris70 commented 3 years ago

Hi, I have implemented a parser for Rexx using the universal mapper. The configuration is very simple:

    "Rexx"  : {
                "regex":
                [
                    [
                        "^(\\S+:)\\s.*$",
                        "[(:].*$",
                        "",
                        false
                    ]
                ],
                "indent": 4,
                "obligatory indent": false,
                "empty line in map before": "",
                "line numbers before": false,
                "prefix": "",
                "suffix": "()",
                "syntax": "Packages/Text/Plain text.tmLanguage"
    },

However, the map sometimes shows statements that shouldn't be mapped. If I do a search in ST3 using the same regex as the one defined in the config, it works correctly.

Example this statement in Rexx code is shown in the map as it it was a function: ________'W0DB.TCORLIBS.\nCurrProc: '||CurrProc||'\n', (I am using underscores in front of the statement to show spaces, that would otherwise be hidden by the markup interpretation)

If is use the search function in ST3 with the ^(\S+:)\s.*$ regex, that statement is NOT found (correctly).

oleg-shilo commented 3 years ago

Can you please share the use case that I can use to work on the problem:

the content file
steps to reproduce

Your line "the map sometimes shows" makes me feel uneasy. ST3 is great and plugin hosting architecture is clever but.... debugging of plugins is plain impossible. This makes any investigation of intermittent problems extremely difficult .

Cris70 commented 3 years ago

Hi @oleg-shilo, thank you for your reply! I cannot share the code, however I'll create a use case for you and I'll share it here. Stay tuned :-)

Cris70 commented 3 years ago

Hi @oleg-shilo this is a gist with some dummy rexx code that shows the bad behaviour: https://gist.github.com/Cris70/c91747c2266d2cb9b4a149dcf14ad558 If you copy the code in ST3, you should see this in the Code-Map (after adding support for Rexx as shown in my first post): The first three rows in Code-Map are not rexx procedures, instead they are simple strings in the call to the MsgStop() procedure. They should not be shown, as they do not match the ^(\S+:)\s.*$ regex.

oleg-shilo commented 3 years ago

Thank you. Will try to find time to look at it

Cris70 commented 3 years ago

Hi @oleg-shilo, any news on this?

oleg-shilo commented 3 years ago

Sorry, Cris. I haven't had a chance to look at it. I only now have approached a reasonably comfortable state of the backlog of my major open-source projects. I have now updated the VSCode extensions and will be able to proceed with updating ST3 plugins including codemap.

I'm looking at the plugin right now. I am effectively reverse-engineering it.

Oh man, it's not easy. It was a long time ago I looked at it and the absence of the debugger does not help. It's a MAJOR drawback with ST3 plugin development for me.

Cris70 commented 3 years ago

No problem @oleg-shilo. I know the problem as I myself tried to reverse engineer your code, but the absence of a way to debug ST3 plugins got me stumped. Take your time.

oleg-shilo commented 3 years ago

I am getting there.... :)

oleg-shilo commented 3 years ago

OK, found the problem.

First of all this is how the universal mapper settings are used at runtime:

https://github.com/oleg-shilo/sublime-codemap/blob/cf67e22ff297b16ac1bce10dcc1f4bada1b48596/code_map_support.py#L209-L227

Note the line text that is tested with regex is left trimmed before the test. Meaning that if your code has line " say_hello():" the text that is tested with regex is "say_hello():"

And this trimming is what is causing your problem.

Can you please have a look if you can add extra conditions that would prevent detection of any line that starts with '. lf it's impossible then we can think about an alternative

oleg-shilo commented 3 years ago

You can definitely use this changed definition:

"Rexx"  : {
                "regex":
                [
                    [
                        "^(?!')(\\S+:)\\s.*$",
                        "[(:].*$",
                        "",
                        false
                    ]
                ],

Cris70 commented 3 years ago

First of all, many thanks for your work and your explanation. Your updated regex is much welcome, but unfortunately there are other situations where this could happen that are not easily detectable by changing the matching pattern. Is it absolutely necessary to left-trim the string before testing? After all, what is the meaning of the "^" pattern in the regex if you first remove the leading spaces?

oleg-shilo commented 3 years ago

Let me explain first the motivation for l-trimming the string before processing.

A line that potentially contains a method declaration has two types of information the signature parser needs to process:

leading whitespace (indent) representing the scope/nesting-level of the method
non-whitespace string representing the method signature

Universal parser is responsible for parsing whitespaces. Though it delegates parsing of non-whitespece content (potential signature) to the user defined regex. That's why regex pattern is always applied to non-whitespace part of the line.

To put it simple, parsing a signature does not depend on the amount of whitespace but only on the signature pattern. The parsing algorithm is the same regardless if it is

                            int Add(int a, int b)

or

int Add(int a, int b)

That's why a trimmed version of the line is subjected to the regex matching. Though not having it described has no excuse :( For any syntaxes when this approach is not flexible enough you are arguably better off with a simple dedicated parser (written in Python).

In your case your signatures happen to have no specific tokens to recognize them except :. That's why your regex is so generic so it picks any other : even in string literals.

                      'W0DB.TCORLIBS.\nCurrProc: '||CurrProc||'\n'

Even if we always test the raw non-trimmed string your regex pattern would still fail for this case:

'W0DB.TCORLIBS.\nCurrProc: '||CurrProc||'\n'

In fact your approach has a flaw. It picks the signatures from strings and comments regardless of the abount of leading whitespaces

Thus you have an option to either create a more comprehensive parser or at least rule out analyzing string literals (I shared the solution with you)

I am still opened to idea to allow testing raw string as a configurable option if I did not convince you yet :)

Cris70 commented 3 years ago

Thank you Oleg for your explanation and for taking the time to write it. I don't want to give you more work to accommodate what is a niche scripting language. I think it is the time for me to develop a dedicated parser in Python. I'll share it with you once it's ready!

oleg-shilo / sublime-codemap

Regex doesn't work correctly? #37