t-edson / SynFacilSyn

Scriptable Highlighter for the SynEdit Component of Lazarus

GNU General Public License v2.0

RegEx support limitation #16

Closed: beNative closed this issue 8 years ago

beNative commented 9 years ago

Hi,

Currently only a small subset of regular expressions is supported, for which you have done the parsing yourself. If you use SynRegExpr.pas, you can support almost the complete RegEx syntax without the need for custom validation. You can also make the current syntax a lot simpler, because every token can be defined by a regular expression, so you don't need Start, End, CharsStart, CharsEnd, Content or even TokPos. Some time ago a guy named "Garnet" made a folding syntax highlighter based on SynEdit that uses a regular expression grammar.

Here is an example of his syntax definition file for an INI highlighter. The tokens are completely defined by regular expression patterns. Look for the 'LetterPress' project to find the Pascal sources of his implementation.

grammar {
    uuid = '{89DD4C92-F4BE-4205-8D1F-20C95373B253}';
    name = 'INI';
    filter = 'ini';
    developer = 'Garnet';
    version = '0.1β';
    sample {
; Syntax highlighting
[Section]
Key=Value # Represents something
String="\"Hello, World!\""
Number=123456
    }
    range {
        name = 'ini';
        style = 'editor';
        style_symbol = 'symbol';
        style_number = 'number';
        rule {
            name = 'ini.section';
            style = 'Reserved Word';
            options = '8';
            pattern = '^\[[\w\d-]+\]$';
        }
        range {
            name = 'ini.comment';
            style = 'Comment';
            open {
                pattern = ';';
                options = '1';
            }
            open {
                pattern = '#';
                options = '1';
            }
            close {
                pattern = '$';
            }
        }
        range {
            name = 'ini.string';
            style = 'String';
            open {
                pattern = '"';
                options = '1';
            }
            close {
                pattern = '(?<!\\)"';
                options = '1';
            }
            rule {
                name = 'ini.string.escape';
                style = 'PHP String Special';
                pattern = '\\([abfnrtv'"\\?]|o[0-7]{1,2}|x[a-f\d]{1,3})';
            }
        }
    }
}

t-edson commented 9 years ago

SynFacilSyn doesn't use "SynRegExpr.pas" or any other regex library, because that would make the highlighter run slowly (I have no exact statistics). SynFacilSyn aims above all to be "fast" and "easy". The highlighter has a special architecture focused on speed. When it uses regex, it "translates" the expression to fit this architecture. This is another reason why a standard regex library wouldn't be practical. It would be possible to mix the internal parser with a standard regex library in SynFacilSyn, but I haven't tested it. As for Garnet's highlighter, I haven't found any information about it on the web.

Alexey-T commented 9 years ago

Can you take a FAST regex library and compare the two? E.g. take a 100 KB Pascal file and test.

magorium commented 9 years ago

Sorry for the necromancy, but i have a question regarding this topic.

In practice i encountered the following situation (besides the other usual suspects):

1) Strings are enclosed in either single or double quotes.
2) Decimal numbers can be integers or floats, in the usual human-readable notation.
3) Hexadecimal numbers are enclosed in either single or double quotes and must be followed by an ASCII character "x" (or "X"). Between the quotes are the usual ASCII characters 0..9, a..f, and these can be arbitrarily grouped with space characters.
4) Binary numbers are enclosed in either single or double quotes and must be followed by an ASCII character "b" (or "B"). Between the quotes are the usual ASCII characters 0..1, and these can be arbitrarily grouped with space characters.

(fwiw: it is a real in practice used scripting language. I'm not making things up there, although i realize the above might sound a little ridiculous).
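
For reference, the four rules above can be written down with a full regex engine. The following is a minimal Python sketch, not SynFacilSyn syntax: the pattern names and the `classify` helper are hypothetical, and the float format is an assumption, since the thread does not spell it out. It relies on a backreference so that the closing quote must equal the opening one, and it tests the postfixed forms before the plain string.

```python
import re

# Hypothetical patterns derived from the four rules above; not SynFacilSyn syntax.
# (['"]) captures the opening quote and \1 requires the same closing quote.
BIN_RE = re.compile(r"""(['"])[01 ]+\1[bB]""")
HEX_RE = re.compile(r"""(['"])[0-9a-fA-F ]+\1[xX]""")
STR_RE = re.compile(r"""(['"]).*?\1""")
NUM_RE = re.compile(r"\d+(?:\.\d+)?")  # assumed float format: digits '.' digits


def classify(text):
    """Return (kind, matched_text) for the token at the start of text."""
    # Order matters: the postfixed forms are tried before the plain string.
    for kind, pattern in (("binary", BIN_RE), ("hex", HEX_RE),
                          ("string", STR_RE), ("number", NUM_RE)):
        m = pattern.match(text)
        if m:
            return kind, m.group(0)
    return "unknown", ""


print(classify("'01 10'b"))  # ('binary', "'01 10'b")
print(classify('"ff 00"X'))  # ('hex', '"ff 00"X')
print(classify("'hello'"))   # ('string', "'hello'")
```

Each pattern here looks ahead to the postfix character by backtracking, which is exactly the step that is hard to express when the token type has to be decided from the first character alone.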

SynFacilSyn documentation does not mention the possibility to change the attribute name when using an advanced regexp and then bail out early (as that probably would have saved me, see explanation).

Because i can only check the matching "x" or "b" character after the previous match, i see no way of being able to 100% guarantee the actual 'literal' contents is correct.

Therefore (if I understand correctly), the only solution available to me is by using multiple token definitions, so that i am able to make a clear distinction between the two.

Alas, the cap of being able to define max. 4 token definitions won't be able to cut it. Begin and ending quote character must match as well as number contents matching the range of closing notation/format character.

Besides the above 'issue', i would like to define even more token entries in order to highlight the language (more) correctly.

So, although i understand not wanting to use a complete regex parser, i seem to run into a wall with this scripting language that makes use of postfix characters in order to distinguish between different tokens.

Could it be i'm doing something wrong there, perhaps overlooking something (obvious) or did i actually run into a limitation of some sorts that can't be easily solved/worked around ?

It's the first time i have run into something like this with SynFacilSyn.

BTW: thank you very much for your highlighter and completion components.

t-edson commented 9 years ago

Hi magorium. I have faced the situation you mention before. SynFacilSyn has a basic script language for defining delimited tokens using regex. It is described in Section 4.6.6 of the documentation. This pseudo-language can change part of the parser's flow (documented) and the attribute (not documented, and probably not functional yet). I think this could be the solution to the problem. Let me check the current state of this feature to see whether it can be activated, and I will answer later.

magorium commented 9 years ago

Hello Tito,

Thank you for your quick reply.

Before you resort to doing anything drastic, please have a look at the following definitions (note that i haven't included matching upper/lower postfix character yet):

Definition for hexadecimal numbers:

<Token Start="'" Regex="[0-9a-fA-F ]+'x" RegexMatch="Complete" Attribute='Number'> </Token>
<Token Start='"' Regex='[0-9a-fA-F ]+"x' RegexMatch="Complete" Attribute='Number'> </Token>

And the definition for binary numbers:

<Token Start="'" Regex="[0-9a-fA-F ]+'b" RegexMatch="Complete" Attribute='Number'> </Token>
<Token Start='"' Regex='[0-9a-fA-F ]+"b' RegexMatch="Complete" Attribute='Number'> </Token>

Besides that, i currently also have a 'normal' number definition (yes, one too many):

<Token CharsStart="0..9" Content='0..9' Attribute="Number"> </Token>

Thank you for mentioning chapter 4.6.6. (which leads me to chapter 5.5.2 as well)

When defining a token using the advanced form, i seem to be unable to find the logical processing sequence that would be necessary in order to match all situations as dictated by the scripting language.

It might perfectly well be that i misunderstood the manual, i am not grasping the implementation correctly or made a logical error somewhere.

fwiw: the attribute change at any point would allow for having a more generic custom advanced definition. In my particular case i could have used the quotation characters to recognize a string as well e.g. why waste time on matching another definition (as i have a separate definition for that now) if the logic is already in place.

Let me try to explain my 'postfix character problem' with some pseudo code:

Start:

Match(single_quote), IfTrue Move(check_single)
Match(double_quote), IfTrue Move(check_double)
Match([0-9]+), IfFalse Exit()

check_single:

Match([0-9a-fA-F ]+), IfTrue Move(check_single_close)
Match([0-1 ]+), IfFalse Exit()

check_single_close:

Match(single_quote), IfFalse Exit()

Now what? Match("x") or Match("b"), it depends.

check_double:

Match([0-9a-fA-F ]+), IfTrue Move(check_double_close)
Match([0-1 ]+), IfFalse Exit()

check_double_close:

Match(double_quote), IfFalse Exit()

Now what? Match("x") or Match("b"), it depends.

Note that i'm doing this off the top of my head (using your manual), as i have been unable to solve the problem in theory yet.

Any hint would be appreciated as currently i seem to be stuck on my part.

Edit: removed some formatting errors plus copy paste error

t-edson commented 9 years ago

First clarification: you cannot use these two definitions in the same XML:

<Token Start="'" Regex="[0-9a-fA-F ]+'x" RegexMatch="Complete" Attribute='Number'> </Token>
<Token Start="'" Regex="[0-9a-fA-F ]+'b" RegexMatch="Complete" Attribute='Number'> </Token>

They use the same first char, so the last one will overwrite the first.
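
The overwrite can be pictured with a lookup table keyed by the starting character. This is only a rough Python model for illustration: `token_table` and `define_token` are hypothetical names, and whether SynFacilSyn stores its definitions this way internally is an assumption, not something stated in this thread.

```python
# Hypothetical model: one definition slot per starting character.
# This is an assumption for illustration, not SynFacilSyn's actual code.
token_table = {}


def define_token(start_char, regex, attribute):
    """Register a token definition under its first character."""
    token_table[start_char] = (regex, attribute)


define_token("'", "[0-9a-fA-F ]+'x", "Number")
define_token("'", "[0-9a-fA-F ]+'b", "Number")  # replaces the entry above

print(len(token_table))     # 1
print(token_table["'"][0])  # [0-9a-fA-F ]+'b
```

With a single slot per first character, the scanner can pick a definition in one lookup, but two alternatives that share a first character have to be merged into one definition.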

t-edson commented 9 years ago

Well, the documentation in English is probably not as clear as it is in Spanish, my native language. But even in Spanish, the confusing script language described in 4.6.6 can be difficult to understand. Honestly, it's difficult for me too to write a working script; maybe someone else has a better command of it. Sorry, this script language was designed to be fast to process, not to be understandable by humans. I am currently working on version 1.1, which has better regex support. I recommend using this new version, but for now its documentation is only available in Spanish. In 1.1 I'm experimenting with the script language for the parser. It can now change the attribute according to the characters found. In theory this can solve your problem. I will try to create a script for it.

t-edson commented 9 years ago

It's done. Version 1.1 includes a better implementation of this "basic parser script language" (I don't know what to call it). I think these lines can give you some help:

  <Token CharsStart='"' Attribute='STRING'> 
    <Regex Text='[^"]*' ></Regex>
    <Regex Text='"' ></Regex>
    <Regex Text='b' IfTrue='exit' atTrue='NUMBER' IfFalse='next'></Regex>
    <Regex Text='x' IfTrue='exit' atTrue='NUMBER' IfFalse='next'></Regex>
  </Token>

For the case of single quote, just add a similar definition:

  <Token CharsStart="'" Attribute='STRING'> 
    <Regex Text="[^']*" ></Regex>
    <Regex Text="'" ></Regex>
    <Regex Text='b' IfTrue='exit' atTrue='NUMBER' IfFalse='next'></Regex>
    <Regex Text='x' IfTrue='exit' atTrue='NUMBER' IfFalse='next'></Regex>
  </Token>

The key here is the "atTrue" parameter, which defines the attribute that the parser assigns to the current token when the condition is TRUE. In the same way, there is an "atFalse" parameter.

As you can see, you can use more than two attributes in the same definition. You can even check whether the chars inside the string are valid for the number token (I guess).

Remember that this only works on v1.1. It's not documented yet, because it is still in development. It will soon be documented in Spanish and, somewhat later, in English.

magorium commented 9 years ago

Hi Tito,

> First clarification: you cannot use these two definitions in the same XML: They use the same first char, so the last will overwrite the first.

In principle yes.

I run into this problem with the 'plain' string definition as well. It has to be declared after the number recognition in order to work correctly.

I would expect that the second token definition will not match as it expects the number to have been closed by a 'b' character and not with an 'x' character.

I have explicitly defined them as such as either one or the other definition matches. The engine can't have it both ways ;-)

> Well, the documentation in English is probably not as clear as it is in Spanish, my native language

Yes, i have noticed that some things are a bit awkward to interpret. No need to worry about that too much, as i can see for myself how things work. I'm not a native English speaker either, so that might add to my confusion as well ;-)

> It's done.

Thank you very much.

That would for sure help me with this specific scripting language, as it also contains labels (which are postfixed with a colon character and are allowed to consist of numeric characters only. Even the dot character is allowed in there :-S ).

Currently the production code is using my SynFacil 1.0 branch, and i can't switch/upgrade so quickly. I'll give the new SynFacil 1.1 a try as soon as i am able to.

What i have concluded so far is, that it is really a pita to describe a language that uses post fix characters all over the place in order to determine the actual token type. In that regard, the new version of SynFacil will not help me out there.

If i may be so bold and allow me to make a suggestion:

If you are extending the documentation, could you add some paragraphs explaining how the extended match expressions and the skipping of certain lines influence the actual position of the cursor for the next executed match ?

Right now it is completely unclear to me where the actual position of the cursor is when a match succeeds or fails. I take it that the regular expression parser is responsible for that.

For now i more or less 'guessed' that when a match fails the cursor will be reverted back to its original location, but i have no idea what happens with this cursor position if i would move() to another regular expression line.

Kind regards,

t-edson commented 9 years ago

> In principle yes.

One rule of SynFacilSyn is: once an initial char has been defined for one token, it cannot (must not) be defined again by another definition. I don't know if you have found a new way to work with SynFacilSyn. Maybe I'm not understanding you correctly.

> What i have concluded so far is, that it is really a pita to describe a language that uses post fix characters all over the place in order to determine the actual token type. In that regard, the new version of SynFacil will not help me out there.

I think the new version can help you a lot. If you need help with a specific definition, just ask me.

> Right now it is completely unclear to me where the actual position of the cursor is when a match succeeds or fails. I take it that the regular expression parser is responsible for that.

Yes, it's not documented well enough. I wanted to avoid documenting the same thing in two different places. That's why the description in section 4.6.6 is very brief, but it's expanded in 5.5.2. Anyway, this information is now expanded in the new documentation (only in Spanish for now).

t-edson commented 9 years ago

This definition can differentiate strings from binary or hexadecimal numbers, validating the digits contained in the string:

  <Attribute Name="BINARY" ForeCol="#ff0000"></Attribute>
  <Attribute Name="HEXA" ForeCol="#00ff00"></Attribute>

  <Token CharsStart="'" Attribute='STRING'> 
    <Regex Text="[01]+" IfTrue='move(4)' IfFalse='next'></Regex>
    <!--continue hexa scan-->
    <Regex Text="[0-9a-fA-F]+" IfTrue='move(6)' IfFalse='next'></Regex>
    <!--continue string scan-->
    <Regex Text="[^']*" ></Regex>
    <Regex Text="'" IfTrue='exit'></Regex>
    <!--Captured binary digits, check-->
    <Regex Text="'" IfFalse='move(-3)'></Regex>
    <Regex Text='b' IfTrue='exit' atTrue='BINARY' IfFalse='next'></Regex>
    <Regex Text='x' IfTrue='exit' atTrue='HEXA' IfFalse='exit'></Regex>
    <!--Captured hexa digits, check-->
    <Regex Text="'" IfFalse='move(-5)'></Regex>
    <Regex Text='x' IfTrue='exit' atTrue='HEXA' IfFalse='exit'></Regex>
  </Token>
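
To follow the control flow of this definition, it can help to trace it by hand. Below is a small Python simulation of the nine `Regex` lines, written as a sketch under assumptions that the thread does not confirm: `move(n)` is taken as relative to the current line, a missing `IfTrue` defaults to `next`, a missing `IfFalse` defaults to `exit`, and a failed match leaves the cursor where it was. The names `STEPS` and `scan` are hypothetical.

```python
import re

# Each step: (pattern, if_true, if_false, at_true). An action is "next",
# "exit", or an integer meaning a relative move. The defaults used here
# (if_true="next", if_false="exit") are assumptions, not documented behaviour.
STEPS = [
    (r"[01]+",        4,      "next", None),      # 0: try binary digits
    (r"[0-9a-fA-F]+", 6,      "next", None),      # 1: try hexa digits
    (r"[^']*",        "next", "exit", None),      # 2: plain string content
    (r"'",            "exit", "exit", None),      # 3: closing quote of a string
    (r"'",            "next", -3,     None),      # 4: quote after binary digits
    (r"b",            "exit", "next", "BINARY"),  # 5: binary postfix
    (r"x",            "exit", "exit", "HEXA"),    # 6: hexa postfix, binary-looking digits
    (r"'",            "next", -5,     None),      # 7: quote after hexa digits
    (r"x",            "exit", "exit", "HEXA"),    # 8: hexa postfix
]


def scan(text, steps=STEPS, default_attr="STRING"):
    """Scan one token whose opening single quote is text[0]."""
    pos, attr, i = 1, default_attr, 0  # the start char is already consumed
    while 0 <= i < len(steps):
        pattern, if_true, if_false, at_true = steps[i]
        m = re.compile(pattern).match(text, pos)
        if m:
            pos = m.end()              # a successful match advances the cursor
            if at_true:
                attr = at_true         # atTrue changes the token attribute
            action = if_true
        else:
            action = if_false          # assumed: a failed match keeps the cursor
        if action == "exit":
            break
        i += 1 if action == "next" else action
    return attr, text[:pos]


print(scan("'0101'b"))  # ('BINARY', "'0101'b")
print(scan("'1F'x"))    # ('HEXA', "'1F'x")
print(scan("'hello'"))  # ('STRING', "'hello'")
```

Under these assumptions, digits grouped with spaces (for example `'01 10'b`) fall through to the plain string branch, because the two digit classes in the definition do not include the space character.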

magorium commented 8 years ago

First of all i need to make an apology for keeping you waiting for so long. I'm truly sorry about that.

Secondly, i wanted to thank you for your quick and helpful responses.

The last definition that you showed worked like a charm for me in a test-setup.

I can't thank you enough for that, as your example also clearly showed how to approach such 'strange' definitions in the future as i now have a far better understanding of what is expected by the SynFacil parser.

Not that i have anything really interesting to show for now, but do you have a special location where you store (or i can contribute) custom definitions so that others could use them as well ?

As far as i am concerned, this topic can be closed.

Kind regards, Ron.

t-edson commented 8 years ago

I'm glad the code has been useful to you.

I'll be happy if you send me syntax definitions for SynFacilSyn. You can send them by email, in the Lazarus forum, or just here. I will include them in the GitHub repository.