universal-ctags / ctags

A maintained ctags implementation
https://ctags.io
GNU General Public License v2.0

docs, lregex: treatment of newlines #3110

Open hirooih opened 3 years ago

hirooih commented 3 years ago

While working on PR #3109, I found that the description of the treatment of newlines might be wrong. But I might be wrong myself. Let me know what I am missing.

From Regular expression (regex) engine:

A more subtle issue is this text from the Regular Expressions chapter: “the use of literal <newline>s or any escape sequence equivalent produces undefined results”. What that means is using a regex pattern with [^\n]+ is invalid, and indeed in glibc produces very odd results.

The text of the specification, including what comes before and after the quoted sentence, is as follows.

In the functions processing regular expressions described in System Interfaces volume of POSIX.1-2017, the <newline> is regarded as an ordinary character and both a <period> and a non-matching list can match one. The Shell and Utilities volume of POSIX.1-2017 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline> characters; if not stated otherwise, the use of literal <newline> characters or any escape sequence equivalent in either patterns or matched text produces undefined results.

It does not say "What that means is using a regex pattern with [^\n]+ is invalid". I can find a description of special treatment of <newline> in the spec. Does this describe an issue specific to the glibc implementation?

And the next sentence follows:

Those utilities (like grep) that do not allow <newline> characters to match are responsible for eliminating any <newline> from strings before matching against the RE.

In the Universal Ctags case this is similar to --regex-<LANG>, which processes input line by line. --regex-<LANG> does not have to care about the setting of REG_NEWLINE, if I understand correctly. <newline> should be eliminated.

Never use \n in patterns for --regex-<LANG>,

This is OK. But I don't understand the following sentence:

and never use them in non-matching bracket expressions for --mline-regex-<LANG> patterns.

First I don't understand what non-matching bracket expressions means. Of course brackets ([ and ]) should be paired. But I guess the sentence above means something different.

I think it is more portable to use ^ or $ than using \n because there are variations of line-break characters.

For the experimental --_mtable-regex-<LANG> you can safely use \n because that regex is not compiled with REG_NEWLINE.

We can also say we have to use \n because that regex is not compiled with REG_NEWLINE. If I understand correctly, it is better to set REG_NEWLINE for --_mtable-regex-<LANG>, too.

masatake commented 3 years ago

Sorry for the delay.

First I don't understand what non-matching bracket expressions means.

I guess this was written about [^...].

I think it is more portable to use ^ or $ than using \n because there are variations of line-break characters.

Are you talking about CR and LF? The buffer used in regex matching is filled by functions defined in main/read.c. The functions normalize the line-break characters to '\n'. So we can use '\n'.
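As a rough illustration of what such normalization means for pattern authors, here is a small sketch in Python. It is not the code in main/read.c: the helper name `normalize_line_breaks` is hypothetical, and the sketch assumes that both CRLF and a lone CR are mapped to LF.

```python
import re


def normalize_line_breaks(text: str) -> str:
    """Map CRLF and lone CR to LF (assumed behavior, for illustration only)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Input mixing three line-break conventions: CRLF, CR, and LF.
raw = "int a;\r\nint b;\rint c;\n"
buf = normalize_line_breaks(raw)

# After normalization a pattern only has to deal with '\n'.
names = re.findall(r"int (\w+);\n", buf)
print(names)
```

With a normalized buffer, a single pattern ending in \n covers all three input conventions, which is why portability across line-break styles is not a concern at the pattern level.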

We can also say we have to use \n because that regex is not compiled with REG_NEWLINE.

YES.

If I understand correctly, it is better to set REG_NEWLINE for --_mtable-regex-<LANG>, too.

I'm not sure which one, setting or not setting, is better. Anyway, too many parsers in optlib/ assume that REG_NEWLINE is not set.

hirooih commented 3 years ago

@masatake san,

I've found <newline>, <period>, and so on are not displayed in my original post. I've fixed them.

I guess this was written about [^...].

I see.

Are you talking about CR and LF?

Yes.

The buffer used in regex matching is filled by functions defined in main/read.c. The functions normalize the line-break characters to '\n'. So we can use '\n'.

Good news. It would be better to document this. I will take this.

While studying Perl regular expressions for #3036, I found that I did not understand the treatment of newlines correctly. I did not distinguish between the /s modifier and the /m modifier.
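For readers who mix up the two modifiers in the same way, the distinction can be sketched with Python's `re` module, whose `re.DOTALL` and `re.MULTILINE` flags correspond to /s and /m. This is only an analogy in another engine; it says nothing about how ctags compiles its patterns.

```python
import re

text = "foo\nbar\n"

# Like /s: '.' may also match a newline.
dot_default = re.search(r"foo.bar", text)
dot_s = re.search(r"foo.bar", text, re.DOTALL)

# Like /m: '^' and '$' also match at internal line boundaries.
anchors_default = re.findall(r"^\w+$", text)
anchors_m = re.findall(r"^\w+$", text, re.MULTILINE)

print(dot_default, dot_s.group(0) == "foo\nbar")
print(anchors_default, anchors_m)
```

The first pair changes what `.` can match, the second pair changes where the anchors can match; the two settings are independent of each other.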

And I need some more time to remember this issue :-) Give me some time.

hirooih commented 3 years ago

@masatake san, I remembered the points of this issue.

First, I withdraw the following.

If I understand correctly, it is better to set REG_NEWLINE for --_mtable-regex-<LANG>, too.

The point is that the following statement is wrong:

What that means is using a regex pattern with [^\n]+ is invalid,

As I cited:

“the use of literal <newline>s or any escape sequence equivalent produces undefined results”.

This applies to "the individual descriptions of those standard utilities", and only under the condition "if not stated otherwise". A regex pattern with "non-matching bracket expressions", [^\n]+, is valid in general.
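To make the term concrete, the following sketch shows what a non-matching (negated) bracket expression does with newlines. It uses Python's `re` module, which interprets the `\n` escape itself, so it does not reproduce POSIX `regcomp()` escape handling or any glibc-specific oddity; it only shows the general meaning of `[^...]`.

```python
import re

text = "alpha\nbeta\n"

# '\n' is listed in the negated set, so a match cannot cross a line break.
per_line = re.findall(r"[^\n]+", text)

# '\n' is not listed, so the negated set is free to match it.
across_lines = re.findall(r"[^b]+", text)

print(per_line)
print(across_lines)
```

The second result is the behavior the POSIX text describes when it says a non-matching list can match a <newline> that is treated as an ordinary character.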

But you have seen that it produces very odd results in glibc. In that case we should keep the notice not to use it. I am curious about how it behaves oddly. Is there a test case for it?

If we agree, let me send a PR for the above.


BTW, you have already merged #3036, but Read the Docs has not been updated yet.