secondlife / jira-archive


[BUG-234897] New LSL function: llRegex() #11733

Closed sl-service-account closed 7 months ago

sl-service-account commented 9 months ago

How would you like the feature to work?

Performs unanchored regex matching against a string and returns the capturing groups.

Note: This is different from llRegexFound() in BUG-234898

Signature:

list llRegex(string pattern, string haystack, integer options);

Parameters

pattern = the regex to apply

haystack = the string to be searched

options = bitmap of regex options as follows:

Important Note: Only the first match will be returned. This reflects how the Match() method works.

The behaviour of the returned list is similar to how .NET's RegularExpressions.Match.Groups works; see https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.match.groups?view=net-8.0

For example, invoking the following:

llRegex(
    "hugs?\\s+(\\S+)(\\s+(\\S+))?",
    "/me Hugs Claire happily with a gusto and hugs Zhaoying happily as well",
    REGEX_IGNORECASE
)

will return the first match parsed like so:

["Hugs Claire happily", "Claire", " happily", "happily"]

Note: The duplication in elements [2] and [3] occurs because a non-capturing group is not used. If the regex engine supports non-capturing groups (which .NET does, by the way), then using the following regex pattern (notice the addition of ?: in the 2nd group):

"hugs?\s+(\S+)(?:\s+(\S+))?"

will result in:

["Hugs Claire happily", "Claire", "happily"]
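The list shape described above can be tried out with any engine that exposes numbered groups. The following is a minimal sketch in Python, using its `re` module purely as a stand-in. `ll_regex` is a hypothetical helper, not an existing LSL function. Mapping an optional group that did not match to an empty string is an assumption borrowed from how .NET reports unsuccessful groups.

```python
import re

def ll_regex(pattern, haystack, ignore_case=False):
    """Hypothetical emulation of the proposed llRegex(): first match only."""
    flags = re.IGNORECASE if ignore_case else 0
    m = re.search(pattern, haystack, flags)  # unanchored search
    if m is None:
        return []  # no match: empty list, as proposed
    # Element 0 is the whole match; the rest are the capturing groups.
    # Assumption: a group that did not participate becomes "".
    return [m.group(0)] + ["" if g is None else g for g in m.groups()]

text = "/me Hugs Claire happily with a gusto and hugs Zhaoying happily as well"
with_dup = ll_regex(r"hugs?\s+(\S+)(\s+(\S+))?", text, ignore_case=True)
without_dup = ll_regex(r"hugs?\s+(\S+)(?:\s+(\S+))?", text, ignore_case=True)
print(with_dup)
print(without_dup)
```

Only the first occurrence is reported; the later "hugs Zhaoying" is never reached, which matches the "first match only" note.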

Also note that the match is unanchored; if the scripter wants to anchor the match to the beginning and/or end of the string, the scripter needs to use ^ and $ respectively.
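To make the anchoring rule concrete, here is a small sketch, again with Python's `re` module as a stand-in for whatever engine would sit behind the proposed function. `first_match` is a hypothetical helper introduced only for this illustration.

```python
import re

def first_match(pattern, haystack):
    """Return the first matched substring, or None (hypothetical helper)."""
    m = re.search(pattern, haystack)
    return m.group(0) if m else None

unanchored = first_match(r"hugs", "big hugs for all")   # found mid-string
at_start = first_match(r"^hugs", "big hugs for all")    # None: not at the start
at_end = first_match(r"all$", "big hugs for all")       # found at the very end
whole = first_match(r"^hugs$", "big hugs for all")      # None: must span the whole string
```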

Implementation Details

The function should be implemented using the TimeSpan-equipped Match() method.

The TimeSpan object passed to the Match() method should be initialized to reflect the maximum time the regex is allowed to run, say 100-200 milliseconds, using the 5-parameter TimeSpan constructor as such:

new TimeSpan(0, 0, 0, 0, 100);

This should prevent badly-formed regex from taking too much sim time. If a RegexMatchTimeoutException is raised, the function should just return an empty list.

Why is this feature important to you? How would it benefit the community?

Currently there is no 'native' regex feature to do string matching. However, regex is available indirectly using llLinksetDataFindKeys(). This requires some acrobatics to use, though, and likely will not be as performant as a native regex function.

In addition, llLinksetDataFindKeys() cannot do group capturing.

Having a native function that also performs group captures can potentially greatly simplify scripts, by replacing complicated substring search + substring extraction with a simple regex search + list indexing.

For example, to parse the following lines in a configuration notecard:

param1 = value1
  param2 = value2
param3=value3
param4= value4

The following regex provides easy match-and-extraction:

^\s*([^#=][^=\s]*)\s*=\s*(.+)

The parameter name will be at location [1] in the list, while the parameter's value will be at location [2]

(The incantations in the first capturing group mean the first non-whitespace character must be neither "#" nor "=", followed by zero or more characters that are neither "=" nor whitespace. This will skip lines whose first non-whitespace character is a pound sign, a common convention for comments.)
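The pattern can be exercised against the sample lines with a short sketch. Python's `re` module is used here only as a stand-in engine; `parse_line` and both compiled pattern names are hypothetical. `STRICT_LINE` is a possible tightening that is not part of the original request.

```python
import re

# The pattern exactly as given in the request.
CONFIG_LINE = re.compile(r"^\s*([^#=][^=\s]*)\s*=\s*(.+)")
# Hypothetical tightening (not in the original request): also exclude
# whitespace from the first character of the name.
STRICT_LINE = re.compile(r"^\s*([^#=\s][^=\s]*)\s*=\s*(.+)")

def parse_line(line, rx=CONFIG_LINE):
    """Return (name, value) for a 'name = value' line, else None."""
    m = rx.search(line)
    if m is None:
        return None
    # Group 1 is the parameter name, group 2 the value (trimmed here).
    return m.group(1), m.group(2).strip()

lines = ["param1 = value1", "  param2 = value2", "param3=value3",
         "param4= value4", "# a comment = ignored"]
parsed = [parse_line(line) for line in lines]
print(parsed)
```

One edge case worth knowing: because the first character class accepts whitespace, an indented comment that contains an equals sign with no space before it, such as ` #a=b`, can still match through backtracking and yields the name ` #a`. The stricter variant above rejects that line while still accepting the four sample lines.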

Example Usage in Code

Parsing a notecard config


// Reads "name = value" lines from a notecard into linkset data, then lists them.
string CONF = "configuration.conf";
key gReq;
integer gIdx;

default {
    state_entry() {
        gReq = llGetNotecardLine(CONF, (gIdx = 0));
    }
    dataserver(key queryid, string data) {
        if (gReq != queryid) return;
        if (EOF == data) state Operational;
        list matches = llRegex("^\\s*([^#=][^=\\s]*)\\s*=\\s*(.+)", data, 0);
        if (matches != []) {
            string param = llList2String(matches, 1);
            string value = llStringTrim(llList2String(matches, 2), STRING_TRIM);
            llLinksetDataWrite("c:" + param + ":", value);
        }
        gReq = llGetNotecardLine(CONF, ++gIdx); // keep the new query id so the next dataserver event is accepted
    }
}

state Operational {
    state_entry() {
        list confkeys = llLinksetDataFindKeys("^c:", 0, 0);
        integer i = -llGetListLength(confkeys);
        if (i == 0) return; // nothing stored; the do-while below would otherwise never see ++i reach 0
        string k;
        do {
            k = llList2String(confkeys, i);
            llOwnerSay(k + " = " + llLinksetDataRead(k));
        } while (++i);
    }
}

As you can see, this script is now very efficient because:


Original Jira Fields

| Field | Value |
| ------------- | ------------- |
| Issue | BUG-234897 |
| Summary | New LSL function: llRegex() |
| Type | New Feature Request |
| Priority | Unset |
| Status | Closed |
| Resolution | Duplicate |
| Labels | scripting |
| Created at | 2023-12-28T01:20:13Z |
| Updated at | 2024-01-17T19:51:57Z |

```
{
  'Build Id': 'unset',
  'Business Unit': ['Platform'],
  'Date of First Response': '2023-12-28T08:31:45.567-0600',
  'How would you like the feature to work?': 'Signature:\r\n\r\n list llRegex(string pattern, string haystack);\r\n\r\nReturns a list of matches:\r\n\r\n* First element (0) is always the part of the string that exactly match the regex pattern\r\n* Subsequent elements are the capturing groups (if any)\r\n* If no matches are found, the function returns an empty list\r\n\r\nThis behaviour of the list is similar to how .Net RegularExpressions.Match.Groups work, see https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.match.groups?view=net-8.0',
  'ReOpened Count': 0.0,
  'Severity': 'Unset',
  'Target Viewer Version': 'viewer-development',
  'Why is this feature important to you? How would it benefit the community?': "Currently there is no 'native' regex feature to do string matching. However, regex is available indirectly using llLinksetDataFindKeys(). This requires some acrobatics to use, though, and likely will not be as performant as a native regex function.\r\n\r\nHaving a native function that also perform group captures can potentially greatly simplify scripts, by replacing complicated substring search + substring extraction with a simple regex search + list indexing.",
}
```
sl-service-account commented 8 months ago

JIRAUSER341305 commented at 2023-12-28T14:31:46Z, updated at 2024-01-07T05:51:28Z

TL;DR Version: add a start offset parameter, and return list of start index and string pairs for each group (0+), plus end+1 index (and just the single integer -1 for no matches).

I most definitely want the match indexes — index and string for each group would work.  It's not enough to just look for the found group strings in the original string (there could be more than one match).  An absent group should probably start at -1 (we're already used to handling -1's in LSL).  With the options I suggest below, this could just remain the strings by default.

The option to get the end index rather than the matched string would be a definite positive when dealing with larger strings, not least because groups can overlap, causing a potentially large chunk of string to be duplicated a couple of times in the result list. Group indexes also allow you to easily extract various sub-ranges composed from matched group boundaries. These should be the numbers you would pass directly into llGetSubString (i.e., LSL inclusive ranges) to extract the corresponding string. If there's no scope for an option here, then just return the indexes; it's trivial to get the strings from the indexes, but impractical to go the other way. For absent groups, a pair of -1's would be a good choice.

It'd be really helpful to have the end index tacked onto the end of the list — even in indexes mode, this should be the character immediately after group 0 (not the same value as given for group 0's end), suitable to feed directly back in in a "find the next occurrence" situation.  This should be present in strings mode also to keep this use case clean and easy; it's always the item at -1 in the list, there's always exactly one additional item for loops to discount, and the number of items found is still just the list length integer divided by the stride.

Should have a start offset parameter to go with the above (everything should have a start offset parameter).  The start parameter will let it take the place of an additional "find all" version (such as llRegexFound); avoids the string chopping, also avoids "but I want the groups too" on a find all, and the duplication of everything else in common between the two functions.  And it's really not a complex loop to find them all (the benefit of "find all" of course, being caching of the constructed regex, but hopefully LSL will be doing that regardless).
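A rough sketch of how the start offset and the trailing "end+1" index could combine into a "find the next occurrence" loop is shown below. It uses Python's `re` module as a stand-in engine. `regex_from` and `find_all` are hypothetical names, and the exact list layout (start index and string per group, then the index just past group 0, or a lone -1 for no match) is one reading of the suggestion above, not a settled design.

```python
import re

def regex_from(pattern, haystack, start=0):
    """Hypothetical: [start0, str0, start1, str1, ..., next_index] or [-1]."""
    m = re.compile(pattern).search(haystack, start)
    if m is None:
        return [-1]
    out = []
    for i in range(m.re.groups + 1):
        # An absent group reports start -1 and an empty string.
        out += [m.start(i), m.group(i) or ""]
    out.append(m.end(0))  # index of the character right after group 0
    return out

def find_all(pattern, haystack):
    """Collect every whole match by feeding the last item back in as start."""
    results, pos = [], 0
    while pos <= len(haystack):
        r = regex_from(pattern, haystack, pos)
        if r == [-1]:
            break
        results.append(r[1])
        nxt = r[-1]
        # Step past a zero-length match so the loop always advances.
        pos = nxt if nxt > r[0] else nxt + 1
    return results

first = regex_from(r"(?i)hugs?\s+(\S+)", "Hugs Claire and hugs Zhaoying")
everything = find_all(r"(?i)hugs?\s+(\S+)", "Hugs Claire and hugs Zhaoying")
```

Because the loop state is just one integer, the same iteration could be split across events by keeping that integer in a global, rather than storing the whole result list.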

The options could be an integer and separate start offset, but should probably be a list for extensibility (and this function is reasonably heavy-weight anyhow)…  I'd additionally like to see the indexes option be a bitmap of three flags; start index, end index, and string — you can have all three if you want — with constants for the three bits (if all three are off, perhaps place a boolean for "was found" in the list), plus an additional option to include the groups or not, and of course the "find all" option.

If the options are a list, including the groups should be a boolean option defaulted to true in single-match mode, and false in "find all" mode.  The "find all" option could be a "match count" instead, to limit it to only returning a reasonable number of matches (and you can loop it to find more in batches).  The default "indexes mode" would be start+string+groups in single match mode, but just start+string in "find all" mode (ie. by default it omits matched groups on a "find all", but you could include them if you wanted to).

For the options as an integer bitmap case, I'd suggest:

REGEX_MATCH_START = 0x1

REGEX_MATCH_END = 0x2

REGEX_MATCH_STRING = 0x4

REGEX_MATCH_GROUPS = 0x8

REGEX_MATCH_REPEAT = 0x10
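As a sketch of how such a bitmap might select what goes into the result list, the following Python models the five suggested bits with an `IntFlag`. The class `RegexMatch`, the function `regex_fields`, and the field order (start, end, string) are all assumptions made for illustration; the end index is inclusive, following the "LSL inclusive ranges" remark earlier in this comment, and Python's `re` stands in for the real engine.

```python
import re
from enum import IntFlag

class RegexMatch(IntFlag):
    START = 0x1    # include the start index
    END = 0x2      # include the inclusive end index
    STRING = 0x4   # include the matched string
    GROUPS = 0x8   # also report capturing groups, not just group 0
    REPEAT = 0x10  # report every match instead of only the first

def regex_fields(pattern, haystack, options):
    """Hypothetical: flatten the selected fields of each match into one list."""
    rx = re.compile(pattern)
    if options & RegexMatch.REPEAT:
        matches = list(rx.finditer(haystack))
    else:
        matches = [m for m in [rx.search(haystack)] if m]
    last_group = rx.groups if options & RegexMatch.GROUPS else 0
    out = []
    for m in matches:
        for i in range(last_group + 1):
            absent = m.start(i) < 0  # group did not participate
            if options & RegexMatch.START:
                out.append(m.start(i))
            if options & RegexMatch.END:
                out.append(-1 if absent else m.end(i) - 1)
            if options & RegexMatch.STRING:
                out.append("" if absent else m.group(i))
    return out

pairs = regex_fields(r"(\w+)=(\w+)", "a=1 b=22",
                     RegexMatch.STRING | RegexMatch.REPEAT)
```

The stride of the returned list is simply the number of START, END and STRING bits that are set, so a caller can compute it from the same options value.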

An alternative to REGEX_MATCH_GROUPS, would be to include a second set of the three index selection flags, with the second set being for the groups.  Specifying no index information for groups, indicating you don't want the groups at all.  And separate options for group 0 is potentially useful; I often end up ignoring group 0, and it'll be the longest string (may not want to substitute with a boolean in this case, though it doesn't matter much).

No matches should probably be an empty list, but a single -1 for "start of next search" would make sense also, and retain consistency. The question is basically a choice between no match being: a falsey list, a first value of -1, a total list length of 1, or a next start index value of 0 (empty list has an implied 0 at -1). I am undecided on this point, but the empty list favours a very simple structure like while(next=llRegex(…)), at the expense of some consistency. But testing for a list length of 1 is next simplest, being just something like next!=[0].

(@primerb1: whether they are merged or not does not actually matter, I suggest combining them simply because all the other functionality of this comment can readily be applied to both equally (even if it arguably shouldn't), collapsing the difference down to the presence of a single flag, with I believe, negligible additional complexity.  Also, regarding precedent with llSetPrimitiveParams, you're right, but not in the way you suggest — SPP and co are a fairly obvious hierarchy, rather than duplicated functionality.  That does however bring up a point I perhaps should have specifically raised…)

This suggestion also supports one other very common use case the original llRegex of this issue does not: splitting the loop across events. Because it essentially turns llRegexFound into sugar for calling llRegex in a loop and concatenating the results (and essentially how I would expect it to be implemented), you can just use llRegex in an iterative fashion instead, rather than either storing the result of llRegexFound in a global for the duration, or doing the entire search again every single time it's needed (for the "saving" of not adding a start offset parameter, which is both trivial, and for which there is thankfully increasing precedent). That doesn't mean llRegexFound is not useful; sometimes it's preferable to keep the results rather than the source, and having a function (or option as I presented above) for that may well be convenient.

sl-service-account commented 8 months ago

JIRAUSER342641 commented at 2023-12-29T02:32:12Z, updated at 2023-12-29T02:41:19Z

@Bleuhazenfurfle I've added your suggestion to BUG-234898

The way I see it, the llRegex() function's purpose is mostly to perform extraction of capturing groups, while llRegexFound() is to perform an index search of the matches.

The output is designed to be as analogous as possible to existing .NET facilities, to reduce the work needed to implement the function.

Should the need arise for more advanced regex legwork that isn't covered by llRegex() and llRegexFound(), a new function can be created.

After all, duplication of functionality already has a precedent: llSetPrimitiveParams() and llSetLinkPrimitiveParamsFast() have significant overlap in effect.

sl-service-account commented 8 months ago

Gwyneth Llewelyn commented at 2024-01-13T01:36:52Z, updated at 2024-01-13T01:59:39Z

Oh, I definitely love this feature request and I'm looking forward to seeing it implemented :)

However, let me add something here: you based your request on the assumption that Linden Lab's servers have been compiled in C# and follow .Net's programming guidelines.

In fact, Second Life slightly predates .Net — both were officially "launched" in 2002, while SL was in beta testing, but some of the earliest code most definitely existed before Microsoft released their earliest versions of .Net.

AFAIK, Linden Lab never really officially said what language was used for the server software. The issue was raised during a short debate ca. 2007, at the time LL released the viewer code as open source, and it was mostly cross-platform C++ (not C#).

Most notably, before the open-source release of the code, the OpenSimulator community assumed that the viewer was written in C#, because that's what they were more familiar with, and because at that time, thanks to the Mono project, it was reasonably possible to write a reverse-engineered version of the SL server in C# that could run natively under Windows and any other system that supported Mono (namely, Linux and macOS).

Around 2007 or so, former employee Babbage Linden, confronted with the choice of LL's programming language of their servers, sort of hinted that it was "mostly C". But since he — as all others who ever saw LL's server code — was under an NDA at the time, it's expectable that he couldn't elaborate further than that.

The point here is that although we all know that Linden Lab changed their scripting engine to run under Mono, there are several levels of subtleties below that layer, especially because the SL Grid still runs non-Mono scripts (at least, as of 2024...). It is therefore a very wild assumption that LL, for their Regex engine, must be following Microsoft's .NET guidelines. They might, or they might not. We, the residents, simply cannot know for sure. We can speculate that this is very likely the case, but it will remain a conjecture, and it won't be easy to extract a statement from a Linden Lab employee who has seen the code, exactly because the server code is a trade secret and they're not at liberty to discuss it.

As such, IMHO, I would recommend in your proposal not to "assume" anything about whatever Regex engine LL is using, beyond what they've already told us (and documented on the SL Wiki). More precisely, assuming that X or Y is "easier to implement" because LL "must" be following Microsoft's guidelines... well, it's delving into much speculation, unless, of course, you have direct knowledge of how the Linden Lab server code has been written, and what technologies it uses (again, beyond those we already know they use).

Now, taking into account that I'm not a professional programmer (in the sense that I'm not regularly paid for doing any programming), and that I just know a smattering of C# (enough to tinker with OpenSimulator/Libremetaverse, but that's as far as I know), and only familiar with the pre-1992 C++ standards, here is an example of what I mean:

You suggest the introduction of certain option flags such as REGEX_IGNORECASE, REGEX_MULTILINE, REGEX_SINGLELINE, etc., because these are defined according to MS's .Net guidelines, thus mapping those values directly to the native C#/.Net implementation (and therefore "easier to do").

By contrast, I would argue that the "better" way to implement those flags would be to eliminate them completely from the "options" thingy and, instead, use the standard notation for those embedded in the regex itself, e.g. (?i), (?m), (?s)... etc., because, well, who knows, Linden Lab, at their discretion, might use a different Regex engine which does a better/faster job at parsing regexes, and using that notation would fit the regex engine better.
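For readers less familiar with the embedded notation, here is a brief sketch of the two styles side by side. Python's `re` module is the stand-in engine, so this says nothing about what Linden Lab's servers would accept; it only shows that, in an engine supporting both, an option flag and its inline form give the same result.

```python
import re

text = "/me Hugs Claire\nand hugs Zhaoying"

# Case-insensitivity as an option flag versus embedded in the pattern.
by_option = re.findall(r"hugs\s+(\S+)", text, re.IGNORECASE)
by_inline = re.findall(r"(?i)hugs\s+(\S+)", text)

# (?m) lets ^ match at the start of every line.
line_starts = re.findall(r"(?m)^(\S+)", text)

# (?s) lets . match a newline, so the pattern can span two lines.
across_lines = re.search(r"(?s)Claire.and", text) is not None
without_s = re.search(r"Claire.and", text) is not None
```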

(One might argue that, during LSL compilation, you could have a pre-processor stage that would convert between formats, if necessary, to make the regexp engine happy, and so my example would be trivial to dismiss. That's not my point here.)

Why would this distinction be important? Well, just consider the following: let's assume that Linden Lab is toying around with the idea of having native LSL regex, but, since LSL is so crippled anyway, it decided to go for the fastest engine out there, at the cost of dropping some not-so-essential regex features (which would only be of interest to very pedantic regexperts out there). A typical example comes from Google, who used their own superfast regex engine, RE2. While this does not support the full range of PCRE2-style regex syntax, it does make a healthy attempt at providing the "core" API functionality as, well, PCRE2...

Therefore, I'd be wary to "recommend" LL any specific implementation, without knowing more about LL's choice of regexp engine and/or programming language/framework used to implement things server-side.