tno-terminology-design / tev2-tools

The Terminology Engine (v2) is a set of specifications and tools that caters for the creation and maintenance (i.e. curation) of terminologies. This repository contains the sources for the tools.
Apache License 2.0

Interpreter REGEXes #21

Closed RieksJ closed 10 months ago

RieksJ commented 11 months ago

The TEv2 specifications require interpreters to use PCRE regexes. However, it seems like the implementation uses ECMA (JavaScript). They are not the same - see https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816#feature-comparison for a feature comparison.

There are specific features, such as the lookbehind, that are supported by PCRE but NOT supported by ECMA. The lookbehind is actually used, e.g. here, there, and here.

We need to decide what to do. I personally think PCRE is the way to go because it has more features than ECMA and we need them (like the lookbehind). However, if there is a good reason not to, let's hear it.

Currently, the documented regexes are PCRE, and the implemented ones are ECMA. They differ, e.g., in how named groups are written. Also, ECMA execution apparently just ignores syntax that it doesn't know.
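
Whether an engine rejects or silently accepts unfamiliar syntax can be checked directly. The following is a minimal sketch, assuming an ES2018+ runtime such as a recent Node.js; the helper `compiles` and the sample patterns are hypothetical and not taken from the TEv2 code. On such runtimes the PCRE/Python-style `(?P<name>...)` group is rejected with a `SyntaxError`, while some unknown escapes (here `\p{Lu}` without the `u` flag) fall back to literal matching instead of failing.

```typescript
// Hypothetical helper: returns true if the pattern compiles on this JavaScript engine.
function compiles(source: string, flags: string = ""): boolean {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err;
  }
}

// The (?P<name>...) named-group form is not valid ECMAScript syntax.
const pythonStyleGroup = compiles("(?P<year>\\d{4})");

// The (?<name>...) named-group form is shared by PCRE and ES2018+.
const sharedStyleGroup = compiles("(?<year>\\d{4})");

// Lookbehind is shared by PCRE and ES2018+ as well.
const lookbehind = compiles("(?<=\\$)\\d+");

// Without the `u` flag, \p{Lu} is not a Unicode property escape:
// it matches the literal text "p{Lu}" instead of an uppercase letter.
const literalFallback = new RegExp("\\p{Lu}").test("p{Lu}");
const propertyEscape = new RegExp("\\p{Lu}", "u").test("A");

console.log({ pythonStyleGroup, sharedStyleGroup, lookbehind, literalFallback, propertyEscape });
```

So a pattern can either throw or quietly change meaning, depending on the construct and the flags, which is why each documented regex is worth checking on the engine that actually runs it.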

Ca5e commented 10 months ago

This isn't causing any issues currently because we're using ECMAScript 2022 in our project. The feature comparison is about 7 years old. PCRE supports the (?<name>...) syntax for named capturing groups, which ECMAScript 2018 and later also support, in addition to the (?P<name>...) form that we've also seen (ECMAScript before 2018 does not support named groups at all). Lookbehinds are now supported as well. I can't find any discussion of actually using PCRE in JavaScript, and I suspect that doing so would require more knowledge of regular expression engines than is common. All of the current regexes in the code are ECMA, but they are also compatible with PCRE.
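
To illustrate the overlap, here is a small sketch that uses only syntax valid in both PCRE and ES2018+: a lookbehind `(?<=...)` and `(?<name>...)` named groups. The pattern, the group names `showtext`, `term` and `scope`, and the function `findTermRefs` are made up for this example and are not the regexes from the TEv2 code; it assumes the project targets ES2018 or later.

```typescript
// Hypothetical pattern for references shaped like "[show text](term@scope)".
// (?<=\[) is a lookbehind; (?<showtext>...), (?<term>...), (?<scope>...) are named groups.
const termRef = /(?<=\[)(?<showtext>[^\]]+)\]\((?<term>[a-z0-9_-]+)@(?<scope>[a-z0-9_-]*)\)/g;

interface TermRef {
  showtext: string;
  term: string;
  scope: string;
}

// Collects every reference found in the text, reading values from the named groups.
function findTermRefs(text: string): TermRef[] {
  const results: TermRef[] = [];
  for (const match of text.matchAll(termRef)) {
    const groups = match.groups;
    if (groups === undefined) {
      continue;
    }
    results.push({
      showtext: groups.showtext,
      term: groups.term,
      scope: groups.scope,
    });
  }
  return results;
}

console.log(findTermRefs("See [the actor](actor@essif) for details."));
```

Sticking to this shared subset means the same pattern text can appear in the specification as PCRE and in the code as ECMA.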

RieksJ commented 10 months ago

Ok. That's a relief. I could not find this on the Internet.