Open thaliaarchi opened 7 months ago
My understanding is that FP internally uses the PCRE library (not sure whether pcre1 or pcre2 is used in production FP), with all/most of its syntax. I wouldn't be surprised if the most compatible solution would be to get a linkable (as in, buildable for wasm) build of the library itself.
Unfortunately I don't know any good reference documentation, but instead we can cross-reference avmplus repo for this: https://github.com/adobe/avmplus/blob/master/core/RegExp.cpp https://github.com/adobe/avmplus/blob/master/core/RegExpObject.cpp
(you can also see minor weird edge cases being explicitly handled there, like only `(?P<` triggering filling named groups' fields, despite `(?<` also being supported by pcre AFAIK)
> If matching performance is not a goal for Ruffle
We generally put correctness first :) (and generally, other aspects of Ruffle are slow enough that I don't think we need to microoptimize here)
What is the relationship between avmplus and Adobe Flash Player? Was avmplus a component that was included directly? Included with proprietary changes? Or was it an open-sourced version of components in Adobe Flash Player, that was occasionally synced? Or something else?
I've reviewed the AS3 spec and each of the similar projects listed in Ruffle's Helpful Resources for how they handle `RegExp`:

- […] `RegExp` semantics.
- […] `ASRegExp`. The current design was written just before ECMAScript 2015 was published, so it probably targets ECMAScript 5.1 `RegExp` semantics. The prior design delegated in `ASRegExp` to the XRegExp library, which converts its own extended syntax to JavaScript syntax. The current design indicates that it fixes more tests than XRegExp, which implies that XRegExp syntax was not a design inspiration for the AS3 language authors.
- […] `ASRegExp`. Their initial version is copied verbatim from Shumway, and changes have been made since. Before @awayfl/avm2 was extracted as a separate package, it existed as a subdirectory of @awayfl/swf-viewer, where the git history continues.
- […] `RegExp`.
- WAFlash is closed-source, so I did not investigate it.

The approaches in Shumway and AwayFL remind me of how Scala.js compiles regular expressions (talk, release notes). They compile Java regex patterns to semantically equivalent JavaScript patterns, so that the native `RegExp` can be used, with all the browser optimizations that come with it.
I think a port of the Shumway approach to Rust, with fixes from AwayFL as appropriate, would be easiest for Ruffle. Shumway's algorithm was written for ECMAScript 5.1, so the only remaining differences should be those introduced by `regress` implementing ECMAScript 2018. It would allow some modern regular expression features that AS3 never had, but would better work around the other differences. It would be strictly better than Ruffle's current situation, but not absolutely perfect.
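As a rough illustration of this kind of pattern translation, here is a minimal Rust sketch that rewrites AS3/PCRE-style `(?P<` group openers to the ECMAScript 2018 `(?<` form before the pattern would be handed to an ECMAScript engine. The name `rewrite_named_groups` is hypothetical and is not taken from Shumway, AwayFL, Ruffle, or `regress`. The sketch ignores `(?P=name)` backreferences and PCRE-specific character class rules (a leading `]`, POSIX classes), so treat it as a starting point only.

```rust
/// Rewrite `(?P<name>` group openers to `(?<name>`.
/// Hypothetical helper for illustration; not an existing Ruffle or regress API.
fn rewrite_named_groups(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    // Simplified tracking of `[...]` character classes (assumption: no
    // leading literal `]` and no POSIX classes such as `[:alpha:]`).
    let mut in_class = false;

    while let Some(c) = chars.next() {
        match c {
            // Copy an escape sequence untouched so `\(` is never treated
            // as a group opener.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            '(' if !in_class => {
                out.push(c);
                // Look ahead on a cloned iterator so nothing is consumed
                // unless the full `?P<` prefix is present.
                let mut look = chars.clone();
                if look.next() == Some('?')
                    && look.next() == Some('P')
                    && look.next() == Some('<')
                {
                    out.push_str("?<");
                    chars = look;
                }
            }
            _ => out.push(c),
        }
    }
    out
}
```

For example, `rewrite_named_groups(r"(?P<year>\d{4})")` yields `(?<year>\d{4})`, while an escaped `\(?P<x>` or a `(?P<` inside a character class is left alone.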
@thaliaarchi Thanks for looking into different approaches! Sadly I don't know enough to have anything to add here. But I'm looking forward to improvements in this area since it may help with #14938 (and a few other issues you saw already). Please do join our Discord server if you can - it's a lot easier to get feedback and bounce ideas off the devs there!
As for avmplus, we don't really know its exact relationship to Adobe Flash Player, but it's generally a good reference implementation for the parts of FP that it does implement.
Thanks for referring me to the Discord. I'm not actively working on this, since researching these regexp engines reactivated my interest in historical engines. Once I finish my archival work for the Plan 9 regexp engines, I'll come back to this. That feels easier, since avm2 RegExp was never formally specified or open-sourced. But, my sibling wants to play Ewok Village, which is blocked by this issue, so I haven't forgotten :). I'll lurk for now on Discord, but be more active when I have more questions.
Sorry, I missed your earlier questions :(
> What is the relationship between avmplus and Adobe Flash Player? Was avmplus a component that was included directly?
Our understanding is that contents of avmplus repo were directly included as part of FP implementation at some point in time. You can treat it as sort-of Chromium vs Chrome parallel? We've been treating avmplus repo as-if it was FP's AVM2 source and so far we haven't observed any contradictory behavior, I think.
So my personal understanding of AS3's RegExp semantics is "it does whatever PCRE does".
Okay that's good to know and is really helpful. I think avmplus modifies their PCRE copies, so I'll start with finding where it diverges from upstream PCRE to get a sense for what they changed.
AS3 `RegExp` uses the same syntax as ECMAScript 3, according to the Developer's Guide. However, there are instances where the syntaxes diverge. Furthermore, Ruffle uses `regress`, which supports the syntax of ECMAScript 2018.

I've surveyed the Ruffle issues and PRs for anything with “regex” or “regexp”. All issues are cases where ECMAScript 3 and AS3 syntaxes differ:

- `(?P<…>)` named captures exist in AS3 (#13278, #10395, #10511). ECMAScript 3 has no named captures (see 15.10.1). ECMAScript 2018 has `(?<…>)` named captures (see 21.2.1).
- `/…/x` extended flag exists in AS3 (#13965), but not ECMAScript 3 (see 15.10.4.1) or ECMAScript 2018 (see 12.2.8.1).

If sticking with an existing library, keeping `regress` would be better than switching to `regex`, because `regex` deliberately does not support backreferences (a far more significant feature than any of those delineated above) and its syntax is derived from RE2 rather than ECMAScript, so it has larger differences.

Is there a document that better specifies the syntax of AS3 `RegExp`, such as a language specification or standard implementation? (I am new to AS3.) The mention in the Developer's Guide does not seem normative and has contradictions.

I've done significant work on regular expression engines and have been bitten before by differing syntaxes between languages, so that's a problem I'm interested in tackling in a general-purpose way, and Ruffle could benefit from that effort. Now that `regex-automata` exposes its HIR, other crates could handle parsing and generate HIR. If a backtracking engine were added to `regex-automata`, it could fall back to it when backreferences are used, while still having the extremely fast performance of `regex` when not using backtracking. If there is interest in Ruffle, that could be my motivation to pursue this.

An easier approach would be to extend `regress` to conditionally handle AS3 syntax, since it's already close. If matching performance is not a goal for Ruffle, like it is for `regex`, then this would be fine.
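To illustrate what conditionally handling AS3 syntax might involve for the extended flag, here is a minimal Rust sketch that preprocesses a pattern before it reaches an ECMAScript engine. It assumes PCRE-like `x` semantics: unescaped whitespace outside character classes is dropped, and an unescaped `#` outside a class starts a comment that runs to the end of the line. The name `strip_extended` is hypothetical and is not a `regress` API; a real implementation would more likely handle this inside the parser, and the exact whitespace set AS3 uses is not verified here.

```rust
/// Remove insignificant whitespace and `#` comments from a pattern,
/// approximating the extended (`x`) flag.
/// Hypothetical helper for illustration; not an existing regress API.
fn strip_extended(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    // Simplified `[...]` tracking (assumption: no leading literal `]`
    // and no POSIX classes).
    let mut in_class = false;

    while let Some(c) = chars.next() {
        match c {
            // Keep escapes intact, so `\ ` and `\#` stay literal.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // A comment runs up to and including the next newline.
            '#' if !in_class => {
                for skipped in chars.by_ref() {
                    if skipped == '\n' {
                        break;
                    }
                }
            }
            // Drop unescaped ASCII whitespace outside character classes.
            ws if !in_class && ws.is_ascii_whitespace() => {}
            _ => out.push(c),
        }
    }
    out
}
```

Whitespace inside a class such as `[a b]` is preserved, which matters because it is significant there even in extended mode.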