ruffle-rs / ruffle

A Flash Player emulator written in Rust
https://ruffle.rs

AS3 `RegExp` syntax incompatibilities #14651

Open thaliaarchi opened 7 months ago

thaliaarchi commented 7 months ago

AS3 RegExp uses the same syntax as ECMAScript 3, according to the Developer's Guide. However, there are instances where the syntaxes diverge. Furthermore, Ruffle uses regress, which supports the syntax of ECMAScript 2018.

I've surveyed the Ruffle issues and PRs for anything mentioning “regex” or “regexp”. Every one is a case where the ECMAScript 3 and AS3 syntaxes differ:

If sticking with an existing library, keeping regress would be better than switching to regex, because regex deliberately does not support backreferences (a far more significant feature than any of those delineated above) and its syntax is derived from RE2 rather than ECMAScript, so it differs from AS3 even more.
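To make the backreference point concrete, here is a minimal sketch using Python's standard `re` module (chosen only for illustration; it is neither regress nor regex). The helper name `find_doubled_words` is hypothetical. The pattern relies on `\1`, which an engine without backreferences cannot express.

```python
import re

# \1 must match exactly the text that group 1 captured.
# The \b anchors keep "this is" from matching on the shared "is".
DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")


def find_doubled_words(text: str) -> list[str]:
    """Return each word that is immediately repeated in `text`."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]


if __name__ == "__main__":
    print(find_doubled_words("this is is a test test case"))
```

Any SWF whose patterns depend on this kind of match would break under an engine that rejects `\1`.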

Is there a document that better specifies the syntax of AS3 RegExp, such as a language specification or standard implementation? (I am new to AS3.) The mention in the Developer's Guide does not seem normative and has contradictions.

I've done significant work on regular expression engines and have been bitten before by differing syntaxes between languages, so that's a problem I'm interested in tackling in a general-purpose way, and Ruffle could benefit from that effort. Now that regex-automata exposes its HIR, other crates could handle parsing and generate HIR. If a backtracking engine were added to regex-automata, it could fall back to that engine when backreferences are used, while keeping the extremely fast performance of regex for patterns that don't need backtracking. If there is interest in Ruffle, that could be my motivation to pursue this.
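The fallback idea can be sketched roughly as follows. This is an illustrative heuristic, not part of regex-automata or regress: the function names and the `"backtracking"` / `"automata"` labels are hypothetical, and character-class handling is simplified (a literal `]` placed first in a class is not recognized).

```python
def uses_backreference(pattern: str) -> bool:
    # Heuristic scan for \1..\9, \k<name>, and (?P=name).
    # Escaped characters are skipped as a pair, and escapes inside [...]
    # are ignored because they are not backreferences there.
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if not in_class:
                if nxt in "123456789" or pattern.startswith("k<", i + 1):
                    return True
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif not in_class and pattern.startswith("(?P=", i):
            return True
        i += 1
    return False


def choose_engine(pattern: str) -> str:
    # Route to the slower backtracking engine only when it is required.
    return "backtracking" if uses_backreference(pattern) else "automata"
```

A real implementation would make this decision from the parsed pattern rather than from the raw string, but the routing logic would be the same.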

An easier approach would be to extend regress to conditionally handle AS3 syntax, since it's already close. If matching performance is not a goal for Ruffle, as it is for regex, then this would be fine.

adrian17 commented 7 months ago

My understanding is that FP internally uses the PCRE library (not sure whether pcre1 or pcre2 is used in production FP), with all/most of its syntax. I wouldn't be surprised if the most compatible solution would be to get a linkable (as in, buildable for wasm) build of the library itself.

Unfortunately I don't know of any good reference documentation, but we can cross-reference the avmplus repo instead: https://github.com/adobe/avmplus/blob/master/core/RegExp.cpp and https://github.com/adobe/avmplus/blob/master/core/RegExpObject.cpp

(you can also see minor weird edge cases being explicitly handled there, like only `(?P<` triggering filling named groups' fields, despite `(?<` also being supported by pcre AFAIK)
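A rough sketch of that edge case, assuming the behavior is as described in this comment (it has not been checked against the avmplus source here): only groups opened with `(?P<name>` contribute a named field, while `(?<name>` groups still match but are skipped. The function name `exposed_group_names` is hypothetical, and character classes are not handled.

```python
import re

# Matches only the `(?P<name>` opener, not `(?<name>`.
_P_NAMED_GROUP = re.compile(r"\(\?P<([A-Za-z_][A-Za-z0-9_]*)>")


def exposed_group_names(pattern: str) -> list[str]:
    """List group names that would be filled in as named fields."""
    names = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # an escaped character never opens a group
            continue
        m = _P_NAMED_GROUP.match(pattern, i)
        if m:
            names.append(m.group(1))
            i = m.end()
        else:
            i += 1
    return names
```

For a pattern such as `(?P<year>\d{4})-(?<month>\d{2})`, this sketch reports only `year`.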

> If matching performance is not a goal for Ruffle

We generally put correctness first :) (and generally, other aspects of Ruffle are slow enough that I don't think we need to microoptimize here)

thaliaarchi commented 7 months ago

What is the relationship between avmplus and Adobe Flash Player? Was avmplus a component that was included directly? Included with proprietary changes? Or was it an open-sourced version of components in Adobe Flash Player that was occasionally synced? Or something else?

thaliaarchi commented 7 months ago

I've reviewed the AS3 spec and each of the similar projects listed in Ruffle's Helpful Resources for how they handle RegExp:

The approaches in Shumway and AwayFL remind me of how Scala.js compiles regular expressions (talk, release notes). They compile Java regex patterns to semantically equivalent JavaScript patterns, so that the native RegExp can be used, with all the browser optimizations that come with it.

I think a port of the Shumway approach to Rust, with fixes from AwayFL as appropriate, would be easiest for Ruffle. Shumway's algorithm was written for ECMAScript 5.1, so the only differences should be those introduced by regress implementing ECMAScript 2018. It would allow some modern regular expression features that AS3 never had, but would better work around the other differences. It would be strictly better than Ruffle's current situation, though not perfect.
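As an illustration of what a pattern-to-pattern translation can look like, here is a minimal sketch of a single rewrite: PCRE's Python-style named groups `(?P<name>...)` and `(?P=name)` become the ECMAScript 2018 forms `(?<name>...)` and `\k<name>`. This is not taken from Shumway or AwayFL; the function name is hypothetical, and character classes are not handled.

```python
def translate_named_groups(pattern: str) -> str:
    """Rewrite (?P<name>...) and (?P=name) into ECMAScript 2018 syntax."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy escape sequences verbatim
            i += 2
        elif pattern.startswith("(?P<", i):
            out.append("(?<")  # named group opener
            i += 4
        elif pattern.startswith("(?P=", i):
            end = pattern.find(")", i)
            if end == -1:  # unterminated: leave the rest untouched
                out.append(pattern[i:])
                break
            out.append("\\k<" + pattern[i + 4:end] + ">")  # named backreference
            i = end + 1
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

A full translator would apply many such rewrites over a parsed pattern and would also have to reject or emulate constructs that have no equivalent in the target syntax.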

n0samu commented 7 months ago

@thaliaarchi Thanks for looking into different approaches! Sadly I don't know enough to have anything to add here. But I'm looking forward to improvements in this area since it may help with #14938 (and a few other issues you saw already). Please do join our Discord server if you can - it's a lot easier to get feedback and bounce ideas off the devs there!

As for avmplus, we don't really know its exact relationship to Adobe Flash Player, but it's generally a good reference implementation for the parts of FP that it does implement.

thaliaarchi commented 7 months ago

Thanks for referring me to the Discord. I'm not actively working on this right now, since researching these regexp engines reactivated my interest in historical engines. Once I finish my archival work for the Plan 9 regexp engines, I'll come back to this. The archival work feels easier, since avm2 RegExp was never formally specified or open-sourced. But my sibling wants to play Ewok Village, which is blocked by this issue, so I haven't forgotten :). I'll lurk on Discord for now, and be more active once I have more questions.

adrian17 commented 7 months ago

Sorry, I missed your earlier questions :(

> What is the relationship between avmplus and Adobe Flash Player? Was avmplus a component that was included directly?

Our understanding is that the contents of the avmplus repo were directly included as part of the FP implementation at some point in time. You can treat it as a sort-of Chromium vs Chrome parallel? We've been treating the avmplus repo as if it were FP's AVM2 source, and so far we haven't observed any contradictory behavior, I think.

So my personal understanding of AS3's RegExp semantics is "it does whatever PCRE does".

thaliaarchi commented 7 months ago

Okay, that's good to know and really helpful. I think avmplus modifies their PCRE copies, so I'll start by finding where it diverges from upstream PCRE to get a sense of what they changed.