wasmerio / wai

A language binding generator for `wai` (a precursor to WebAssembly interface types)
Apache License 2.0

Support more languages / ensure Web compatibility? #21

Open dcodeIO opened 1 year ago

dcodeIO commented 1 year ago

There are some long-standing issues with Interface Types (or nowadays the Component Model) that make it a hazard for many popular languages and their Web integration, for example through eagerly diverging from WebIDL and instead picking blessed languages.

Now WAI's README states "enabling WebAssembly modules to be written in any language" as a goal, which I hope WAI will be able to achieve, yet it is still based on and "will be following upstream changes" of the exact proposal that has been carefully skewed to strongly disadvantage all but one family of programming languages. In particular and ironically, Interface Types does not support safe integration with Web APIs and JavaScript for any language that shares concepts and semantics common on the Web, and the problem extends to use cases where a non-blessed language tries to interface with a module written in the same or a compatible language.

Possible improvements would be, for example, to drop the unnecessarily restrictive char type and instead specify an actually language-neutral codepoint type, in turn splitting string into domstring (ECMAScript, Java, C#, Dart, Kotlin, etc.) and usvstring (Rust, C++) with a coercion in between (as WebIDL already does), or to revise the syntax a little so it is less inspired by one of the blessed languages and better fits the syntax commonly seen on the Web. Does this sound like something WAI would be OK with? :)

syrusakbary commented 1 year ago

Hey @dcodeIO, sorry for taking a bit to reply.

Would it be OK for WAI to diverge from Interface Types, so more languages are supported?

Yup. 100%. WAI needs to be completely agnostic between languages. We started from Interface Types because it was way easier for us to start with. But WAI aims towards unbiased compatibility with any language.

Would it be in the interest of WAI to take compatibility with the Web platform into account?

Yup. 100% as well. The Web needs to be taken into account, especially because one of the main targets of WebAssembly is (and will still be) the web in the foreseeable future.

Possible improvements would be, for example, to drop the unnecessarily restrictive char type and instead specify an actually language-neutral codepoint type, in turn splitting string into domstring (ECMAScript, Java, C#, Dart, Kotlin, etc.) and usvstring (Rust, C++) with a coercion in between (as WebIDL already does)

I believe Interface Types as it is now supports one string type, and it will be treated as usvstring or domstring depending on the host string-encoding (although they use other nomenclature), so it might be possible to cleanly support it through the current Interface Types definition (note: I intentionally avoided calling it a standard, since standards require broad adoption by definition).

See the canonical string encoding definition:

canonopt ::= string-encoding=utf8
           | string-encoding=utf16
           | string-encoding=latin1+utf16

However, if this current definition is not good enough, or we need to deviate to achieve the two main objectives of WAI ("enabling WebAssembly modules to be written -and used- in any language"), that should be completely ok for the project.

So, where are the differences?

Currently, WAI and Interface Types differ in the DSL: they have started supporting interface "dependencies" through worlds, while we are intentionally skipping that. Apart from this, they seem to be working towards the concept of ownership, which seems to us like an early optimization that might make usage more complex for the current use cases.

Looking at WebIDL

I believe that the web has done a lot towards universalizing APIs, and it would be a mistake not to leverage most of that work, especially because those APIs work anywhere (from WebAudio to WebGL, WebGPU, and many more). I believe that an easy mapping from WebIDL to WAI could be very beneficial towards having universal apps that run anywhere.

dcodeIO commented 1 year ago

Yup. 100%. WAI needs to be completely agnostic between languages. We started from Interface Types because it was way easier for us to start with. But WAI aims towards unbiased compatibility with any language.

👍

Yup. 100% as well. The Web needs to be taken into account, especially because one of the main targets of WebAssembly is (and will still be) the web in the foreseeable future.

👍

I believe Interface Types as it is now supports one string type, and it will be treated as usvstring or domstring [...]

This is a common misconception that I believe is a result of the sheer amount of not-so-honest communication surrounding the topic. In fact, neither utf16 nor latin1+utf16 addresses this in any way; they merely tackle a comparatively minor performance aspect: double re-encoding through a UTF-8 peephole. However, the elephant in the room remains that a list of chars (if char = Unicode Scalar Value) cannot, per definition, represent all valid strings in Java-like languages. The latter is a design blunder as serious as it gets in any engineering discipline, really.

In a nutshell, a DOMString can represent any list of Unicode code points except surrogate code point pairs, whereas a USVString (equivalent to Interface Types' string ↦ (list char)) cannot. This means, in practice, that only UTF-8-based languages will function correctly, since strict UTF-16 is rarely used, and almost all the languages commonly considered to use UTF-16 actually use WTF-16 / DOMString (a superset of UTF-16) due to how their string APIs are designed, which today cannot simply be broken in a backwards-incompatible way.

That leaves two bad options for affected languages and JS interop: either a) trap when an unsupported string is transferred between two modules written in the same language, between a module and a compatible language, or to/from JavaScript, or b) replace unsupported code points with U+FFFD in such cases, even when unnecessary. Interface Types has chosen the latter for affected languages, leading to strings that cross a boundary no longer comparing equal, no longer looking up the correct bucket in a hash map, and similar hazards.
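To make the hazard concrete, here is a minimal Rust sketch (plain standard library, not WAI-specific; the code-unit values are made up for illustration) of what options a) and b) look like for a WTF-16 payload containing a lone surrogate:

fn main() {
    // WTF-16 code units: 'a', a lone high surrogate (0xD800), 'b'.
    // Representable as a DOMString / JS string, but not as a list of
    // Unicode scalar values.
    let wtf16: [u16; 3] = [0x0061, 0xD800, 0x0062];

    // Option a): trapping. String::from_utf16 rejects the lone surrogate.
    assert!(String::from_utf16(&wtf16).is_err());

    // Option b): sanitizing. The lone surrogate becomes U+FFFD.
    let sanitized = String::from_utf16_lossy(&wtf16);
    assert_eq!(sanitized, "a\u{FFFD}b");

    // Re-encoding the sanitized string no longer matches the original
    // code units: exactly the equality / hash-lookup hazard described above.
    let roundtripped: Vec<u16> = sanitized.encode_utf16().collect();
    assert_ne!(roundtripped.as_slice(), &wtf16[..]);
}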

And indeed, that's why WebIDL has two semantically distinct string types: Because these are semantically distinct. In a broader language context, DOMString maps to languages like JavaScript, C#, Java, Dart, Kotlin, etc. that evolved from UCS-2 to UTF-16 but figured that UTF-16 doesn't quite cut it for them in practice, while USVString maps to newer languages like Rust that insisted on UTF-8 from the start and were OK with disallowing surrogate code points.

Long story short, if WAI supported both usvstring ↦ (list char) and a superset domstring with a coercion in between that is only performed when absolutely necessary, then it would mirror WebIDL. Any functionally equivalent definition would do as well, of course.
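For illustration, a rough Rust sketch of "coercion only when absolutely necessary" (the helper name and Cow-based signature are hypothetical): a DOMString payload is passed through untouched when it is already well-formed UTF-16, and only lone surrogates trigger the lossy step.

use std::borrow::Cow;

// Hypothetical helper, not part of WAI: pass the payload through when it is
// already valid UTF-16, otherwise sanitize (lone surrogates become U+FFFD).
// (Validation via from_utf16 allocates; a real implementation would validate
// without allocating.)
fn coerce_to_usvstring_if_needed(units: &[u16]) -> Cow<'_, [u16]> {
    match String::from_utf16(units) {
        // Already a valid USVString payload: zero-copy passthrough.
        Ok(_) => Cow::Borrowed(units),
        // Contains lone surrogates: coerce, then re-encode the code units.
        Err(_) => {
            let owned: Vec<u16> = String::from_utf16_lossy(units).encode_utf16().collect();
            Cow::Owned(owned)
        }
    }
}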

I suspect that there are more questionable choices in Interface Types that skew it towards one family of programming languages, but so far I haven't gotten to any of them. To me, strings are so very basic and critical that they seem like the decider for whether WebAssembly will indeed be polyglot. If this is not sorted out, Java-like languages will likely never gain a foothold.

For comparison, see this stringref issue (not linking directly) that, perhaps surprisingly to some, backs the argument:

Efficient interop with a web embedding is a goal, so we have to deal with the possibility of lone surrogates.

there might be existing programs that (intentionally or accidentally) use lone surrogates, and people might now want to compile these programs to Wasm; it would be quite unfortunate (and, in fact, might well block deployments) if the Wasm version didn't behave the same as the original (or possibly even didn't work at all) because Wasm disallowed lone surrogates.

syrusakbary commented 1 year ago

Ok, here's the idea that I have regarding strings. Basically, the challenge I see is how both AssemblyScript (DOMString) and Rust (USVString) can use the same definition of WAI. If we use different string types, then the WAI (by definition) will be different for each (which is non-ideal, I believe). However, I think it may be possible to "use" only one string type (a fake type, in a sense) that maps to the correct one once the program is compiled.

On that front, this string type will be a DOMString or USVString depending on the program.

Let's say that we want to enable the WebAudio API to be runnable both on servers and on clients.

There are different use cases to analyze:

The AssemblyScript program would be compiled to a Wasm module indicating that the string type in this program is expected to be treated as a DOMString. The Rust program would be compiled to a Wasm module indicating that the string type in this program is expected to be treated as a USVString.

AssemblyScript program using WebAudio WAI running in a browser

AssemblyScript -> Wasm generated indicating that string type is DOMString -> gets used in the browser as DOMString -> No conversion required whatsoever

AssemblyScript program using WebAudio WAI running in a Rust server

AssemblyScript -> Wasm generated indicating that string type is DOMString -> The Rust server converts the DOMString into a USVString to operate with the program and vice versa

Rust program using WebAudio WAI running in a browser

Rust -> Wasm generated indicating that string type is USVString -> The browser converts the USVString into a DOMString to operate with the program and vice versa

Rust program using WebAudio WAI running in a Rust server

Rust -> Wasm generated indicating that string type is USVString -> gets used in the Rust server as USVString -> No conversion required whatsoever
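A rough host-side sketch in Rust of the dispatch described above; the DeclaredStringType flag, the byte layout, and the function name are assumptions for illustration, not an existing WAI API:

// Hypothetical per-module flag, as in the scenarios above.
enum DeclaredStringType {
    DomString, // module traffics in WTF-16 code units (lone surrogates allowed)
    UsvString, // module traffics in UTF-8 (Unicode scalar values only)
}

// What a Rust host would do when receiving a string argument from a module.
fn lift_into_rust_host(decl: DeclaredStringType, bytes: &[u8]) -> String {
    match decl {
        // USVString -> USVString: no conversion, just validate the UTF-8.
        DeclaredStringType::UsvString => {
            String::from_utf8(bytes.to_vec()).expect("module promised valid UTF-8")
        }
        // DOMString -> USVString: decode WTF-16 and coerce lone surrogates
        // to U+FFFD (the lossy step discussed in this thread).
        DeclaredStringType::DomString => {
            let units: Vec<u16> = bytes
                .chunks_exact(2)
                .map(|c| u16::from_le_bytes([c[0], c[1]]))
                .collect();
            String::from_utf16_lossy(&units)
        }
    }
}

fn main() {
    assert_eq!(lift_into_rust_host(DeclaredStringType::UsvString, b"hi"), "hi");
}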


Thoughts @dcodeIO ?

dcodeIO commented 1 year ago

Right, this gives us the four cases:

Source string   Destination string   Description
DOMString       DOMString            All good
USVString       USVString            All good
USVString       DOMString            All good, because DOMString ⊇ USVString
DOMString       USVString            Loses information (lone surrogates are replaced with U+FFFD)

Given that different kinds of modules might be merged, and such modules might be optimized before being described with a WAI interface, it seems sensible to indicate the string type per parameter/return to be flexible enough. One way to indicate this is distinct type names like domstring and usvstring (with an automatic coercion in between), which I guess is the most straightforward. Another is to have just one string type with an optional passthrough=true flag or similar that, if both ends specify the flag, permits DOMString semantics while otherwise falling back to USVString*. Do you have another indication mechanism in mind?

(*) Perhaps it even makes sense to invert this option, so that string in WAI and the anticipated stringref in Wasm work the same out of the box. The flag could then be wellFormed=true or sanitize=true to enforce coercion to USVString.

dcodeIO commented 1 year ago

Alternatively, say we'd build upon Interface Types in a minimal way while also striving to make it compatible with the current state of stringref, a potential set of changes could be

-char
-string ↦ (list char)
+codepoint
+string ↦ (list codepoint)

Here, char, which cannot represent some code units in a DOMString, is replaced with codepoint, which can. This introduces an oddity, discussed below.

 canonopt ::= string-encoding=utf8
-           | string-encoding=utf16
-           | string-encoding=latin1+utf16
+           | string-encoding=wtf16
+           | string-encoding=latin1+wtf16

Here, the encodings utf16 and latin1+utf16 are replaced with stringref's wtf16, matching Web/language reality.

The prime oddity in this design is that some (list codepoint)s are invalid in all of the given encodings, namely when such a list contains a surrogate codepoint pair (in UTF-8 that's invalid; in WTF-16 the pair would have been decoded to a supplementary codepoint early on). This is a non-issue as long as strings are only lifted with a canonopt encoding, but that cannot be guaranteed since a user might use (list codepoint) directly. So this must be addressed somehow, either by taking such sequences into account when lowering, or by making string, say, an algorithmically described type instead of a (list X).
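A minimal Rust sketch of the check such a lowering would have to perform (the function name and the u32 representation of codepoint are just for illustration, not WAI code):

// A (list codepoint) is only encodable if it never contains a high surrogate
// immediately followed by a low surrogate: WTF-16 would have decoded that
// pair to a single supplementary code point, and UTF-8 forbids surrogates.
fn is_encodable(codepoints: &[u32]) -> bool {
    codepoints.windows(2).all(|w| {
        let high = (0xD800..=0xDBFF).contains(&w[0]);
        let low = (0xDC00..=0xDFFF).contains(&w[1]);
        !(high && low)
    })
}

fn main() {
    // A lone surrogate is fine under this definition (WTF-16 can carry it)...
    assert!(is_encodable(&[0x0061, 0xD800, 0x0062]));
    // ...but a surrogate pair spelled out as two code points has no encoding.
    assert!(!is_encodable(&[0xD83D, 0xDE00]));
}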

Michael-F-Bryan commented 1 year ago

If we use different string types, then the WAI (by definition) will be different for each (which is non-ideal, I believe). However, I think it may be possible to "use" only one string type (a fake type, in a sense) that maps to the correct one once the program is compiled.

This might seem like a dumb question, but is using different string types actually a non-starter? The problem seems to be that we want to use the name string for multiple different types of data, and then we layer complexities like a module-wide implicit option (canonopt) and implicit coercions on top to make it work.

As a Rust programmer, I would be perfectly fine with having different string types in my interface (e.g. the Rust generator might call them wai::DomString, wai::UsvString, and std::string::String). Different objects contain different data and have different semantics, so they should get different types.

As long as there is some sort of conversion to simplify switching between string types it'd be fine. In some ways this could be better because I might be able to use the wai::DomString directly instead of first converting to a std::string::String.
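For illustration, a rough sketch of what such a generated wai::DomString could look like in Rust (the type shape and method names are hypothetical; this is not something the generator produces today):

// Hypothetical generated type: raw WTF-16 code units, lone surrogates allowed.
pub struct DomString {
    units: Vec<u16>,
}

impl DomString {
    /// Lossless conversion, only possible when the data is well-formed UTF-16.
    pub fn to_string_lossless(&self) -> Option<String> {
        String::from_utf16(&self.units).ok()
    }

    /// Explicit, opt-in lossy coercion (lone surrogates become U+FFFD),
    /// so the semantic and performance cost stays visible at the call site.
    pub fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.units)
    }
}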

Regardless of what solution we come up with, it's always going to be a leaky abstraction with the potential for performance cliffs (e.g. due to lots of implicit coercions) or weird gotchas (e.g. strings no longer being equal after being sent to/from wasm)... So, why not let the user deal with those footguns upfront by dropping string and making them pick something with the right encoding?

dcodeIO commented 1 year ago

Spec-wise, when using a single string type (defined as the common denominator "list of code points except surrogate code point pairs"), we'd end up with one set of string-encodings where sanitization is implicit in the encoding used when lowering (lowering to utf8 sanitizes, the others do not):

canonopt ::= string-encoding=utf8
           | string-encoding=wtf16
           | string-encoding=latin1+wtf16

A WAI interface would then define type string with the appropriate string-encoding, unless the encoding is the default.

When using two distinct types domstring and usvstring, sanitization is implicit in the type used, whereas each then naturally maps to its applicable string-encodings:

canonopt_usvstring ::= string-encoding=utf8

canonopt_domstring ::= string-encoding=wtf16
                     | string-encoding=latin1+wtf16

or

canonopt ::= usvstring-encoding=utf8
           | domstring-encoding=wtf16
           | domstring-encoding=latin1+wtf16

Here, a WAI interface would define both type and encoding, unless the encoding is the default. Just pinning this here in case it helps to visualize the effects of either direction.