RFC: Glulx byte swap instructions

iftechfoundation / ifarchive-if-specs

Specification documents for the Glk, Glulx, and Blorb standards

15 stars 4 forks source link

RFC: Glulx byte swap instructions #13

Open dfoxfranke opened 2 months ago

dfoxfranke commented 2 months ago

This issue is a proposal to add byte-swap instructions to Glulx. Its purpose for now is just to take the temperature of maintainers. If there's interest in moving forward, I'll follow up with a pull request here to add an appropriate section to the specification, and with PRs to gluxe, quixe, and git to implement it.

This addition is motivated by a project I'm working on to translate WebAssembly into Glulx, making it possible to develop IF in any general-purpose language with a compiler that can target WASM, and have it run seamlessly on existing Glulx interpreters. Currently, an impediment to generating efficient code is that WebAssembly is little-endian while Glulx is big-endian, and swapping requires numerous instructions which would have to be executed before or after every main memory access.

I propose to add six new instructions to Glulx, and one new gestalt ID to indicate their availability.

swap L1 S1

Swap the bytes of L1 and store the result in S1. 0x01020304 becomes 0x04030201.

swaps L1 S1

Swap the two high bytes of L1 with each other, and the two low bytes with each other, and store the result in S1. 0x01020304 becomes 0x02010403.

astoreswap L1 L2 L3

Swap the bytes of L3 and then store it into the 32-bit field at main memory address (L1+4*L2). 0x01020304 is stored as 0x04030201.

aloadswap L1 L2 S3

Load a 32-bit value from main memory address (L1+4*L2), and store it in S1 with its bytes swapped. 0x01020304 is stored as 0x04030201.

astoreswaps L1 L2 L3

Swap the two low bytes of L3 and store them into the 16-bit field at main memory address (L1+2*L2). 0x0102 is stored as 0x0201.

aloadswaps L1 L2 S3

Load an 16-bit value from main memory address (L1+2*L2), and store it in S1 with its bytes swapped. 0x0102 is stored as 0x0201.

swap, aloadswap, and astoreswap will occur frequently in generated code and should ideally have single-byte opcodes. The 16-bit versions are less important and should have two-byte opcodes to conserve numbering space.

DavidKinder commented 2 months ago

The decision on this is ultimately up to @erkyrath, but as you asked me to comment (as Git maintainer) ...

First thought: How sure are you that this is sufficient to support a translation of WebAssembly to Glulx? I dimly recall, years ago, working with a DEC Alpha, which in principle could cope with either endianess. My recollection is that that caused all sorts of awkward corner cases. So are these opcodes based on an analysis, or more of a gut feel for what would be needed?

Second thought: Personally I've no objection to new opcodes, but I'm reluctant to support adding them without a real (as opposed to theoretical) use case, especially as they don't add any functionality, they allow a particular use case to run faster. But we already have a mechanism to handle speeding up particular functions that are used a lot: the accelfunc opcode. Currently this is used to re-implement some of the most used Inform 6 veneer functions inside interpreters, but there's no reason why it has to be limited to veneer functions.

Suggestion: If it were me, I would implement the swapping opcodes you want as functions in the Glulx code. Once you've got something that works it would be possible to profile it, and then add support to the main interpreters to replace those functions via the accelfunc opcode, if needed. During development it would be easy enough to add local support for accelerated functions to keep performance up, if that was an issue. That would mean that the result would work (if more slowly) on older interpreters and that interpreter updates would only be needed once you know that the project works.

dfoxfranke commented 2 months ago

Using accelfunc is an excellent idea, actually! I completely skimmed over that section while I was reading the Glulx spec because I had the impression that it was Inform-specific and not useful otherwise. Now I see that it's extensible and would fit into this use case quite nicely.

dfoxfranke commented 2 months ago

To answer your "first thought" paragraph: I'm confident that these operations (regardless of whether they're implemented as opcodes or through accelfunc) would be sufficient to address endianness issues, because WebAssembly is a load/store architecture which only has a handful of instructions which take memory operands, and those are the only instructions which have any defined endianness, and I've looked at all those and concluded operations would be sufficient to translate them cleanly. Just like in Glulx, in WebAssembly the stack is separate from memory and words on the stack are just words, with no inherent endianness. Comparing to your experience with the Alpha, translating from one VM which is always little-endian to another VM which is always big-endian is an altogether cleaner problem than having one machine that supports both and trying to get modules to interoperate despite different assumptions.

Anyway, I'm leaving this issue open for now, but going the accelfunc route I don't think it needs any action from @erkyrath at this time because the extensions to accelfunc can go into implementations first and the spec later. Accelfunc's behavior of "if the function number isn't recognized, do nothing" means implementers don't have to agree on anything in advance; if different implementations use different numbers for the same function, I can just call accelfunc with all of them, and if it's later standardized I can just add the standard number onto the list.

erkyrath commented 2 months ago

I don't have a lot to add here.

There's several ideas floating around for "how to integrate Inform code with other systems". I7's C back end; proposal for a C# I7 back end. I think there was a new I6 back end proposal at some point, although I can't find it now.

Compiling WASM to Glulx is obviously a possible path, but it's not obviously the best possible path or the one that will be successful. Given this, it seems premature to start messing with the spec.

I like the accelfunc approach for this.

In the past, we've also handed out ranges of opcodes for private experimentation. Not one-byte opcodes, though.

dfoxfranke commented 2 months ago

There's several ideas floating around for "how to integrate Inform code with other systems". I7's C back end; proposal for a C# I7 back end. I think there was a new I6 back end proposal at some point, although I can't find it now.

I haven't followed these efforts but it sounds like they have a different goal than I do. I'm not trying to translate Inform into other languages or make it interoperate with other languages. I'm trying to create a toolchain for authors who want to develop in something other than Inform, but want to be able to distribute their games as a Glulx or Glulx/Blorb file so that players have a convenient and familiar user experience. I suppose it could also enable you to compile something that Inform can make FFI calls into; that's not a goal of mine, but if it's something that Graham is interested in doing then I'll try to support him. It would require there be some concept of Glulx object files that a linker can combine into a complete story file, but that seems fairly doable and I could probably extend the assembler I just finished writing to support something like that.

In the past, we've also handed out ranges of opcodes for private experimentation. Not one-byte opcodes, though.

I'm sold on the accelfunc approach, so the opcodes won't be necessary, though would you mind giving me an allocation of accelfunc and accelparm numbers? Of course I don't care about operand size for these since they only need to appear once in the program.

curiousdannii commented 2 months ago

Are you aware of the glulx-llvm experiment? https://github.com/dfremont/glulx-llvm

It seems more ideal, skipping the whole wasm stage.

DavidKinder commented 2 months ago

I'm sold on the accelfunc approach, so the opcodes won't be necessary, though would you mind giving me an allocation of accelfunc and accelparm numbers?

This is very much @erkyrath's choice, but I'd favour only allocating numbers like this in the specification once there's a project out there that is at least somewhat useable by the community. No offence is intended, but opcode ranges have been specified for people before and in general nothing much has been done with them, probably unsurprisingly.

Looking at the non-standard opcode ranges listed in the specification (and wandering off topic, sorry), we have

0x1000 to 0x10FF for FyreVM. That is a real project, but it only ever had one custom opcode, 0x1000 fyrecall.
0x1100 to 0x11FF for Dannii Willis. I don't believe @curiousdannii ever used these?
0x1200 to 0x12FF for iOS extension features by Andrew Plotkin. Did you use these? Presumably yes :)
0x1400 to 0x14FF for "ZZO38". I'm pretty sure these have never been used.
0x7900 to 0x79FF for experimental features in Git. Oops! There are actually only two opcodes here, 0x7940 and 0x7941. I've never liked these, and I've filed an issue to remind me to the remove them. Once they're gone I'll create a PR for the specification to recast this as historical.

dfoxfranke commented 2 months ago

This is very much @erkyrath's choice, but I'd favour only allocating numbers like this in the specification once there's a project out there that is at least somewhat useable by the community.

I don't need them in the spec immediately, I just need something to use that won't collide with something allocated to someone else later, which can then maybe go into the spec eventually.

dfoxfranke commented 2 months ago

Are you aware of the glulx-llvm experiment? https://github.com/dfremont/glulx-llvm

I wasn't, but the approach is one that I considered and rejected. Maintaining an LLVM backend means keeping up with a fast-moving target, because the LLVM IR is constantly growing and many compilers (I'm thinking Rust here, which is what I'm developing in) demand the absolute bleeding edge. WASM changes more slowly, being a web standard that has to support many interoperable implementations. It has a lot of extensions now, but the extensions are mostly things that the compiler won't generate unless you ask for them and give it code that has a specific need for them. It's also easier to begin with, because WASM is already a lot closer to Glulx than LLVM-IR is.

dfoxfranke commented 2 months ago

Here's my plan, currently. I've made a list of all the WASM instructions that are going to be translated as function calls into my runtime library. 48 functions are needed; most of them that aren't related to byte-swapping are related to 64-bit arithmetic. That list doesn't include the SIMD instructions, which I'm not going to bother supporting in the first release, but will eventually. Some of these, e.g. the 64-bit bitwise operators, are certainly not going to benefit appreciably from acceleration. But, I plan to specify accelfuncs for all of the scalar functions (not the vector ops, though — there are another 223 of those and that's just ridiculous) and leave it up to terp maintainers how many to bother with. Just supporting swap will probably yield as much benefit as all the rest combined.

I definitely don't advocate cluttering the Glulx spec with all of these. I'm defining the base of the range of accelfunc and accelparm numbers as a compile-time constant: 0x574100 ("WA") for now unless/until Zarf asks me to use something different. When wasm2glulx is released, I'll post a specification for the behavior of these functions at a stable URL. Then I'll submit a PR here just adding one sentence to the spec: "Functions and parameters numbered $RANGE_START through $(($RANGE_START + 255)) are reserved for use by wasm2glulx; they are specified at $URL".

dfoxfranke commented 1 month ago

Quick update... I now have 100% of the WebAssembly 1.0 spec (and a bunch of the 2.0 draft) implemented in wasm2glulx and fully passing the compliance test suite. I'm now working on loose ends and documentation and plan to announce a release late this week, along with announcing a new port of Adventure that I made as a proof-of-concept for the tool. My initial guess that I'd need 48 runtime functions grew a bunch as I noticed more differences between WASM semantics and Glulx semantics. I haven't decided how many of these are worth defining accelfuncs for, but the final count will be 100ish. Of these, it remains my expectation that the memory load/store functions will be, by far, the most worthwhile for interpreters to implement. I've not profiled anything yet, but based on the density of load/store instructions in the Adventure port (roughly one in ten), I'm guessing that a typical game will see roughly a 2x overall speedup if its load/store functions are accelerated.

hanna-kruppe commented 1 month ago

Is there a tentative list of the functions you're currently considering for acceleration? A review phase before committing to them might be useful. There is a balancing act between how useful a given accelfunc is vs. how much work it is for interpreters to implement, test, and eventually execute accelerated functions. Outlining a very simple instruction sequence into a function just so it could get accelerated may be a bad trade-off:

It's a net performance loss (vs. inline code) on every interpreter that doesn't implement that specific accelfunc.
Even with acceleration, it might not be any faster because it's hard to make an function call as fast as simpler opcodes, even if the call gets diverted to an accelerated implementation.

In extreme cases, for something like i64 bitwise operations, it might not even save any code size vs. inlining. Two @bit{op} sp sp sp is 2x3 bytes, while every function call needs at least six bytes unless the callee address fits in a byte. And that's before accounting for any contortions required to return more than one 32-bit word from the function.

dfoxfranke commented 1 month ago

Stuff that's practical to inline is generally already inlined, but because of an architectural choice I made in order to make wasm2glulx simpler to write and maintain, even functions as simple as an i64 bitwise-and generally need a runtime call. That's because wasm2glulx never assigns anything to local variables in Glulx that aren't already assigned to local variables by the WASM. If there's something on WASM's stack which didn't get there from a local.get instruction, then it won't have a frame address in the Glulx code either. That means that if I need to bitand two 64-bit values on the stack, then I need to operate on the first and third words on the stack, and the second and fourth words, and doing this without a function call would require a messy sequence of stkroll instructions which would probably be slower than the call. So basically, there's no case in which I'm ever assigning something to a runtime function just for the sake of possibly being able to accelerate it.

Note, though, that I do optimize sequences of WASM instructions along the lines of

local.get 0
local.get 1
i32.add
local.set 2

This will compile to one Glulx instruction and not four.

hanna-kruppe commented 1 month ago

I understand what you're saying. I might have more thoughts on it after looking more into how wasm2glulx works (but I'll take those to the bedquilt repository since it doesn't really belong here). Even taking the current wasm2glulx architecture as given, I would still caution against including basically everything that wasm2glulx currently emits as a function call, for several reasons:

As interpreter author, I'd like to help wasm2glulx code run fast but I don't love the prospect of going through 100ish functions of wildly varying importance and either having to implement and test all of them, or having to make judgement calls about which ones are worth the trouble. It's much easier if you give me a curated list of those functions which empirically make a significant difference for real games. (This is also how the Inform-inspired accelfuncs were found: the spec doesn't include every Inform 6 veneer function.)
The more non-orthogonal functions are included, the more likely it seems that some of them become obsolete over time. It would be a shame to implement something in a bunch of interpreters and later slap a "some old files may call this function but it's probably not worth implementing any more" disclaimer on that part of the spec. (Even if these functions would not be specified directly in the main Glulx specification.)
Even beyond wasm2glulx, I think it would be great to have standardized and accelerated ways to spell operations like 32-bit clz/ctz/popcount, 32x32->64 multiplies, and little-endian memory accesses. These operations are very useful for a bunch of low-level algorithms and data structures but the cost of emulating them with the existing opcodes often seems too large to be worth the trouble. So I'd really like to see accelfuncs for these get specified and widely implemented. I fear that bundling them with so many other accelfuncs increases the odds that they'll get overlooked and not implemented in some interpreters.

As an example, let's consider integer comparisons. wasm2glulx currently has 22 distinct runtime functions for integer comparisons, with names matched by the regex i(32|64)_(eqz?|ne|[gl][et]_[su]). There may be good reasons why wasm2glulx currently has all of these, but it's far from obvious to me how useful and long-lived each of them will be. First, when the 32-bit comparisons feed into a branch, the best translation is to directly use Glulx's conditional jump opcodes. Indeed wasm2glulx seems to have some support for this, so it's not clear how important accelerating these functions is today. Second, for all I know, future versions of wasm2glulx could start emitting inline code for the 32-bit comparisons that don't feed into a branch. Third, I'm pretty sure eqz can be made redundant without any non-trivial changes to wasm2glulx. The 32-bit version never saves anything compared to eq(x, 0) (encoded as @callfii i32_eq src 0 dest). The 64-bit version could be spelled as i32_eq(low_half | high_half, 0), which would never be significantly worse and would be a win if the result is branched on (bitor + jz instead of call + jz) and on interpreters without acceleration for that function.

Again, I don't want to dig too deeply into how & why wasm2glulx works. I only want to highlight that there's a marginal cost and risk for every additional accelfunc, so I'd appreciate more curation. Even if you want to have a specification for all runtime functions that wasm2glulx currently uses, it would be useful to have a clear split between the ones that are really useful for interpreters to accelerate, and the ones that could be safely ignored. (But if I end up convincing you to have fewer runtime functions, I'd be happy to help brainstorm about which ones can be removed with little to no cost.)

dfoxfranke commented 1 month ago

I'll definitely consider have the spec include some sort of tier list. Tier 1 would be "these functions will substantially benefit practically every wasm2glulx program", which would include the memory/byteswap-related functions and little else. Tier 2 would be "these functions are broadly useful and can be accelerated substantially, but are unlikely to be hotspots, so the overall benefit for most games will be small". This would include clz/ctz/popcnt, 64-bit multiplication and division, and maybe a few other arithmetic functions. Tier 3 would be everything else.

The more non-orthogonal functions are included, the more likely it seems that some of them become obsolete over time. It would be a shame to implement something in a bunch of interpreters and later slap a "some old files may call this function but it's probably not worth implementing any more" disclaimer on that part of the spec. (Even if these functions would not be specified directly in the main Glulx specification.)

I won't include anything that seems like it has any chance of obsolescence. I don't think very much does, considering that most of these functions are 1:1 with some particular WASM instruction.

First, when the 32-bit comparisons feed into a branch, the best translation is to directly use Glulx's conditional jump opcodes. Indeed wasm2glulx seems to have some support for this, so it's not clear how important accelerating these functions is today.

Correct: when a 32-bit comparison is immediately followed by a br_if or if..else instruction in the input, this compiles directly to an appropriate conditional jump opcode. The runtime functions are only called when there's no immediately-subsequent branch instruction. And these runtime functions are so short and simple that I won't be surprised if an accelerated version ends up being slower, so I probably won't specify accelfuncs for these at all. If I do, they'll clearly be tier 3.