iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Add mechanism for wider bit widths in VM imports/change methods/etc...? #8051

Closed (benvanik closed 2 years ago)

benvanik commented 2 years ago

Related to #8043 and having 64-bit offsets/indices.

Right now we need to decide bit widths when defining imports as part of the signature - imports like hal.buffer.load take an i32 offset as part of their signature. This means that even if the compiler generates i64 registers and the runtime has the i64 instructions present there's no way to call across the import boundary without splitting the i64s into i32s.

The easiest option is to make any import that could be 64-bit take a lo/hi register pair. A 32-bit module would pass 0 for the hi word and a 64-bit module would pass both parts. With separate registers, a 32-bit module can keep a single zero register (which it almost always has anyway) and reuse it for every hi word. There's the overhead of an additional argument (2 bytes on the instruction, a byte in the cconv, an extra loop in the marshaling, etc) but it uses all the same code paths as today.
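To make the lo/hi idea concrete, here is a minimal sketch in Python. The helper names and the `buffer_load` signature are hypothetical illustrations, not the real HAL import; the only thing taken from the proposal is that the offset travels as two 32-bit words and a 32-bit caller passes a shared zero for the hi word.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def split_i64(value: int) -> Tuple[int, int]:
    """Split an unsigned 64-bit value into (lo, hi) 32-bit words."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value does not fit in 64 bits")
    return value & MASK32, (value >> 32) & MASK32


def join_i64(lo: int, hi: int) -> int:
    """Recombine (lo, hi) 32-bit words into one unsigned 64-bit value."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def buffer_load(buffer: bytes, offset_lo: int, offset_hi: int, length: int) -> bytes:
    """Hypothetical import: the offset arrives as a lo/hi register pair."""
    offset = join_i64(offset_lo, offset_hi)
    if offset + length > len(buffer):
        raise IndexError("load out of range")
    return buffer[offset:offset + length]


data = bytes(range(16))

# A 32-bit caller reuses one shared zero register for every hi word.
ZERO_REGISTER = 0
small = buffer_load(data, 4, ZERO_REGISTER, 2)

# A 64-bit caller passes both halves of its natural i64 value.
lo, hi = split_i64(0x5_0000_0007)
```

The callee sees the same two-argument shape from both kinds of module, which is why no new code paths are needed; the cost is the extra argument on every such call.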

A refinement of that is to support i64 on the calls even when the i64 instructions are not present. Since the VM encodes i64 registers as two contiguous aligned i32 registers, the same base register ordinals would be valid regardless of whether they are interpreted as i32 or i64. A module compiled as 32-bit would pass rN, rN+1 and just have to ensure rN+1 is 0, while a 64-bit one would pass rN, rN+1 as the natural i64 value it has. This saves the overhead of the additional argument but possibly introduces additional register zeroing and complexity in the compiler to ensure registers routed to calls are set up even when i64 instructions don't exist.
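The register aliasing this relies on can be sketched as below. This is a toy model, not the VM's actual register file: the `RegisterFile` class is hypothetical, and it assumes little-endian storage so that rN holds the lo word and rN+1 the hi word.

```python
import struct


class RegisterFile:
    """Toy register file: i32 registers backed by one contiguous byte array."""

    def __init__(self, count: int) -> None:
        self.storage = bytearray(4 * count)

    def set_i32(self, ordinal: int, value: int) -> None:
        struct.pack_into("<I", self.storage, 4 * ordinal, value & 0xFFFFFFFF)

    def get_i32(self, ordinal: int) -> int:
        return struct.unpack_from("<I", self.storage, 4 * ordinal)[0]

    def set_i64(self, ordinal: int, value: int) -> None:
        if ordinal % 2:
            raise ValueError("i64 registers must start at an even ordinal")
        struct.pack_into("<Q", self.storage, 4 * ordinal, value & 0xFFFFFFFFFFFFFFFF)

    def get_i64(self, ordinal: int) -> int:
        if ordinal % 2:
            raise ValueError("i64 registers must start at an even ordinal")
        return struct.unpack_from("<Q", self.storage, 4 * ordinal)[0]


regs = RegisterFile(8)

# 32-bit module: only i32 writes exist, so it must zero rN+1 itself.
regs.set_i32(2, 4096)
regs.set_i32(3, 0)
offset_from_32bit_module = regs.get_i64(2)

# If rN+1 still held a stale value, the callee would read a wrong offset.
regs.set_i32(3, 1)
offset_with_stale_hi = regs.get_i64(2)

# 64-bit module: one i64 write fills both halves at the same base ordinal.
regs.set_i64(2, 0x5_0000_0007)
```

The stale-hi case is the extra compiler burden mentioned above: any register pair routed to a call has to be known-zero in its upper half, even when no i64 instruction ever touches it.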

Another alternative is to introduce a dynamic register size (z cconv encoding) that adapts; however, 32-bit modules then could not run on 64-bit devices. I hate it.

The 💥 option is to just always enable i64 support (remove the ability to disable in compiler and runtime). That's effectively a rewrite but may be something we want to do anyway - the bytecode binary format and dispatch loop need to be reworked regardless. The binary size addition of the i64 ops is relatively low, and if things like torch are always going to pass in i64 sizes and we need it anyway, it'd simplify things a lot.

powderluv commented 2 years ago

I like the "The 💥 option is to just always enable i64 support (remove the ability to disable in compiler and runtime). That's effectively a rewrite but may be something we want to do anyway" though I understand the cost.

benvanik commented 2 years ago

The balance is that what may be an inconvenience for one set of users (64-bit systems with 100GB+ of memory) can be a deal-breaker for others (32-bit systems with ~256KB of practically usable rw memory). We just have to cut carefully through that spectrum - totally possible with some more coffee and time :)

benvanik commented 2 years ago

Spent some time doing an analysis on enabling i64 unconditionally. On the size-optimized x64 MSVC build, enabling the i64 op set extension increases binary size by only 2.7KB (both on disk and in memory, with no other increases in post-load memory). The increase on a 32-bit build is probably larger (more instructions to emulate 64-bit) but not by an order of magnitude. The increase is limited to the bytecode dispatch loop because the vm list and buffer types were already designed to not cause type-based code duplication.

That feels worth it to enable by default. Float stuff is big and will remain as optional extensions.

benvanik commented 2 years ago

The code size increase on non-MSVC builds should be lessened by merging the i64 ext into the core op set, which removes the computed goto jump table the extension currently requires (2KB on its own in x64 builds).
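One plausible reading of the 2KB figure, as a back-of-the-envelope sketch: a computed goto table holds one code pointer per opcode slot. The 256-entry table size (one slot per value of a one-byte opcode) is an assumption here, not something stated in the thread.

```python
def jump_table_bytes(entries: int, pointer_bytes: int) -> int:
    """Size of a computed goto table: one code pointer per opcode slot."""
    return entries * pointer_bytes


# Assumption: the extension table has one slot per one-byte opcode value.
OPCODE_SLOTS = 256

x64_table = jump_table_bytes(OPCODE_SLOTS, pointer_bytes=8)
x86_table = jump_table_bytes(OPCODE_SLOTS, pointer_bytes=4)
```

Under that assumption the x64 table comes to 2048 bytes, which matches the 2KB quoted above, and a 32-bit build would spend half that.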

powderluv commented 2 years ago

woah cool. can't wait for this. Are there any other dependent changes that would be required?

benvanik commented 2 years ago

Should be pretty straightforward (🤞) - I'm going to need to break binary compatibility for that, but I just added a nice error that gets reported when trying to load vmfbs from before the change (#9133). Once i64 is enabled unconditionally I'll switch HAL module exports to use i64s for offsets and such, and that should be all we need for large buffers (to start). We'll still want a compiler flag for whether to emit 64-bit math for index types, as the stack requirements and file sizes are 2x larger if everything is 64-bit, but that'll just be a single compiler flag and all runtimes on all platforms will be able to run code compiled either way!