Use an intermediate representation format for GDScript

vnen commented 11 months ago

Describe the project you are working on

The GDScript implementation.

Describe the problem or limitation you are having in your project

GDScript is currently compiled when loaded, even in a release build. There are a few problems with this approach:

- It's slow to load scripts because the engine has to parse the script itself and all of its dependencies.
  - Since scripts are loaded individually, some scripts may have to be parsed multiple times as a byproduct of being a dependency of other scripts.
- It requires more type information in release builds to properly do all the checks and behave the same way as in debug.
- It exposes the plain-text code of the project in the released game (see #4220).

Describe the feature / enhancement and how it helps to overcome the problem or limitation

An intermediate representation (IR for short) can help solve these issues.

- It allows compiled scripts to be stored, including on export.
  - Therefore, the exported project only needs to keep the IR version, not the original source code.
- The IR doesn't need to be type-checked again, because that happened when producing it. So the release binaries can potentially be stripped of some type information.
- The IR is similar to machine code and thus harder to read.
  - While it is not impossible to recreate a GDScript source from the IR, some information is still lost, like comments and names of local variables/parameters, making the retrieved source harder to understand.

It also makes it possible to build an export template without the GDScript compiler, which can reduce the template's size and avoid potential exploits. This is optional, so people who use the compiler at release for dynamic scripts and modding support can still have it the way it is now (or a mix of the two).

There are a few potential drawbacks from this as well:

- If we eventually decide to precompile GDScript to machine code (AOT compilation), then this IR is pointless, as the machine code supersedes it.
  - However, since this seems to be some time away, having the IR now will be useful for a while.
- It's possible to have bugs in the code that creates and reads the IR. So the IR might not be a faithful representation of the source script if something goes wrong.
  - It is also a bit more difficult to debug issues in release builds, since the source code is not present anymore.
  - The test suite can help mitigate this by checking if it behaves as expected.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Currently, GDScript is compiled to bytecode, which is later executed by the VM. This bytecode is not suitable for serialization, primarily because it contains a lot of pointers. Since the memory layout will likely be different when the executable runs again (especially on different machines), those pointers cannot be stored.

The plan is to include in the IR named references that can be reconstructed into the pointers. This includes global classes and function pointers which are used in the GDScript VM for fast access.
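As a rough illustration of the named-reference idea (not the actual implementation), the sketch below swaps in-memory function references for names when producing the IR and resolves them again on load. The `REGISTRY` table and the tuple-based "instructions" are assumptions made up for this example; the real engine lookup works differently.

```python
import math

# Hypothetical registry standing in for the engine's name -> native function lookup.
REGISTRY = {"sin": math.sin, "cos": math.cos, "sqrt": math.sqrt}
NAME_OF = {fn: name for name, fn in REGISTRY.items()}


def to_ir(bytecode):
    """Replace each in-memory function reference (a "pointer") with its name."""
    return [(op, NAME_OF[fn], arg) for (op, fn, arg) in bytecode]


def from_ir(ir):
    """Resolve each name back into a function reference valid for this process."""
    return [(op, REGISTRY[name], arg) for (op, name, arg) in ir]


def run(bytecode):
    """Execute the toy bytecode: every instruction is a call with one argument."""
    return [fn(arg) for (_op, fn, arg) in bytecode]


bytecode = [("call", math.sqrt, 16.0), ("call", math.cos, 0.0)]
ir = to_ir(bytecode)        # contains only names and plain values
restored = from_ir(ir)      # pointers are rebuilt at load time
```

Only `ir` would be written to disk in this sketch, since it no longer depends on the memory layout of the process that produced it.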

For each script when the project is exported, the process will go as follows:

- The script is loaded and parsed like it is now.
- The code generation step, which usually creates bytecode, will create the IR instead.
  - There is an abstraction for the code generator that can be swapped. So we can have source to bytecode like it is now in-editor and use a different generator for the export.
- The IR is stored in a file with a .gdc extension, in the same place as the .gd file.
  - This is the same extension used for the "bytecode" version in Godot versions before 4.0. Since that feature was removed, we can reuse the old extension for this.
- A remap is also stored, which uses the ResourceLoader system. So you can still load your .gd file and the remap will find the .gdc.
- Only the .gdc file will be exported; the .gd will not.
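To make the remap step concrete, here is a minimal sketch of the bookkeeping it implies. The function names `plan_export` and `resolve` are hypothetical, and this is not how the ResourceLoader remap system is actually implemented; it only shows that code can keep asking for the .gd path while the exported project ships the .gdc file.

```python
def plan_export(paths):
    """Return (remap, exported_files) for a list of project file paths."""
    remap = {}
    exported = []
    for path in paths:
        if path.endswith(".gd"):
            target = path[: -len(".gd")] + ".gdc"
            remap[path] = target      # the source path still resolves after export
            exported.append(target)   # only the IR file is shipped
        else:
            exported.append(path)
    return remap, exported


def resolve(path, remap):
    """Follow the remap if there is one, otherwise load the path as-is."""
    return remap.get(path, path)


remap, exported = plan_export(["res://player.gd", "res://icon.png"])
```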

For loading, the .gdc file will be read and fed to another code generator. This one will be very simple, as it will just be a matter of converting instructions from the IR into bytecode (which will follow a similar structure), including resolving all the pointers.

IR format

While I haven't yet fleshed out the format exactly (as I believe it's easier to do while implementing it), it will be somewhat like this:

- The file has a header that starts with a magic word of 4 bytes (spelling GDIR).
  - This is similar to what is done in other binary formats in the engine. It allows detecting corruption and avoids loading a random file as GDScript IR just because of the extension.
- Next, there is a bytecode version number stored.
  - This allows rejecting different versions and avoids creating spurious bytecode because the file was created in a different version of Godot, which would be prone to crashes.
  - It's not the same as the Godot version, since the IR could work fine in a different version if the bytecode didn't change. While it requires diligence to update the version when changing the bytecode, this happens somewhat rarely.
- There is a data section which will contain information referenced in the script.
  - This includes names of global things, references to other things, names of public properties of the script (including functions and signals), and anything else that could potentially be present multiple times in the IR.
  - It essentially avoids having to read strings in the instructions, and it reduces the file size by having each string only once in the file, even if referenced multiple times.
- There is a code section, which is split by functions (like it already is currently, as bytecode is only stored in each individual function).
  - This includes hidden functions that are generated (like implicit constructors for initializing class variables and _ready code for the @onready feature).
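Since the format is not fleshed out yet, the following is only a sketch of how a file with that overall shape could be written and read back. Everything beyond the GDIR magic word is an assumption for illustration: the little-endian 32-bit fields, the value of `IR_VERSION`, and the way strings and functions are length-prefixed.

```python
import struct

MAGIC = b"GDIR"
IR_VERSION = 1  # hypothetical bytecode version number


def write_ir(strings, functions):
    """strings: list of str; functions: list of (name_index, code_bytes)."""
    out = bytearray(MAGIC)
    out += struct.pack("<I", IR_VERSION)
    # Data section: every string is stored once and referenced by index.
    out += struct.pack("<I", len(strings))
    for text in strings:
        raw = text.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    # Code section: one entry per function.
    out += struct.pack("<I", len(functions))
    for name_index, code in functions:
        out += struct.pack("<II", name_index, len(code)) + code
    return bytes(out)


def read_ir(blob):
    if blob[:4] != MAGIC:
        raise ValueError("not a GDScript IR file")
    (version,) = struct.unpack_from("<I", blob, 4)
    if version != IR_VERSION:
        raise ValueError(f"unsupported IR version {version}")
    pos = 8
    (count,) = struct.unpack_from("<I", blob, pos)
    pos += 4
    strings = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        strings.append(blob[pos:pos + size].decode("utf-8"))
        pos += size
    (count,) = struct.unpack_from("<I", blob, pos)
    pos += 4
    functions = []
    for _ in range(count):
        name_index, size = struct.unpack_from("<II", blob, pos)
        pos += 8
        functions.append((name_index, blob[pos:pos + size]))
        pos += size
    return strings, functions
```

Rejecting the file as soon as the magic word or the version does not match keeps the loader from ever turning unrelated bytes into bytecode.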

Things that can be accessed via index (like own properties, local variables, and function arguments) won't have their names stored in the data section and will use the index directly.

The instructions will have a similar structure to the bytecode. They'll have an opcode and a number of arguments. The arguments are encoded as "addresses", which can be either the regular bytecode addresses or special ones for the IR (such as getting the value of a constant or a function pointer). There is no break between instructions, since they will have a predictable length. All of this is stored as bytes which, if opened in a text editor, or even a hex editor, won't have anything recognizable beyond the data section.
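A small sketch of what "predictable length" can mean in practice: if each opcode has a fixed argument count, instructions can be packed back to back with no separators. The opcodes, the 1-byte opcode plus 32-bit arguments, and the kind/index split of an address are all invented for this example and do not reflect the real GDScript bytecode.

```python
import struct

# Hypothetical opcodes and their fixed argument counts.
OP_ASSIGN, OP_ADD, OP_RETURN = 0, 1, 2
ARG_COUNT = {OP_ASSIGN: 2, OP_ADD: 3, OP_RETURN: 1}

# Hypothetical address kinds: the upper 8 bits select the kind, the lower 24 the index.
KIND_STACK, KIND_CONSTANT, KIND_FUNCTION = 0, 1, 2


def addr(kind, index):
    return (kind << 24) | index


def split_addr(address):
    return address >> 24, address & 0xFFFFFF


def encode(instructions):
    out = bytearray()
    for opcode, *args in instructions:
        if len(args) != ARG_COUNT[opcode]:
            raise ValueError("wrong argument count for opcode")
        out += struct.pack(f"<B{len(args)}I", opcode, *args)
    return bytes(out)


def decode(blob):
    pos = 0
    instructions = []
    while pos < len(blob):
        opcode = blob[pos]
        count = ARG_COUNT[opcode]  # the length is known from the opcode alone
        args = struct.unpack_from(f"<{count}I", blob, pos + 1)
        instructions.append((opcode, *args))
        pos += 1 + 4 * count
    return instructions


program = [
    (OP_ADD, addr(KIND_STACK, 0), addr(KIND_STACK, 1), addr(KIND_CONSTANT, 0)),
    (OP_RETURN, addr(KIND_STACK, 0)),
]
blob = encode(program)
```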

If this enhancement will not be used often, can it be worked around with a few lines of script?

It will be used in almost every exported project, as it brings benefits to pretty much all of the cases.

Is there a reason why this should be core and not an add-on in the asset library?

It is a core part of GDScript and is not project specific, since it will be used by pretty much all projects.

dsnopek commented 11 months ago

Great idea!

I think this would also be really useful in debugging issues when working on the engine.

For example, in working on GDExtension issues with ptrcalls, I really wished I could have seen the GDScript bytecode in my test scripts, so I could tell which function calls were actually being emitted as ptrcalls, and which weren't, because it was sometimes difficult to tell which it would actually do just from looking at the source code. (Side note: GDScript no longer makes ptrcalls, but I'm sure some other similar issue could come up in the future.)

nlupugla commented 11 months ago

Cool idea :)

My main concern is this point you highlighted:

It's possible to have bugs in the code that creates and reads the IR. So the IR might not be a faithful representation of the source script if something goes wrong. This is also a bit more difficult to debug issues in release builds, since the source code is not present anymore. The test suite can help mitigate this by checking if it behaves as expected.

It would be nice if the IR was involved in the normal compilation pipeline so that it would be impossible for the representation to be unfaithful. I'm not sure how that would work exactly, but I know there are compilers out there that transform to an IR as a step before generating the final machine code.

Mickeon commented 11 months ago

If we eventually decide to precompile GDScript to machine code (AOT compilation) then this IR is pointless as the machine code supersedes it.

I have to admit AOT compilation is quite exciting... But, since you already seem to have a general grasp on what to do, I trust it would be better for this proposal to be implemented and solve the mentioned issues much sooner.

AThousandShips commented 11 months ago

To me, a good balance between IR and AOT would be important, as I feel there's a risk that if we rely on AOT too much for performance in exported projects, we make debugging and scene testing difficult and laborious. With optimization steps done on the IR ahead of AOT (or without it, for example when running the project in the editor), you still gain some degree of performance improvement. With just AOT and simple optimizations, however, you can get a major difference between performance in testing and in export, forcing projects that push the boundaries on performance to re-export every time they want to test, even if they're not interested in specifically testing export-level performance.

I don't see IR and AOT as mutually exclusive. Quite the opposite: I find it a good step to improve it.

For one, having optimization on IR allows us to rely less on competent optimization for AOT, allowing us to use a far simpler, bare-bones compiler that we can even bundle in the engine. This would greatly help users who are daunted by setting up a compiling environment, especially a cross-compiling one. We can then allow using an external, more competent compiler for those who set it up.

AThousandShips commented 11 months ago

This would also allow us to filter out blocks like if Engine.is_editor_hint(): without messing with the source

vnen commented 11 months ago

@AThousandShips IR itself does not imply any kind of code optimization. While I do think GDScript would benefit greatly from optimization passes, IR is not a requirement for that.

AThousandShips commented 11 months ago

I agree, didn't say it does, but it's a useful tool for it, it allows more manageable optimization than machine code, and allows doing it on the exported code, having persistent optimized code, avoiding having to do that every time the source is parsed

Machine code also makes things a lot harder to grasp, with jumps and similar, as opposed to a more coherent, structured data format that is also more manageable to mutate.

As contrasted with AOT for runtime improvements when running from editor, etc.

So yes, I'm well aware thank you 🙃, and thought the aspects specific to IR Vs the non-persistent machine code was obvious as the point of my comments

nonchip commented 11 months ago

This would also allow us to filter out blocks like if Engine.is_editor_hint(): without messing with the source

@AThousandShips oh if we go as far as to treat that metaprogrammy, i'd rather not rely on what looks like a runtime function call tho (assert not starting with an @ is bad enough :P).

how about some fancy decorators like @editor or @runtime to specify 2 different declarations/codepaths/... for a thing depending on who loads it?

like eg this:

@tool
extends Node
@runtime var a = 5

@runtime func _ready():
  print("Hello, World!")

func _process(_delta):
  @editor:
    _handle_my_gizmos()
  @runtime:
    a += 1

@editor func _some_callback_for_a_plugin():
  # do some expensive stuff that doesn't need to go into the final product
  pass

where the editor would load this:

extends Node

func _process(_delta):
  _handle_my_gizmos()

func _some_callback_for_a_plugin():
  # do some expensive stuff that doesn't need to go into the final product
  pass

while the runtime would load this:

extends Node
var a = 5

func _ready():
  print("Hello, World!")

func _process(_delta):
  a += 1

but that feels more like a discussion for an additional/followup proposal. just wanted to give my 2 cents before gdscript learns to magically remove anything mentioning that engine hint :P

unless of course you are talking about introducing constexpr in general (and then folding the result), in which case GIMME :D

Mickeon commented 11 months ago

What you're suggesting is entirely unrelated to this proposal, but it has been similarly proposed in the past already:

SysError99 commented 10 months ago

This technique also opens up another way to wire up simpler interface registrations exclusively for any scripting languages that don't need string->address methods to recognise native interfaces (notably, GDExtension) and GDScript will be a great candidate. This also helps greatly for export binaries that don't need them, significantly reducing their size, especially on platforms where binary size matters, such as HTML5. String labels still exist in the editor and GDExtension because without them it's impossible for the GDScript language server to recognise and compile them, but in the GDScript-only release they will be removed.

vnen commented 10 months ago

@SysError99 not sure what "string labels" you're referring to. If it's about class and function names, this wouldn't be able to remove them.

The simplest example to show why they are necessary is any dynamic call:

extends Node
func _ready():
    $SomeNode.rotate(PI)

In this case the $SomeNode has an unknown type (it can be assumed to be Node but it can also be any of its derived classes) so the compiler can't tell what exactly rotate refers to. This is resolved at runtime with a call dispatch that requires the string to find the function. Same applies to properties/signals.
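The same situation can be mimicked outside the engine. In the Python sketch below (an analogy, not GDScript VM code), `dynamic_call` is a hypothetical helper that only learns the receiver's concrete type at runtime, so the only thing it can use to find the method is its name.

```python
class Node:
    pass


class Node2D(Node):
    def __init__(self):
        self.rotation = 0.0

    def rotate(self, radians):
        self.rotation += radians


def dynamic_call(obj, method_name, *args):
    # The concrete type of obj is unknown until this point, so the
    # method has to be looked up by its string name.
    method = getattr(obj, method_name, None)
    if method is None:
        raise AttributeError(f"{type(obj).__name__} has no method '{method_name}'")
    return method(*args)


node = Node2D()
dynamic_call(node, "rotate", 3.0)
```

If the name were stripped from the binary, the lookup inside `dynamic_call` would have nothing to search for.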

Nik4ant commented 10 months ago

extends Node
func _ready():
  $SomeNode.rotate(PI)

In this case the $SomeNode has an unknown type

50% related and 50% unrelated question: This script is attached to a certain node and this script is referencing a different node using a path $SomeNode - doesn't this mean that during compile time it should be possible to look up the type of SomeNode? (I'm not saying that such functionality exists in Godot right now, but this should be possible, right?)

The initial statement is still correct, a better way to illustrate it though:

extends Node

func foo() -> void:
    get_child(0).rotate(PI)

Here it's 100% impossible to know the type of the first child node

Also, as someone who doesn't know a lot about inner workings of gdscript compiler + VM I genuinely wonder:

This is resolved at runtime with a call dispatch that requires the string to find the function

1) Is it always the case? When the type is known, can the runtime just directly call/get/set whatever we want without dispatch?

If it's about class and function names, this wouldn't be able to remove them.

2) In theory, would it be possible to remove some of them by specifying a type or a set of restrictions/guarantees? (for example, via traits or any other options)

^ Not referring to the proposed IR idea, but rather asking in general

vnen commented 10 months ago

50% related and 50% unrelated question: This script is attached to a certain node and this script is referencing a different node using a path $SomeNode - doesn't this mean that during compile time it should be possible to look up the type of SomeNode? (I'm not saying that such functionality exists in Godot right now, but this should be possible, right?)

No, because the script does not know which scene it's attached to, and the same script could be attached to multiple scenes with different trees.

This is resolved at runtime with a call dispatch that requires the string to find the function

Is it always the case? When the type is known, can the runtime just directly call/get/set whatever we want without dispatch?

The thing is that it still needs to know what to call. To do so it needs to request the function from the ClassDB, which is done via string. This is cached when the GDScript is compiled if it is known, so it doesn't need to request at every call, but since it's a pointer it cannot be serialized. This will require the IR to still keep the names and request the pointers when compiling to proper bytecode, meaning the export template still needs the names.
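To illustrate the caching described here, the sketch below does a single string lookup when a call is compiled and reuses the result on every call. `FakeClassDB` and `compile_call` are hypothetical stand-ins written for this example, not the engine's ClassDB API.

```python
class FakeClassDB:
    """Hypothetical stand-in for ClassDB: maps (class, method) names to callables."""

    def __init__(self):
        self.lookups = 0
        self._methods = {
            ("Node2D", "rotate"): lambda rotation, radians: rotation + radians,
        }

    def get_method(self, class_name, method_name):
        self.lookups += 1
        return self._methods[(class_name, method_name)]


def compile_call(db, class_name, method_name):
    # One string lookup when the call is compiled; the result plays the
    # role of the cached pointer, which is only valid in this process.
    method = db.get_method(class_name, method_name)

    def call(*args):
        return method(*args)

    return call


db = FakeClassDB()
ir_reference = ("Node2D", "rotate")    # what an IR file could store: names only
rotate = compile_call(db, *ir_reference)
results = [rotate(0.0, 0.5), rotate(0.5, 0.5), rotate(1.0, 0.5)]
```

The names in `ir_reference` are still needed each time the IR is turned into bytecode, which is why they have to remain available in the export template.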

If it's about class and function names, this wouldn't be able to remove them.

In theory, would it be possible to remove some of them by specifying a type or a set of restrictions/guarantees? (for example, via traits or any other options) ^ Not referring to the proposed IR idea, but rather asking in general

Again, no, for the same reasons as in the previous point.

We could potentially remove strings by replacing them with indices by putting the information in an array instead of a map, assuming those indices are known at compile time. This would require an overall refactor of core code and would break all GDExtensions. The main issue with this is making sure that the functions are never reordered, as this would break compatibility (there might be ways to validate this automatically, but it's one extra burden for contributors).

This cannot be done effectively because GDScript is still mainly a dynamically typed language. It can't really know the index in advance in most cases, so it has to request via strings and those would have to be present on the export template anyway.
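The reordering hazard with index-based lookup can be shown with a toy method table. The method lists and the `check_append_only` validation are made up for this example; the latter is just one possible automatic check, under the assumption that new methods may only be appended.

```python
# Method table of a hypothetical class in two engine versions.
METHODS_V1 = ["rotate", "translate", "scale"]
METHODS_V2 = ["rotate", "look_at", "translate", "scale"]  # new method inserted in the middle

# A script compiled against V1 stores the index instead of the name.
compiled_index = METHODS_V1.index("translate")

# Running against V2, the same index silently refers to another method.
called_in_v2 = METHODS_V2[compiled_index]


def check_append_only(old_methods, new_methods):
    """Existing indices stay valid only if old entries keep their positions."""
    return new_methods[: len(old_methods)] == old_methods
```

A name-based lookup would still find `translate` in both versions, at the cost of keeping the string around.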

Also note that the engine is not compiled on export; it is distributed pre-compiled (as export templates). So we cannot strip the strings from this compiled binary, even if we were to extract the subset of the used types. It would require recompilation of the template itself.

SysError99 commented 10 months ago

@vnen Essentially, it's not a direct reference to those calls like in statically typed programming languages, but rather using a much shorter form of label (in this case, a number) instead of a string. In this implementation, all known strings will have a central index that acts like a string map instead of using the full bytes of the string at compile time. Let's say we put these in the editor's executable; we have five strings in common that are used in the GDScript language server:

[
    "rotate",
    "size",
    "position",
    "scale",
    "radius",
]

After being converted at compile time, these become just indices. We will use UPPERCASE naming to indicate that these are just index numbers.

[
    ROTATE,
    SIZE,
    POSITION,
    SCALE,
    RADIUS,
]

When the script gets "transpiled" for the release, it will use these indices instead of strings, which is why strings are not required in the release.

The serious limitation of this implementation is that we still need "some" strings on Godot's side, because it's virtually impossible to remap strings back to the much shorter form (an index number). Plus, this implementation breaks all string-based wiring in the script, and many functions need it, given that native calls aren't Callables. Without any new syntax to help, it's impossible to implement them reliably.

Mickeon commented 10 months ago

Nothing has been done. It's all theoretical right now.
