godotengine / godot-proposals

Godot Improvement Proposals (GIPs)
MIT License

Use an intermediate representation format for GDScript #8605

Open vnen opened 11 months ago

vnen commented 11 months ago

Describe the project you are working on

The GDScript implementation.

Describe the problem or limitation you are having in your project

GDScript is currently compiled when loaded, even in a release build. There are a few problems with this approach:

Describe the feature / enhancement and how it helps to overcome the problem or limitation

An intermediate representation (IR for short) can help solve those issues.

It also makes it possible to build an export template without the GDScript compiler, which can reduce the template's size and avoid potential exploits. This is optional, so people who use the compiler at release for dynamic scripts and modding support can still have it the way it is now (or a mix of the two).

There are a few potential drawbacks from this as well:

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Currently, GDScript is compiled to a bytecode which is later executed by the VM. This bytecode is not suitable for serialization, primarily because it contains a lot of pointers. Since the memory layout will likely be different when the executable runs again (especially on different machines), those pointers cannot be stored.

The plan is to include in the IR named references that can be reconstructed into the pointers. This includes global classes and function pointers which are used in the GDScript VM for fast access.
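As a rough illustration of the named-reference idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `REGISTRY` table, the `to_ir`/`from_ir` helpers, and the use of JSON (the actual IR is meant to be binary). It only shows that live references are swapped for stable names on export and resolved back on load.

```python
import json

# Hypothetical stand-ins for native engine functions. In the engine these
# would be function pointers whose addresses differ from run to run.
def node_rotate(angle):
    return f"rotate({angle})"

def node_queue_free():
    return "queue_free()"

# Hypothetical registry mapping a stable name to the live function object.
REGISTRY = {
    "Node.rotate": node_rotate,
    "Node.queue_free": node_queue_free,
}

def to_ir(bytecode):
    """Replace each live function reference with its registry name."""
    names = {func: name for name, func in REGISTRY.items()}
    return json.dumps([[op, names[target], args] for op, target, args in bytecode])

def from_ir(serialized):
    """Resolve the stored names back into live function references."""
    return [(op, REGISTRY[name], args) for op, name, args in json.loads(serialized)]

# In-memory "bytecode": it holds direct references, so it cannot be stored as-is.
bytecode = [("CALL", node_rotate, [3.14]), ("CALL", node_queue_free, [])]

ir = to_ir(bytecode)       # contains only names and plain data
restored = from_ir(ir)     # references are rebuilt for the current process
```

After the round trip, `restored[0][1]` is the same function object as `node_rotate`, even though only the string `"Node.rotate"` was written out.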

For each script when the project is exported, the process will go as follows:

For loading, the .gdc file will be read and fed to another code generator. This one will be very simple, as it will just be a matter of converting instructions from the IR into bytecode (which will follow a similar structure), including resolving all the pointers.

IR format

While I haven't yet fleshed out the format exactly (as I believe it's easier to do while implementing it), it will be somewhat like this:

Things that can be accessed via index (like own properties, local variables, and function arguments) won't have a name stored in the data section and will use the index directly.

The instructions will have a similar structure to the bytecode. They'll have an opcode and a number of arguments. The arguments are encoded as "addresses" which can be either the regular bytecode addresses or special ones for the IR (such as getting the value of a constant or a function pointer). There is no separator between instructions since each one has a predictable length. All of this is stored as bytes which, if opened in a text editor, or even a hex editor, won't have anything recognizable beyond the data section.
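Since the format is not fleshed out yet, the following Python sketch is only one possible reading of that description. The opcodes, the argument counts, the 32-bit address width, and the idea of keeping the address kind in the top 8 bits are all assumptions, not part of the proposal. It shows how a decoder can walk the byte stream without separators when each opcode has a known argument count.

```python
import struct

# Hypothetical opcodes and their fixed argument counts.
OP_ASSIGN, OP_ADD, OP_RETURN = 0, 1, 2
ARG_COUNT = {OP_ASSIGN: 2, OP_ADD: 3, OP_RETURN: 1}

# Hypothetical address kinds, stored in the top 8 bits of a 32-bit address.
ADDR_STACK, ADDR_CONSTANT, ADDR_FUNCTION = 0, 1, 2
ADDR_BITS = 24
ADDR_MASK = (1 << ADDR_BITS) - 1

def address(kind, index):
    """Pack an address kind and an index into one 32-bit value."""
    return (kind << ADDR_BITS) | index

def encode(instructions):
    """Write each instruction as a 1-byte opcode plus little-endian 32-bit addresses."""
    out = bytearray()
    for opcode, *args in instructions:
        if len(args) != ARG_COUNT[opcode]:
            raise ValueError(f"opcode {opcode} expects {ARG_COUNT[opcode]} arguments")
        out += struct.pack(f"<B{len(args)}I", opcode, *args)
    return bytes(out)

def decode(data):
    """Read instructions back; the opcode alone tells how many bytes follow."""
    instructions, pos = [], 0
    while pos < len(data):
        opcode = data[pos]
        count = ARG_COUNT[opcode]
        raw = struct.unpack_from(f"<{count}I", data, pos + 1)
        args = [(a >> ADDR_BITS, a & ADDR_MASK) for a in raw]
        instructions.append((opcode, *args))
        pos += 1 + 4 * count
    return instructions

program = [
    (OP_ADD, address(ADDR_STACK, 0), address(ADDR_STACK, 1), address(ADDR_CONSTANT, 0)),
    (OP_RETURN, address(ADDR_STACK, 0)),
]
blob = encode(program)
decoded = decode(blob)
```

Here `decoded` lists each instruction as an opcode followed by `(kind, index)` pairs, so a loader could tell a stack slot from a constant or a function reference and resolve the latter by name from the data section.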

If this enhancement will not be used often, can it be worked around with a few lines of script?

It will be used in almost every exported project, as it brings benefits to pretty much all of the cases.

Is there a reason why this should be core and not an add-on in the asset library?

It is a core part of GDScript and is not project specific, since it will be used by pretty much all projects.

dsnopek commented 11 months ago

Great idea!

I think this would also be really useful in debugging issues when working on the engine.

For example, in working on GDExtension issues with ptrcalls, I really wished I could have seen the GDScript bytecode in my test scripts, so I could tell which function calls were actually being emitted as ptrcalls, and which weren't, because it was sometimes difficult to tell which it would actually do just from looking at the source code. (Side note: GDScript no longer makes ptrcalls, but I'm sure some other similar issue could come up in the future.)

nlupugla commented 11 months ago

Cool idea :)

My main concern is this point you highlighted:

It's possible to have bugs in the code that creates and reads the IR. So the IR might not be a faithful representation of the source script if something goes wrong. This is also a bit more difficult to debug issues in release builds, since the source code is not present anymore. The test suite can help mitigate this by checking if it behaves as expected.

It would be nice if the IR was involved in the normal compilation pipeline so that it would be impossible for the representation to be unfaithful. I'm not sure how that would work exactly, but I know there are compilers out there that transform to an IR as a step before generating the final machine code.

Mickeon commented 11 months ago

If we eventually decide to precompile GDScript to machine code (AOT compilation) then this IR is pointless as the machine code supersedes it.

I have to admit AOT compilation is quite exciting... But, since you already seem to have a general grasp on what to do, I trust it would be better for this proposal to be implemented and solve the mentioned issues much sooner.

AThousandShips commented 11 months ago

A good balance with IR and AOT to me would be important, as I feel there's a risk that if we rely on AOT too much for performance in exported projects we can run into making debugging and scene testing difficult and laborious. With steps of optimization done on IR ahead of AOT (or without it, for example when running the project in the editor) you still gain some degree of performance improvements, but with just AOT and simple optimizations you can get a major difference between the performance in testing and in export, forcing projects which push the boundaries on performance to re-export every time they want to test even if they're not interested in specifically testing export level performance.

I don't see IR and AOT as mutually exclusive; quite the opposite, I find it a good step to improve it.

For one, having optimization on IR allows us to rely less on competent optimization for AOT. That lets us use a far simpler, bare-bones compiler that we can even bundle in the engine, which would greatly help users who are daunted by setting up a compiling environment, especially a cross-compiling one. We can then allow using an external, more competent compiler for those who set it up.

AThousandShips commented 11 months ago

This would also allow us to filter out blocks like if Engine.is_editor_hint(): without messing with the source

vnen commented 11 months ago

@AThousandShips IR itself does not imply any kind of code optimization. While I do think GDScript would benefit greatly from optimization passes, IR is not a requirement for that.

AThousandShips commented 11 months ago

I agree, and I didn't say it does, but it's a useful tool for it: it allows more manageable optimization than machine code, and allows doing it on the exported code, so the optimized code is persistent and doesn't have to be regenerated every time the source is parsed.

Machine code also makes things a lot harder to grasp, with jumps and similar, as opposed to a more coherent structured data format that is also more manageable to mutate.

As contrasted with AOT for runtime improvements when running from editor, etc.

So yes, I'm well aware thank you 🙃, and thought the aspects specific to IR Vs the non-persistent machine code was obvious as the point of my comments

nonchip commented 11 months ago

This would also allow us to filter out blocks like if Engine.is_editor_hint(): without messing with the source

@AThousandShips oh if we go as far as to treat that metaprogrammy, i'd rather not rely on what looks like a runtime function call tho (assert not starting with an @ is bad enough :P).

how about some fancy decorators like @editor or @runtime to specify 2 different declarations/codepaths/... for a thing depending on who loads it?

like eg this:

@tool
extends Node
@runtime var a = 5

@runtime func _ready():
  print("Hello, World!")

func _process(_delta):
  @editor:
    _handle_my_gizmos()
  @runtime:
    a += 1

@editor func _some_callback_for_a_plugin():
  # do some expensive stuff that doesn't need to go into the final product

where the editor would load this:

extends Node

func _process(_delta):
  _handle_my_gizmos()

func _some_callback_for_a_plugin():
  # do some expensive stuff that doesn't need to go into the final product

while the runtime would load this:

extends Node
var a = 5

func _ready():
  print("Hello, World!")

func _process(_delta):
  a += 1

but that feels more like a discussion for an additional/followup proposal. just wanted to give my 2 cents before gdscript learns to magically remove anything mentioning that engine hint :P

unless of course you are talking about introducing constexpr in general (and then folding the result), in which case GIMME :D

Mickeon commented 11 months ago

What you're suggesting is entirely unrelated to this proposal, but it has been similarly proposed in the past already:

SysError99 commented 10 months ago

This technique also opens up another way to wire up simpler interface registrations exclusively for any scripting languages that don't need string->address methods to recognise native interfaces (notably, GDExtension), and GDScript will be a great candidate. This also helps greatly for export binaries that don't need them, reducing their size significantly, especially on platforms where binary size matters, such as HTML5. String labels still exist in the editor and GDExtension because without them it's impossible for the GDScript language server to recognise and compile them, but in the GDScript-only release they will be removed.

vnen commented 10 months ago

@SysError99 not sure what "string labels" you're referring to. If it's about class and function names, this wouldn't be able to remove them.

The simplest example to show why they are necessary is any dynamic call:

extends Node
func _ready():
    $SomeNode.rotate(PI)

In this case the $SomeNode has an unknown type (it can be assumed to be Node but it can also be any of its derived classes) so the compiler can't tell what exactly rotate refers to. This is resolved at runtime with a call dispatch that requires the string to find the function. Same applies to properties/signals.

Nik4ant commented 10 months ago

extends Node
func _ready():
  $SomeNode.rotate(PI)

In this case the $SomeNode has an unknown type

50% related and 50% unrelated question: This script is attached to a certain node and this script is referencing a different node using a path $SomeNode - doesn't this mean that during compile time it should be possible to look up the type of SomeNode? (I'm not saying that such functionality exists in Godot right now, but this should be possible, right?)

The initial statement is still correct, a better way to illustrate it though:

extends Node

func foo() -> void:
    get_child(0).rotate(PI)

Here it's 100% impossible to know the type of the first child node


Also, as someone who doesn't know a lot about inner workings of gdscript compiler + VM I genuinely wonder:

This is resolved at runtime with a call dispatch that requires the string to find the function

1) Is that always the case? When the type is known, can the runtime just directly call/get/set whatever we want without dispatch?

If it's about class and function names, this wouldn't be able to remove them.

2) In theory, would it be possible to remove some of them by specifying a type or a set of restrictions/guarantees? (for example, via traits or any other options)

^ Not referring to the proposed IR idea, but rather asking in general

vnen commented 10 months ago

50% related and 50% unrelated question: This script is attached to a certain node and this script is referencing a different node using a path $SomeNode - doesn't this mean that during compile time it should be possible to look up the type of SomeNode? (I'm not saying that such functionality exists in Godot right now, but this should be possible, right?)

No, because the script does not know which scene it's attached to, and the same script could be attached to multiple scenes with different trees.

This is resolved at runtime with a call dispatch that requires the string to find the function

  1. Is that always the case? When the type is known, can the runtime just directly call/get/set whatever we want without dispatch?

The thing is that it still needs to know what to call. To do so it needs to request the function from the ClassDB, which is done via string. This is cached when the GDScript is compiled if it is known, so it doesn't need to request at every call, but since it's a pointer it cannot be serialized. This will require the IR to still keep the names and request the pointers when compiling to proper bytecode, meaning the export template still needs the names.
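To make the difference concrete, here is a small Python sketch of the two call paths. The `CLASS_DB` dictionary and the `compile_call` helper are hypothetical stand-ins, not the engine's ClassDB API. The point it illustrates is that both paths start from the method name as a string: the typed path just resolves it once and keeps the reference, and that reference is what cannot be written to disk.

```python
import math

class Node2D:
    def __init__(self):
        self.rotation = 0.0

    def rotate(self, angle):
        self.rotation += angle

# Hypothetical ClassDB stand-in: methods are requested by class name and method name.
CLASS_DB = {"Node2D": {"rotate": Node2D.rotate}}

def compile_call(static_type, method_name):
    """Return a callable emulating what a compiler could emit for a method call."""
    if static_type is not None:
        # Type known: request the method once, by string, and cache the reference.
        cached = CLASS_DB[static_type][method_name]
        return lambda obj, *args: cached(obj, *args)
    # Type unknown: look the method up by name on every call.
    return lambda obj, *args: CLASS_DB[type(obj).__name__][method_name](obj, *args)

node = Node2D()
typed_call = compile_call("Node2D", "rotate")   # string used once, at compile time
dynamic_call = compile_call(None, "rotate")     # string used at every call

typed_call(node, math.pi)
dynamic_call(node, math.pi)
```

In both cases the name `"rotate"` has to exist somewhere the loader can reach it, which is why the IR would keep names and the export template would still ship them.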

If it's about class and function names, this wouldn't be able to remove them.

  2. In theory, would it be possible to remove some of them by specifying a type or a set of restrictions/guarantees? (for example, via traits or any other options) ^ Not referring to the proposed IR idea, but rather asking in general

Again, no, for the same reasons as in the previous point.

We could potentially remove strings by replacing them with indices by putting the information in an array instead of a map, assuming those indices are known at compile time. This would require an overall refactor of core code and would break all GDExtensions. The main issue with this is making sure that the functions are never reordered, as this would break compatibility (there might be ways to validate this automatically, but it's one extra burden for contributors).

This cannot be done effectively because GDScript is still mainly a dynamically typed language. It can't really know the index in advance in most cases, so it has to request via strings and those would have to be present on the export template anyway.
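The reordering hazard with index-based lookup can be shown with a toy Python sketch. The two method tables and the helper names are made up for illustration and do not reflect how ClassDB stores methods.

```python
# Hypothetical method table as shipped in one engine version.
METHODS_V1 = ["rotate", "translate", "scale"]
# The same table in a later version, where a method was inserted at the front.
METHODS_V2 = ["look_at", "rotate", "translate", "scale"]

def compile_to_index(method_name, table):
    """At export time, replace the method name with its position in the table."""
    return table.index(method_name)

def resolve_index(index, table):
    """At load time, only the index is available, so it is trusted blindly."""
    return table[index]

compiled = compile_to_index("rotate", METHODS_V1)

same_version = resolve_index(compiled, METHODS_V1)   # "rotate"
newer_version = resolve_index(compiled, METHODS_V2)  # "look_at": wrong method, no error
```

Nothing fails loudly in the second case, which is why the ordering would have to be frozen or validated automatically if indices were ever used this way.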

Also note that the engine is not compiled on export; it is distributed pre-compiled (as export templates). So we cannot strip the strings from this compiled binary, even if we were to extract the subset of the used types. It would require recompilation of the template itself.

SysError99 commented 10 months ago

@vnen Essentially, it's not a direct reference to those calls like in statically typed programming languages, but rather using a much shorter form of label (in this case, a number) instead of a string. In this implementation, all known strings will have a central index that acts like a string map instead of using the full bytes of the string at compile time. Let's say that, in the editor's executable, we have five common strings that are used in the GDScript language server:

[
    "rotate",
    "size",
    "position",
    "scale",
    "radius",
]

After being converted at compile time, each of these becomes just an index. We will use UPPERCASE naming to indicate that these are just index numbers.

[
    ROTATE,
    SIZE,
    POSITION,
    SCALE,
    RADIUS,
]

When the script gets "transpiled" for the release, it will use these indexes instead of strings, which is why strings are not required in the release.

The serious limitation of this implementation is that we still need "some" strings for Godot's side, because it's virtually impossible to remap strings back to the much shorter form (an index number). Plus, this implementation breaks all string-based wirings in the script, and so many functions need them, given that native calls aren't Callables. Without any new syntax to help, it's impossible to implement them reliably.
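A toy Python sketch of this scheme and of the limitation just described. The `transpile` and `release_call` helpers are hypothetical; only the five names come from the list above. Names that are visible at export time can be rewritten to indices, but a name that only exists as a string at runtime has nothing left to be matched against.

```python
# The five known names from the example, in table order.
NAMES = ["rotate", "size", "position", "scale", "radius"]
INDEX_OF = {name: index for index, name in enumerate(NAMES)}

def transpile(names_in_script):
    """At export time, replace every statically visible name with its index."""
    return [INDEX_OF[name] for name in names_in_script]

def release_call(target):
    """Release-side dispatch: only indices are understood, the string table is gone."""
    if isinstance(target, int):
        return f"dispatch to method #{target}"
    raise LookupError(f"no string table left to resolve {target!r}")

indices = transpile(["rotate", "scale"])
static_result = release_call(indices[0])

# A name assembled at runtime never went through transpile().
runtime_name = "ro" + "tate"
try:
    release_call(runtime_name)
    dynamic_failed = False
except LookupError:
    dynamic_failed = True
```

The last call fails because the release side has no way to map the string back to index 0 without shipping the table, which is exactly the set of strings the scheme was trying to remove.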

Mickeon commented 10 months ago

Nothing has been done. It's all theoretical right now.