implement the `L25` intermediate language

zerbina commented 1 month ago

Summary

Add the L25 language and integrate it into the pass pipeline. The L25 features a flat procedure structure and goto-esque control-flow, without using a basic block structure and SSA yet. It comes after L30 and before L4.

Details

The new IL is planned to be the first in a series of ILs that all use flat procedure bodies with goto-esque control-flow constructs. This structure works well for the stage during compilation where data- and control-flow analysis is needed, but where the live range(s) of locals can still change.

Future passes that are currently planned to use this structure are: borrow checking, cursor inference, move analysis, destructor injection, and inlining.

Implementation

pass30 is effectively split into two passes. Lowering the structured control-flow constructs (e.g., If, Case, Loop) stays in pass30, while the data-flow analysis and SSA transformation move to pass25 (without being modified).
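To illustrate what a flat body with goto-esque control flow looks like, here is a minimal Python sketch that flattens a structured If into labels and forward jumps. It is not the pass30 implementation; the tuple-based node shapes and the op names (`branch_if_not`, `goto`, `label`) are made up for the example.

```python
from itertools import count

# Hypothetical structured input, modelled as nested tuples:
#   ("stmt", text)                        a plain statement
#   ("if", cond, then_body, else_body)    bodies are lists of nodes

def lower(body, out=None, labels=None):
    """Flatten a structured body into a flat list of instruction tuples."""
    if out is None:
        out, labels = [], count()
    for node in body:
        if node[0] == "stmt":
            out.append(node)
        elif node[0] == "if":
            _, cond, then_body, else_body = node
            else_label = f"lbl{next(labels)}"
            end_label = f"lbl{next(labels)}"
            # Jump forward to the else part when the condition is false.
            out.append(("branch_if_not", cond, else_label))
            lower(then_body, out, labels)
            out.append(("goto", end_label))
            out.append(("label", else_label))
            lower(else_body, out, labels)
            out.append(("label", end_label))
        else:
            raise ValueError(f"unknown node kind: {node[0]}")
    return out

flat = lower([
    ("stmt", "a"),
    ("if", "c", [("stmt", "b")], [("stmt", "d")]),
    ("stmt", "e"),
])
for instr in flat:
    print(instr)
```

All jumps produced this way point forward, which keeps the flat body easy to scan in a single pass.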

Most of the pass25 tests are former pass30 tests (those concerning data-flow analysis and SSA transformation); the rest are new tests covering the basic L25 to L4 translation.

To-Do

[x] add more tests. All IL features should be covered
[x] implement Except support in pass7

Notes for Reviewers

This PR is part of the pass rework project.

zerbina commented 1 month ago

I'm going to hold off on merging this addition until the listed passes are actually needed. While I think the language is a solid choice for the stated purpose, it's possible that there's a better approach that I haven't yet considered, so I'm not rushing to add the IL (just to remove it again later, should a better alternative emerge).

The first of the listed passes that I think will be needed is drop injection.

zerbina commented 1 month ago

For making sure the pass is working correctly, and to also get some feedback on performance, I've run pass7 on L7 code translated from the fully processed MIR produced by the NimSkull compiler for the repl.nim program (~2.5MB of packed nodes).

Besides uncovering multiple issues with the data-flow analysis (which are fixed by #54), this also showed that the pass takes up too much time. In a normal debug build, the pass takes 30 seconds (!) to process all procedures, while in -d:release mode, it takes 4 seconds. Considering that repl.nim is only a small- to medium-sized program (~1000 procedures), this is far too much time.

Reasons for Slowness

In order of significance:

1. Issue with Changeset.replace. The changeset's node sequence is, for some reason, not moved into the builder, resulting in a full copy of the whole sequence.
2. PackedSet is slow. Or at least the operations relevant to the data-flow analysis (i.e., union and intersection) are. In addition, a PackedSet instance itself has a very large static size (320 bytes!), ballooning the static size of BBlock to 688 bytes! This quickly adds up, especially since there are usually a lot of basic blocks in a single procedure.
3. Type lookup is slow. Looking up a type via its index requires skipping over all predecessor nodes in the tree, which takes longer the further the index is away from the start.
4. PackedSet.len is slow. Especially if only used to test whether the set is empty or not.

Number 1 is easy to fix, and 2 and 4 can be addressed by using a Table-based sparse set implementation. With the aforementioned three things fixed, the pass only takes ~550ms in release mode, which - while a lot better - is still too long.
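As a rough illustration of the table-based sparse set idea, here is a minimal Python sketch. It is not the implementation used in the PR: the class name, the 64-bit chunking, and the method names are all assumptions made for the example. The point is that only non-empty chunks are stored, so an empty set costs almost nothing and the emptiness test does not need to count elements.

```python
class SparseSet:
    """Sparse set of non-negative ints: maps a chunk index to a bit mask.

    Invariant: a chunk is only stored while its mask is non-zero.
    """
    BITS = 64  # hypothetical chunk width

    def __init__(self, items=()):
        self.chunks = {}
        for item in items:
            self.add(item)

    def add(self, item):
        key, bit = divmod(item, self.BITS)
        self.chunks[key] = self.chunks.get(key, 0) | (1 << bit)

    def __contains__(self, item):
        key, bit = divmod(item, self.BITS)
        return (self.chunks.get(key, 0) >> bit) & 1 == 1

    def is_empty(self):
        # Constant time thanks to the invariant; no element counting.
        return not self.chunks

    def union_with(self, other):
        for key, mask in other.chunks.items():
            self.chunks[key] = self.chunks.get(key, 0) | mask

    def intersect_with(self, other):
        for key in list(self.chunks):
            mask = self.chunks[key] & other.chunks.get(key, 0)
            if mask:
                self.chunks[key] = mask
            else:
                del self.chunks[key]  # keep the invariant

    def items(self):
        for key, mask in self.chunks.items():
            for bit in range(self.BITS):
                if (mask >> bit) & 1:
                    yield key * self.BITS + bit


live = SparseSet([1, 70, 1000])
live.intersect_with(SparseSet([70, 2000]))
print(sorted(live.items()), live.is_empty())
```

Union and intersection only touch the chunks that actually exist, which is what matters for data-flow analysis where most sets are small.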

Reducing the number of basic blocks (by combining them where possible), reducing the maximum number of variables live at the same time (by adding "end of storage duration" markers to the L7), some general optimization to the pass itself, as well as improving type lookup efficiency (possibly through using a skip list) should together be able to shrink the pass' run time to somewhere below 100ms, which seems acceptable.
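To make the type lookup cost concrete, here is a small Python sketch. It assumes a simplified packed layout in which each type occupies a run of slots and the first slot of the run stores the run's length; this layout and all function names are hypothetical. Instead of a real skip list, the sketch uses a simpler stand-in: a table that records the position of every `stride`-th type, so a lookup skips at most `stride - 1` types.

```python
def find_type_linear(nodes, index):
    """Walk from the start, skipping `index` types (cost grows with index)."""
    pos = 0
    for _ in range(index):
        pos += nodes[pos]  # first slot of each type holds its slot count
    return pos

def build_samples(nodes, stride):
    """Record the start position of every `stride`-th type."""
    samples = []
    pos, i = 0, 0
    while pos < len(nodes):
        if i % stride == 0:
            samples.append(pos)
        pos += nodes[pos]
        i += 1
    return samples

def find_type_sampled(nodes, samples, stride, index):
    """Jump to the nearest recorded type, then skip the remainder."""
    pos = samples[index // stride]
    for _ in range(index % stride):
        pos += nodes[pos]
    return pos

# Six types with lengths 3, 1, 2, 4, 1, 2 (payload slots shown as 0).
nodes = [3, 0, 0, 1, 2, 0, 4, 0, 0, 0, 1, 2, 0]
samples = build_samples(nodes, stride=2)
print(find_type_linear(nodes, 5), find_type_sampled(nodes, samples, 2, 5))
```

The trade-off is extra memory for the sample table and the need to rebuild or patch it whenever types are added.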

zerbina commented 1 month ago

Okay, Except support is now implemented and the fixes from main are merged. Some tests are missing, but otherwise the bulk of the work should be done.

I did quite a bit of testing with real-world code bases, and I'm now fairly certain that the structure and idea behind the L7 are the right choice. There are a few things that need to be changed (relative to the current state), namely:

"End of life" markers for locals (something like StorageEnd or StorageDead), in order to mark them as dead. Right now, locals have to be considered alive after their first write (or possible write), which leads to locals that have their address taken usually living much longer than necessary.
A Move operator. It functions similarly to Copy, with the addition of communicating that the source location is not used afterwards. An L7 Move would be translated directly to an L4 Move.
Requiring locals to be initialized prior to being used. (Applies to both the L10 and L7.) This would simplify some code, by removing the need for auto-spawning, and - more importantly - remove a case of undefined behaviour (i.e.: what's the content of an uninitialized local?). If the source language should allow using locals without initializing them first (as NimSkull does), there needs to be a separate pass that initializes the problematic locals with their type's default value.
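The effect of an "end of life" marker on live ranges can be sketched for the simplest case, a straight-line body without branches. This is only an illustration: the op names (`write`, `addr`, `use`, `storage_end`) and the `(op, local)` instruction shape are invented here and are not the L7's actual syntax.

```python
def live_ranges(body):
    """Return {local: (first_write, last_live_position)} for a
    straight-line body given as a list of (op, local) pairs."""
    ranges = {}
    last = len(body) - 1
    for pos, (op, local) in enumerate(body):
        if op == "write" and local not in ranges:
            # Without a marker, the local must be assumed alive
            # from its first write until the end of the body.
            ranges[local] = (pos, last)
        elif op == "storage_end" and local in ranges:
            # The marker ends the live range explicitly.
            ranges[local] = (ranges[local][0], pos)
    return ranges

without_marker = [("write", "a"), ("addr", "a"), ("write", "b"), ("use", "b")]
with_marker = [("write", "a"), ("addr", "a"), ("write", "b"),
               ("storage_end", "a"), ("use", "b")]
print(live_ranges(without_marker))
print(live_ranges(with_marker))
```

In the second body, `a` stops being live at the marker even though its address was taken, so fewer locals are live at the same time.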

This PR is only concerned with splitting up pass10, so the changes should happen via follow-up PRs.

zerbina commented 3 weeks ago

The language is somewhat of a reinvention of NimSkull's MIR, but without its problems and unnecessary complexity. Most notably:

there is no finally. Finally sections are a major source of complexity in the MIR, also being the reason why target lists (another source of complexity) have to exist. They complicate the MIR's structure, the data-flow analysis, and especially code generation. It makes much more sense to lower try/finally early (into try/except and block), instead of just prior to code generation, where doing so is much harder and more complex.
there's no structured if. The idea was to keep some structure in order to ease code generation, effectively making it a workaround for shortcomings of the C/JS code generators. The L25 allows arbitrary branching control-flow (as long as it points forward and doesn't cross into loops).
expressions can be nested. Not allowing rvalue expression nesting does make some processing simpler, but it also introduces overhead (larger IR size, more work in some cases). The current rules are likely going to be narrowed down in the future.
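The branching restriction mentioned above (jumps must point forward and must not cross into loops) can be expressed as a small check over a flat body. The following Python sketch assumes a made-up instruction encoding: `("loop",)` and `("end",)` delimit a loop, `("label", name)` marks a target, and `("goto", name)` or `("branch_if_not", cond, name)` jump to one. None of this is the L25's real syntax.

```python
def check_branches(body):
    """Return a list of (position, reason) for every invalid jump."""
    loop_stack, next_id = [], 0
    stack_at, label_pos = [], {}
    for pos, instr in enumerate(body):
        if instr[0] == "loop":
            loop_stack.append(next_id)
            next_id += 1
        elif instr[0] == "end":
            loop_stack.pop()
        elif instr[0] == "label":
            label_pos[instr[1]] = pos
        stack_at.append(tuple(loop_stack))  # loops enclosing this position

    errors = []
    for pos, instr in enumerate(body):
        if instr[0] in ("goto", "branch_if_not"):
            target = label_pos[instr[-1]]
            if target <= pos:
                errors.append((pos, "backward jump"))
            elif stack_at[pos][:len(stack_at[target])] != stack_at[target]:
                # The target sits in a loop that does not enclose the jump.
                errors.append((pos, "jump into loop"))
    return errors

ok = [("goto", "a"), ("label", "a"),
      ("loop",), ("goto", "b"), ("end",), ("label", "b")]
bad = [("goto", "x"), ("loop",), ("label", "x"), ("end",)]
print(check_branches(ok), check_branches(bad))
```

Jumping out of a loop is accepted by this check, since the target's enclosing loops are a prefix of the jump's enclosing loops.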

nim-works / phy