Add support for multidimensional Arrays

vpellen commented 1 year ago

Describe the project you are working on

This is applicable to almost every project I have ever embarked on. Particularly relevant for procedural generation and grid-based games.

Describe the problem or limitation you are having in your project

Trying to deal with large multidimensional datasets is exceptionally tedious. One has to either sacrifice efficiency (by having arrays of arrays or using a dictionary) or readability (by having a 1D array and indexing it manually).

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Multidimensional arrays would improve overall readability and possibly performance when dealing with large multidimensional datasets.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

The most critical question is how many types there should be and how they should be named and arranged. I imagine the most trivial solution would be to add a series of new fixed-dimension types such as Array2D, Array3D, PackedByteArray4D, and so on. The main concern for such types is what the upper limit would be - 4 dimensions is probably a healthy practical limit (so you can have an extra level of dimensionality for ostensibly 3D datasets).

The alternative is to give each array a "dimension count" property. This would facilitate higher levels of dimensionality, but might make things needlessly complex, especially in terms of indexing. Perhaps a MultiArray class could serve to be a unique type distinct from the base 1-dimensional array types, but you'd still face the issues of indexing and complexity.

For indexing style, CSV within brackets (arr[y, x]) is likely optimal; Sequential brackets (arr[y][x]) can cause some ambiguity in cases with nested containers.

For resizing, the internal structure of these arrays is likely to be a simple array of consecutive values, so a resize that preserved the original array's contents would be slow by design, but in my experience large multidimensional datasets tend to remain fairly fixed.

If this enhancement will not be used often, can it be worked around with a few lines of script?

Multidimensional arrays are a frequent occurrence, and while manual indexing of a 1D array can be done via fairly trivial multiplication, it can quickly make things cluttered, especially when you want to index based on an expression rather than a single value.

Is there a reason why this should be core and not an add-on in the asset library?

Multidimensional arrays are used frequently enough to warrant being a part of core. Also, come to think of it, I'm not sure GDScript actually has support for CSV indexers.

Calinou commented 1 year ago

Adding new Variant type has a performance cost, so it should be carefully considered. It needs to be used often enough within the engine to be worth doing.

AThousandShips commented 1 year ago

Also related to #5148

vpellen commented 1 year ago

@Calinou New variant.. wait, are all array types independent variants? I thought they were a part of that giant class pool.. although now that you mention it, I guess they're not part of the Object inheritance tree.. Maybe arrays in general need some kind of overhaul so that we don't need to clog the variant list with half a dozen different heterogenous arrays? But maybe that ties into bigger problems that may just eventually get solved indirectly.. I know there's that proposal regarding GDScript compilation.

The woes of dynamic typing, I suppose.

vpellen commented 1 year ago

Maybe what we really need is just some kind of configurable indexer? Give all array types an extra 16 bytes for dimensionality that only comes into play when you invoke the multidimensional indexer?

KoBeWi commented 1 year ago

Array of arrays doesn't sacrifice performance. The only cost is initializing it. If we had "real" 2D arrays they'd internally be arrays of arrays anyway.

If anything, I think we just need a helper method to easily initialize multidimensional arrays.

vpellen commented 1 year ago

In the interests of being fair, I decided to actually test the comparative performance of arrays of arrays vs manual indexing.

extends Node

func _ready():
    randomize()

func _process(delta):
    test()

func test():
    var a : PackedInt32Array
    var b : Array[PackedInt32Array]

    a.resize(1024 * 1024)
    b.resize(1024)
    for i in 1024:
        var r : PackedInt32Array
        r.resize(1024)
        b[i] = r

    for y in 1024:
        for x in 1024:
            a[y * 1024 + x] = randi()
            b[y][x] = randi()

    var lookups : PackedInt32Array
    lookups.resize(2_000_000)

    for i in 2_000_000: lookups[i] = randi_range(0, 1023)
    var ta = Time.get_ticks_usec()

    for i in 1_000_000:
        var x : int = lookups[i as int * 2]
        var y : int = lookups[i as int * 2 + 1]
        a[y * 1024 + x] += 1
    ta = Time.get_ticks_usec() - ta

    for i in 2_000_000: lookups[i] = randi_range(0, 1023)
    var tb = Time.get_ticks_usec()
    for i in 1_000_000:
        var x : int = lookups[i as int * 2]
        var y : int = lookups[i as int * 2 + 1]
        b[y][x] += 1
    tb = Time.get_ticks_usec() - tb

    print("Manual Indexing: ", ta / 1000000.0)
    print("Array of arrays: ", tb / 1000000.0, "\n")

At least on my machine, the performance differences were admittedly imperceptible. I still think multidimensional arrays (or at least a multidimensional indexer on existing arrays) would be a good idea, but I can't really claim a speed deficit as a motivating force.

Edit and side note: I am astonished at just how significantly things speed up when you make typing explicit.

lostminds commented 1 year ago

I think this would be a good idea. Perhaps most of all to help new developers have a nicer time starting out with GDScript and improve their on-boarding experience. Since making something with a 2d grid of some sort is so common I imagine a lot of new users will be tripped up very early on by not being able to easily make a 2d array to store their game data state in.

vpellen commented 1 year ago

Hello, I'm back.

So I ran into this issue again while testing out some unrelated code, and I managed to produce a fairly straightforward test case where nested arrays were considerably slower. After doing various investigations trying to figure out why it didn't manifest in my earlier tests, I realized that I made a rather blind mistake in my original test code: I was trying to verify the impacts of cache locality on what was, in reality, a relatively small array. So I wrote a new test that doesn't try to cache random lookups, and which also operates on an array of larger dimensions:

func test_new():
    var dimension_size := 2**14

    var array_array : Array[PackedInt32Array] = []
    array_array.resize(dimension_size)
    for i in dimension_size:
        array_array[i].resize(dimension_size)

    var packed_array := PackedInt32Array()
    packed_array.resize(dimension_size ** 2)

    var test_count := 10_000_000

    var taa := Time.get_ticks_usec()
    for i in test_count:
        var x := randi_range(0, dimension_size - 1)
        var y := randi_range(0, dimension_size - 1)
        array_array[y][x] += randi()
    taa = Time.get_ticks_usec() - taa

    var tpa := Time.get_ticks_usec()
    for i in test_count:
        var x := randi_range(0, dimension_size - 1)
        var y := randi_range(0, dimension_size - 1)
        packed_array[y * dimension_size + x] += randi()
    tpa = Time.get_ticks_usec() - tpa

    print(" Array Array: ", taa / 1_000_000.0)
    print("Packed Array: ", tpa / 1_000_000.0)

With this test, the nested arrays run about 25% slower, at least on my machine. Results will likely vary based on the machine running the test and the size of the array, but once you start hitting the upper end of the cache, the impact is both consistent and substantial. That said, I still believe the primary impact on this is convenience rather than performance.

I've been thinking about how this issue could be solved in a way that is as minimally disruptive as possible, and one idea I had was to just give each array three or four extra integers representing dimension sizes, and a multidimensional array indexer that uses them to perform the trivial multiplication involved in a multidimensional array lookup. The memory overhead would be something in the realm of 32 bytes per array, which I feel would be fairly insignificant for what it would buy, though I'm not sure of the syntactical implications of trying to rope a multidimensional indexer into GDScript.

I still think this is a recurring issue with widespread applications, but given I lack the knowledge required to implement such a feature myself, it seems more likely that some future developments will render it redundant (like that GDScript-to-C refactor being talked about in combination with operator overloading and better custom types at some point in the future perhaps). Maybe some of the issues involved with this will also become minimized over time if typed arrays ever manage to evolve to the point where they can reliably merge with packed arrays.

MichaelMacha commented 1 year ago

I bumped into this issue recently when I was attempting to keep an array to signify whether land was cleared and buildable in my city-builder. Not terribly demanding for performance; I just need to set it all to true, or a noise value, at the beginning of the game; and access it by index later; so mild slowdown is acceptable for convenience.

I think a lot of this would be substantially easier if it was possible to declare a type that had nested arrays in it; i.e. Array[Array[int]]. Unfortunately GDScript does not support this yet, either, and we're still kind of stuck with that nebulous Variant thing or me doing my own indexing math.

My "workaround" so far has been either to just tolerate the slowdown, or jump to GDExtension and hash it out in C++; and while I love GDExtension, this feels like a very extreme and arbitrary case for its use. The whole point of GDScript is to give us the most padded-cell language possible for front end stuff, zero or near-zero chance of a CTD. Breaking that for the sake of a multidimensional array feels wrong.

The other possibility has been to use image textures, depending on the data content. If you turn off anti-aliasing/smoothing and reference a pixel by coordinate, you can get and set its data. If it's three dimensional you can use a texture array or maybe color channels. Even better, it's easily transferable to a shader. That said, this is still a workaround and it will only work for numerical values, or something that can be translated into an RGB-ish format, which the majority of cases cannot.

Ivorforce commented 1 month ago

As a data scientist, I am well aware of how useful some tensor math can be, and I've missed similar functionality in my past Godot projects quite frequently.

However, we should first find out if there is any demand for tensor maths in the Godot environment at all. While there are benefits to integrating it into the main engine, there is no real reason not to try it as a gdextension first. If it turns out people really need it, it can always be put into the main engine later.

That being said, that's pretty much what I did: I wrote a (proof of concept) tensor math library that binds to gdscript. You can find it here: https://github.com/Ivorforce/NumDot

Help and feedback are welcome, although the core of the project was quite a bit easier than I thought it might be.

godotengine / godot-proposals