golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: spec: generic parameterization of array sizes #44253

Open ajwerner opened 3 years ago

ajwerner commented 3 years ago

See the detailed design here. With the recent acceptance of #43651, we can now begin discussing how to fill in the gaps and omissions of that proposal. This issue provides a proposal to fulfill the following stated omission in the type parameter proposal:

No parameterization on non-type values such as constants. This arises most obviously for arrays, where it might sometimes be convenient to write type Matrix[n int] [n][n]float64. It might also sometimes be useful to specify significant values for a container type, such as a default value for elements.

The structure of the proposal boils down to two extensions:

1) Expression of all array types inside of type lists

Today one can specify a type constraint that a type is an array of some specific, constant size. For example:

type Array8[T any] interface {
    type [8]T
}

Similarly one can enumerate a set of array sizes:

type ArraysOfSomeSizes[T any] interface {
    type [2]T, [4]T, [8]T, [16]T
}

But there's no way to capture the idea of all arrays of a given element type. This proposal introduces the following syntax to capture that idea:

type Array[T any] interface {
    type [...]T
}

This syntax is familiar as it is used for length inference in array literals.

2) Permit constant expressions to determine the length of an array type using len([...]T)

I'm not wedded in any way to the function with this behavior actually being len. It could be some new thing like ArrayLen([...]T) in some package somewhere.

Today Go provides both constant expressions which take types (see unsafe.Sizeof) and, more generally, builtin functions which take types. Additionally, the Go compiler already computes array lengths statically at compile time when it can.

Constant expressions can already be used to size arrays.

Taken together, one can imagine using this constant expression over some parameterized array type to size other array types.

A detailed discussion with examples, background and motivation can be found in this proposal.

ianlancetaylor commented 3 years ago

Thanks. I'm not quite sure about the constant expression bit; it's important to not constrain the compiler implementation. In any case I think we have our hands full with the current proposal. I'm going to put this on hold for now for consideration as a future language extension.

ajwerner commented 3 years ago

Thanks for giving it a look.

I'm not quite sure about the constant expression bit; it's important to not constrain the compiler implementation.

Look forward to working out the implications here for the compiler when the time comes. Will give it a rest for 6-9mo and see how the rest of the type parameter implementation shakes out.

gopherbot commented 3 years ago

Change https://golang.org/cl/301290 mentions this issue: Add proposal for parameterized generic array sizes

rogpeppe commented 3 years ago

A few thoughts:

With regard to the last point, I don't see a huge difference between:

type Array[T any] interface {
    [...]T
}
func concatArrays[T any, A, B Array[T]](a A, b B) [len(A{}) + len(B{})]T {
    var r [len(A{}) + len(B{})]T
    copy(r[:], a[:])
    copy(r[len(a):], b[:])
    return r
}

and:

func mkPair[A, B any](a A, b B) struct{T0 A; T1 B} {
    return struct{T0 A; T1 B}{
        T0: a,
        T1: b,
    }
}

They're both forming a new type with a size dependent on the sizes of two type parameters.

The main thing that concerns me about this proposal is that there is considerable potential for code bloat if the feature is used a lot, particularly if the stenciling or GC stenciling implementation approach is used. But there's also potential for the same kind of issue even without generic arrays so maybe that's not a good argument.

It's just a bit easier with arrays. For example:

const n = 1<<30

func F[T interface {[...]byte}]() int {
    if len(T{}) >= n {
        return len(T{})
    }
    return F[[len(T{})+1]byte]()
}

That particular kind of abuse could be avoided by prohibiting recursive generic functions where the type parameters of the recursive call aren't identical. I haven't yet seen a good use case for allowing that, and prohibiting it avoids a bunch of potential problems.

In general, I'm +1 on this proposal as long as we can be happy that it's not open to serious abuse (e.g. computation in the type system).

Merovius commented 3 years ago

With regard to the last point, I don't see a huge difference between: […] and […]

I agree, but it should be noted that this is not the only kind of code you can write. In particular, you can use len(a) in any expression. This leads to problems.

I think it's generally important to keep the type-checking of the body of a generic function independent of the type arguments. And AFAIK that's the plan, with sentences like "a generic function can use an operation if it's valid for all possible type arguments". (Note: this is what's in the current design.) However, in general, that's not necessarily feasible. A simple example of the kinds of problems we might encounter is

func F[A interface{ [...]T}, T any]() {
    const _ uint8 = uint8(len(A{}))
}

This does not type-check if A is [256]int, for example. Off the cuff, this seems to imply that type-checking generic functions is, again, equivalent to solving SAT and thus NP-complete. That is, you can encode an arbitrary boolean formula using array lengths as the variables.

My example would thus be encoded as (using the type-list syntax for now):

type (
    Var interface { type […]struct{} }
    True [256]struct{}
    False [0]struct{}
)

// (A∨¬B∨¬C) ∧ (¬A∨C∨D)
func F[A, B, C, D Var]() {
    const (
        // for readability
        a, b, c, d = len(A), len(B), len(C), len(D)
        notA, notB, notC = 256^a, 256^b, 256^c
    )
    const _ = uint8( (a | notB | notC) & (notA | c | d) )
}

This fails type-checking if and only if the formula is satisfiable - as long as we limit type-checking to the generic function, using the constraints on the type-parameters only.

Of course, if we are willing to delay type-checking of constant expressions until instantiation, this problem goes away.

e.g. computation in the type system

I think this is excluded by the initialization order section of the spec, which effectively limits the kinds of recursion you can do. But I'm not 100% sure. I believe this at least requires some thinking, because I do think we have some recursion at least.


FTR, this is not limited to len(A{}) though. unsafe.Sizeof has exactly the same problem. That is, you don't need generic arrays to run into this, you can use struct{} and [256]byte for FALSE and TRUE respectively and use unsafe.Sizeof instead of len and get the same argument.

So I actually think we likely can't make len and unsafe.Sizeof (and friends) constant expressions when they are applied to values of generic type, in any case.

Merovius commented 3 years ago

@ianlancetaylor highlighting you to be sure - see my last comment, I think this requires some refinement of the generics proposal as a whole, in terms of what it means for constant expressions.

ajwerner commented 3 years ago

I think this is excluded by the initialization order section of the spec, which effectively limits the kinds of recursion you can do. But I'm not 100% sure. I believe this at least requires some thinking, because I do think we have some recursion at least.

I'm not sure I understand how instantiation of generic types relates to initialization order. Isn't initialization order a runtime property? Are you suggesting that the generic type checking should have to work on the infinite set of types implied by constraints at compile time? I don't see why that has to be the case.


Just to make sure I understand, is this any different without [...]struct{} support? Say instead the constraint were as follows, wouldn't we have the same problem?

Var interface { type [0]struct{}, [256]struct{} }

Merovius commented 3 years ago

I'm not sure I understand how instantiation of generic types relates to initialization order.

Maybe nothing. I was thinking about stuff like this and other recursive type definitions. I'm not quite sure where in the spec this is dealt with right now.

But ultimately, the computation question is off-topic right now :) I haven't found a specific problem (yet). So feel free to disregard that part :)

Just to make sure I understand, is this any different without [...]struct{} support? Say instead the constraint were as follows, wouldn't we have the same problem?

Yes. The problem is specific to len and unsafe.Sizeof and friends being constant expressions under certain circumstances. The problem isn't even restricted to arrays - you can even use byte, struct{}, unsafe.Sizeof and add a couple of bitshifts, to get to the same problem. So it doesn't matter for discussing the introduction of [...]T per se, it only immediately matters for the idea of having len (et al) be constant expressions.

Unfortunately, I think len (et al) not being constant expressions limits the usefulness of array type constraints. You could still write a function that works on any array type, but you could not, for example, write a generic concat function for arrays. At least as far as I can tell. Ultimately, I think you'd be served just as well using slices in that situation - though I'd be ready to be proven wrong.

ajwerner commented 3 years ago

I think the big learning of the day is that it's particularly hard to understand whether something is going to (1) overflow or (2) produce a value that is larger than the address space. As @Merovius has pointed out, we're in a real pickle if we want to determine whether an expression could overflow. Indeed, we've now discussed two different classes of errors which might occur with these generic constant expressions:

(1) is squarely a problem of the type system and is something we must deal with to make this proposal coherent. (2), however, is not a concern of the type checker but rather a concern of compilation; it is a platform-specific concern. As I see it, the type system, in a sense, does not care about the bounds of the memory space (and, relatedly, the word size used for int/uintptr). The type system assumes those could be infinite.

So, below is my thought process and a proposal that attempts to address these problems with rules that are small, simple, and coherent with the concepts of the language.

Rejected Straw-Man: Resolve overflows upon instantiation

One straw-man proposal might be to just ignore the possibility for overflow problems until later in compilation/linking when all instantiations are known. For a given instantiation, it's trivial to determine if overflow occurs. This approach, however, violates a key principle of the Type Parameters Proposal as laid out in the comparison to C++:

This means that changing template code can accidentally break far-off instantiations. It also means that error messages are reported only at instantiation time, and can be deeply nested and difficult to understand. This design avoids these problems through mandatory and explicit constraints.

If we did this checking later and some constant expression deep in the bowels of the library changed, it may break code which had previously worked. Such changes would be hard to detect, but, in a sense, might be considered part of the API.

Proposal: Reject typed constant expressions referencing type parameters

The succinct rule to ward off problem (1) and @Merovius's concerns about detecting overflow is to prevent a constant expression which references a type parameter from ever becoming typed. This, with perhaps minor tweaks to the spec, should work well for the motivating use case. While the language spec requires that the array length must be a constant expression "representable by int", the type checker, as far as I understand, doesn't need to care about the maximum size of an int [1]. This solution does not address memory capacity overflow problems, but, then again, we are plagued by those today [2]. As far as the type system is concerned, we may have infinite memory. This proposal would allow generic data structures but would not allow these constant expressions to be directly assigned to concrete numeric types.

The Go spec has some language describing when a constant expression becomes typed. The extension to the spec would say something to the effect of: all constant expressions which reference type parameters cannot be typed, converted to a type, used in an assignment to a typed constant, or assigned to a variable. The other extension would be to note that len of a typed expression whose type is a type parameter is not a constant when used as part of an expression which must be typed.

Let's explore the implications of this proposal:

type Var interface { type [0]struct{}, [256]struct{} }

func F[A Var]() {
    const _ = uint8(len(A{})) // illegal, cannot convert untyped constant referencing type params to uint8
    var _ uint8 = len(A{}) // illegal, cannot convert untyped constant referencing type params to uint8
    var _ = uint(len(A{})) // legal conversion, but not guaranteed to be evaluated at compile time (though it certainly can be).
}

This proposal also allows a generic concatenation function (assuming we have [...]T syntax) as in https://github.com/golang/go/issues/44253#issuecomment-820999513.

[1] I would like to believe that we'd say that the below program type checks in the go programming language even though it doesn't compile on 32-bit platforms.

func main() {
    fmt.Println(len((*[math.MaxInt32+1]byte)(nil)))
}

[2] consider the following silly program:

type Arr[T any] interface {
    type [1]T, [math.MaxUint32]T
}

type Big2DArr[T any, A Arr[T]] [math.MaxUint32 >> 14]A

From the spec:

The expression len(s) is constant if s is a string constant. The expressions len(s) and cap(s) are constants if the type of s is an array or pointer to an array and the expression s does not contain channel receives or (non-constant) function calls; in this case s is not evaluated. Otherwise, invocations of len and cap are not constant and s is evaluated.

I did not realize when I wrote this that len is already a constant expression! That changes the nature of this proposal a bit I suspect. Mea culpa for not reading the spec closely enough. I'll update this proposal accordingly at some point as it is currently inaccurate.

Merovius commented 3 years ago

To repeat what I just said on Slack: I only used constant overflow because it's an easy-to-trigger compilation failure. But any compilation failure that depends on an integer constant can be used for this.

For example, indexing an array would be suitable, and it doesn't require a typed constant.

Similarly, you could declare an array type and assign that to a statically known array type:

type (
    Var   interface{ type True, False}
    True  [256]struct{}
    False [0]struct{}
)

func F[A, B, C, D Var]() {
    const (
        // for readability
        a, b, c, d       = len(A), len(B), len(C), len(D)
        notA, notB, notC = 256 ^ a, 256 ^ b, 256 ^ c
    )
    type X [(a | notB | notC) & (notA | c | d)]int
    type Y [0]int
    var _ Y = Y(X{}) // valid if and only if the formula always evaluates to false
}

This one is especially interesting, because "using a constant expression derived from one array type's length to define another array type" is the entire point of making len a constant expression. But it can still be used to break type-checking. So ISTM that the goals of this proposal are fundamentally at odds with the design criteria of the generics proposal. But, of course, I might be wrong and it wouldn't be the first time :)

I think the core problem is that it's not entirely clear what the rule "an operation is valid if it is valid for every type fulfilling the constraints" means in many circumstances. For example, the compiler could simply reject this last example whether or not the formula is satisfiable - but that would violate this rule. That might be okay, but we need to make sure it's understandable in what circumstances we violate it, and for which classes of operations; the spec must clearly state that the compiler rejects such a program and why.

I'm also not entirely sure how we would distinguish "cases like this" from more reasonable cases. I.e., I'm not sure what differentiates type X [len(A{})+1]int from type X [(len(A{}) | (256^len(B{})) | (256^len(C{}))) & ((256^len(A{})) | len(C{}) | len(D{}))]int to the compiler. It's clear to me, as a human, that one of them is a reasonable formula used to implement a B+ tree or something while the other is a wild attempt to trick the compiler into doing unforeseen things. But I couldn't draw a line of what makes one okay and the other not.

ajwerner commented 3 years ago

Indexing into an array by a constant seems like a non-issue. If the type is constrained to be the set of all array lengths, then no constant index is valid for all of those lengths. Indexing into an array by a variable is a runtime check.

The casting case, however, is more interesting. Hopefully there's a nice out there too. I'll think on it. Thanks!

Merovius commented 3 years ago

Indexing into an array by a constant seems like a non-issue. If the type is constrained to be the set of all array lengths, then no constant index is valid for all of those lengths.

To clarify the same way I did on slack, this is the real idea:

func F[A, B, C, D Var]() {
    const (
        // for readability
        a, b, c, d       = len(A), len(B), len(C), len(D)
        notA, notB, notC = 256 ^ a, 256 ^ b, 256 ^ c
    )
    type X [10]int
    var _ = X[(a|notB|notC)&(notA|c|d)] // valid if and only if the formula always evaluates to false
}

The array bounds are fixed - the index is varied.

ajwerner commented 3 years ago

Fortunately the spec and the above constraint prevent this one too.

The spec says:

the index x must be of integer type or an untyped constant

As noted above, anything that's a constant expr involving a type parameter must be an untyped constant.

The spec also says:

a constant index that is untyped is given type int

We've already noted that typing an untyped expression such as this is disallowed.

https://golang.org/ref/spec#Index_expressions

rogpeppe commented 3 years ago

This proposal also allows a generic concatenation

I'm fairly sure it doesn't unless the return type is a slice rather than an array (which is fine - array concat is not necessarily a useful operation).

Allowing array lengths as const expressions (and hence arithmetic) does seem to open the gateway to a world of potential pain, regardless of this proposal. Without that, I don't think this proposal is so useful, as you can't form any array types other than the original (for example an array of bool to mark validity of elements).

In the end, this is just a performance issue and perhaps the consequent overhead of using slices rather than arrays is worth it for the avoidance of all those language issues.

ajwerner commented 3 years ago

@rogpeppe I think the proposal would permit the following:

type Array[T any] interface{ [...]T }
type ArrayConcat[T any, A, B Array[T]] interface { [len(A{}) + len(B{})]T }

func Concat[T any, A, B Array[T], R ArrayConcat[T, A, B]](a A, b B) R {
    var r R
    copy(r[:], a[:])
    copy(r[len(a):], b[:])
    return r
}

ianlancetaylor commented 3 years ago

Personally I tend to think that len called on a value of parameterized array type should return a non-constant value of type int. That is what happens if you call len on a slice, after all. The rules for constants in Go are complex enough that I think we need to keep them out of generic code.

The same applies to unsafe.Sizeof and unsafe.Offsetof.

If we did this, it would make it harder to write certain kinds of code, like defining a new type that was an array whose length was the same as that of some type parameter. But I think that we can live with that.

ajwerner commented 3 years ago

My short-term preference would be to disallow unsafe.Sizeof, unsafe.Offsetof, and len on type parameter expressions altogether. For arrays, if you want to call len, you could be required to add a [:] to make it a slice, or something similar. The reason to disallow them is that it preserves future flexibility. Once you commit to those functions being non-constant runtime expressions, it seems that you're sort of stuck.

bcmills commented 3 years ago

@ajwerner, is it not backward-compatible to change a non-constant expression to a constant of the same type?

(It isn't backward-compatible to change a variable to a constant, because the variable can be reassigned or have its address taken — but a call-expression for len or unsafe.Sizeof is not addressable and nothing can be assigned to it.)

ajwerner commented 3 years ago

@bcmills that's a good point. In fact, all 3 of these functions already return typed values in constant expressions. In effect, I guess just making them non-constant expressions when involving type parameters preserves flexibility.

but a call-expression for len or unsafe.Sizeof is not addressable

Yet. :eyes: on https://github.com/golang/go/issues/45624.

fumin commented 2 years ago

A real world example for such a need is the rtreego library. This proposal would in particular avoid this kind of painful change https://github.com/patrick-higgins/rtreego/commit/50fb352b9b2b46ac511d907162c3dfc0c0d6b2f0

mlevieux commented 2 years ago

Hi,

I'm currently trying to change several of the slices in our codebase into arrays because we always know their size and it does not change. One of the "issues" I'm facing is that json does not behave the same with []byte as with [N]byte, so I'm searching for a simple way to make the change without breaking everything or starting a huge migration.

This proposal would seem to answer my problem, but I see no way of instantiating such a generic... I see how it is usable in functions like those presented in this discussion, but say I want to have a generic array with custom marshalling/unmarshalling routines, I'd probably go with something like the following:

type GenericArrayCustomMarshal[T interface{ [...]byte }] T

func (arr *GenericArrayCustomMarshal[T]) UnmarshalJSON(p []byte) error {
    var b []byte
    err := json.Unmarshal(p, &b)
    if err != nil {
        return err
    }

    copy(*arr, b)
    return nil
}

except how do I instantiate a struct containing a GenericArrayCustomMarshal field? How do I tell it that its generic constant expression resolves to, say, 3 in this structure? Also, would the copy builtin work as I wrote it? I'm not sure :thinking:

I like this proposal but I'd say we need a way to specify the constant expression as for types:

// const generic only
type GenericByteArray[const n] interface{
    ~[n]byte
}

// const generic + type parameter
type GenericArray[const n, T any] interface {
    ~[n]T
}

This would be allowed to be instantiated with any constant expression in place of const n, as long as it represents an int. Hence a construct like GenericArray[4*len("hello"), byte] would be a valid instantiation.

DeedleFake commented 2 years ago

First of all, you don't need the interface wrapper. type E[T interface { K }] ... is equivalent to type E[T K] ..., so you could just do type GenericArrayCustomMarshal[T ~[...]byte] T.

Second of all, while it's a little odd, you could instantiate it without having to pass a constant into the types by doing this:

type CustomMarshalArray[T ~[...]E, E any] struct { V T }

type StructWithCustomMarshalArray struct {
  Array CustomMarshalArray[[3]byte, byte]
}

This would be simplified if E could be inferred from T, but Go doesn't seem to be able to do that at the moment.

emil14 commented 1 year ago

Another use case for this is creating a higher-level language that compiles down to Go. If that language has arrays, a very large amount of Go code has to be generated for each array type. The ability to generalize over array size could reduce the amount of Go code generated.

BTW, the same thing goes for structs (it's a different topic though) - the ability to express types like "any struct" or "a struct with at least these fields" could help in tasks where Go is a compile target. But the problem is Go's subtyping - it's not structural for structs, only for interfaces.

ianlancetaylor commented 1 year ago

Another use-case for this is creating a more high-level language that compiles down to Go. If that language has arrays then there's need to generate very big amount of Go code per each array type. Ability to generalize array size could reduce amount of Go code generated.

I don't want to side-track on this, but I would expect that for this case it would be appropriate to convert the higher-level arrays to Go slices. Most Go programs do not use arrays at all. They use slices for everything.

(This doesn't mean that this issue is useless. There are reasons to use arrays rather than slices. But it is uncommon.)

emil14 commented 1 year ago

@ianlancetaylor yes, and maps/slices could also be used instead of structs. But that adds performance/memory overhead

oxplot commented 1 year ago

I have a use-case for this:

A vector database that allows adding N-dimensional vectors and later looking up the vectors nearest to a given vector. Since it only makes sense for all vectors to have the same dimension, it'd be nice to be able to define a generic VectorDB type that's parametric over the array size (the code below uses the made-up syntax [...]float64 to mean an array of float64 of any size):

package main

import (
    "fmt"
    "sort"
)

type vectorDB[T [...]float64] struct {
    vectors map[T]string
}

func NewVectorDB[T [...]float64]() *vectorDB[T] {
    return &vectorDB[T]{map[T]string{}}
}

func (db *vectorDB[T]) Add(vec T, text string) {
    db.vectors[vec] = text
}

func (db *vectorDB[T]) Lookup(vec T, ktop int) []string {
    type vecDist struct {
        v T
        d float64
    }
    d := make([]vecDist, 0, len(db.vectors))
    for v := range db.vectors {
        d = append(d, vecDist{v, calcDist(v, vec)})
    }
    sort.Slice(d, func(i, j int) bool { return d[i].d < d[j].d })

    if len(db.vectors) < ktop {
        ktop = len(db.vectors)
    }
    results := make([]string, 0, ktop)
    for _, i := range d[:ktop] {
        results = append(results, db.vectors[i.v])
    }
    return results
}

func calcDist[T [...]float64](a, b T) float64 {
    var d float64
    for i, av := range a {
        d += (av - b[i]) * (av - b[i])
    }
    return d
}

func main() {
    db := NewVectorDB[[5]float64]()
    db.Add([5]float64{1, 2, 3, 4, 5}, "apple")
    db.Add([5]float64{1, 9, 3, 4, 5}, "orange")
    fmt.Printf("Result: %s\n", db.Lookup([5]float64{9, 1, 2, 3, 4}, 1)[0])
}