google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License
2.31k stars 209 forks source link

allow Values to define < operator for arbitrary RHS operand types (with arbitrary result type) #541

Closed hungtcs closed 5 months ago

hungtcs commented 5 months ago

Hi All, Python is a flexible programming language, and so is Starlark! In Python, you can override comparison operations by overriding methods like __lt__, this can also be Implemented in starlark by implementing the CompareSameType interface. However, the starlark implementation lacks some flexibility. That is, I cannot customize the return value type of the comparison operation, This is actually very useful, for instance, Series in pandas allows for quick filtering by overriding the comparison operation:

You can run the following code here

import pandas as pd

s = pd.Series([15,12,13,22,32,11])
print(s)
print(s[s > 13])
print(s == 12)

Every type that implements the Value interface provides a Truth() Bool method, so it should be easy to convert any starlark value to a bool value.

https://github.com/google/starlark-go/blob/9b43f0afd5217a6556578fe90aafdda24cf7d532/starlark/value.go#L100-L101

adonovan commented 5 months ago

This is a change to the language spec, so the correct place to request it is at https://github.com/bazelbuild/starlark.

But I think it is unlikely to be approved, for two reasons, one stylistic and one practical:

hungtcs commented 5 months ago

Hi @adonovan

I'am sorry, I may not have stated my intention clearly. This is not a change to the starlark language specification, I'm not trying to support comparison overloading for starlark itself, just a few tweaks to the Go code implementation of Compare.

I created a simple PR to illustrate this: #542

With these adjustments, types that implement the CrossComparable (don't worry about the name. I just made it up.) interface can now return any Value type in their CompareWithType method, while the interpreter needs to defer to the Value's Truth() value.

Now suppose I implement a custom Series type, similar to Pandas' Series. It implements the CrossComparable interface and returns a new Series of type Bool in the CompareWithType method. It also implements the HasSetKey interface and supports Bool Series as SetKey values.

With the above steps, you can achieve something similar to the following code, and, I don't think it conflicts with the starlark specification.

import pandas as pd

s = pd.Series([15,12,13,22,32,11])
print(s)
print(s[s > 13])
print(s == 12)

@adonovan looking forward to your comments.

hungtcs commented 5 months ago

Here's the code I implemented via the API above.

package main

import (
    "fmt"

    "go.starlark.net/starlark"
    "go.starlark.net/syntax"
)

type Series struct {
    list *starlark.List
}

// Freeze implements starlark.Value.
func (s *Series) Freeze() {}

// Hash implements starlark.Value.
func (s *Series) Hash() (uint32, error) {
    return s.list.Hash()
}

// String implements starlark.Value.
func (s *Series) String() string {
    return s.list.String()
}

// Truth implements starlark.Value.
func (s *Series) Truth() starlark.Bool {
    return starlark.True
}

// Type implements starlark.Value.
func (s *Series) Type() string {
    return "Series"
}

// Index implements starlark.HasSetIndex.
func (s *Series) Index(i int) starlark.Value {
    return s.list.Index(i)
}

// Len implements starlark.HasSetIndex.
func (s *Series) Len() int {
    return s.list.Len()
}

// SetIndex implements starlark.HasSetIndex.
func (s *Series) SetIndex(index int, v starlark.Value) error {
    return s.list.SetIndex(index, v)
}

// Get implements starlark.HasSetKey.
func (s *Series) Get(k starlark.Value) (_ starlark.Value, found bool, err error) {
    // series key
    if ss, ok := k.(*Series); ok {
        var data = make([]starlark.Value, 0)
        for i := 0; i < ss.Len(); i++ {
            if ss.Index(i).Truth() {
                data = append(data, s.Index(i))
            }
        }
        return &Series{list: starlark.NewList(data)}, true, nil
    }

    // int key
    if iv, ok := k.(starlark.Int); ok {
        if ivv, ok := iv.Int64(); ok {
            return s.list.Index(int(ivv)), true, nil
        }
    }

    return nil, false, fmt.Errorf("not supported key: %v", k)
}

// SetKey implements starlark.HasSetKey.
func (s *Series) SetKey(k starlark.Value, v starlark.Value) error {
    panic("unimplemented")
}

// CompareWithType implements starlark.CrossComparable.
func (s *Series) CompareWithType(op syntax.Token, y starlark.Value) (starlark.Value, error) {
    var values = make([]starlark.Value, s.Len())
    for i := 0; i < s.Len(); i++ {
        if v, err := starlark.Compare(op, s.Index(i), y); err != nil {
            return nil, err
        } else {
            values[i] = v
        }
    }

    return &Series{list: starlark.NewList(values)}, nil
}

var (
    _ starlark.Value           = (*Series)(nil)
    _ starlark.HasSetKey       = (*Series)(nil)
    _ starlark.HasSetIndex     = (*Series)(nil)
    _ starlark.CrossComparable = (*Series)(nil)
)

var code = `
s = Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(s)
print(s >= 5)
print(s[s >= 5])
`

func main() {
    var thread = &starlark.Thread{}
    var predeclared = starlark.StringDict{
        "Series": starlark.NewBuiltin("Series", func(thread *starlark.Thread, fn *starlark.Builtin, args starlark.Tuple, kwargs []starlark.Tuple) (_ starlark.Value, err error) {
            var list *starlark.List
            err = starlark.UnpackArgs(fn.Name(), args, kwargs, "list", &list)
            if err != nil {
                return nil, err
            }

            return &Series{list: list}, nil
        }),
    }

    _, err := starlark.ExecFile(thread, "main.star", code, predeclared)
    if err != nil {
        panic(err)
    }
}

Its output is:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[False, False, False, False, False, True, True, True, True, True]
[5, 6, 7, 8, 9]
adonovan commented 5 months ago

I see: you want application-defined data types to be able to define comparison operators as "multimethods", whose behavior depends on the types of both operands, which may be differ from each other. (The only cases like this we currently support are float < int.)

Explaining the two examples:

This isn't a change to the language, so I'll reopen this issue. But I'm unconvinced that this is a good way to write programs. Readers expect that the type of a >= comparison is a boolean, and they expect that the i operand of s[i] where is a a sequence is an integer. Intentionally defeating these conventions, especially in a language that lacks static types, seems likely to cause confusion.

It's also needlessly inefficient. The comprehension [x for x in s if x >= 5] allocates one array whose size is asymptotically equal to the result; by contrast, the expression s[s >= 5] constructs an intermediate array that is the same size as s. If the filter selects only one element from a 1000-element list, this implementation uses 1000 times more memory than necessary.

hungtcs commented 5 months ago

Pandas is a data analysis library for python.

I'll just use Series as an example, since I'm trying to write a pandas-like library for starlark-go.

Many people have become accustomed to the unique syntax of pandas, I try to be as compatible with it as possible, which reduces the learning curve for the user.

Why not just use [x for x in] ?

Series is similar to list, but I implemented a lot of tool methods on it. Comparing a Series to other types (like int、float...) is really the same as comparing it to each item in the Series and generates a new Bool Series.

Series implements the Indexable interface, so it can use the [x for x in] statement, but this causes the result to be of type list without maintaining the original Series type. This makes it impossible to use many of the features offered by Series directly, and the user must manually switch back to Series again.

A similar implementation in Series, which implements the HasBinary interface so that Series can operate on other values and get a new Series. The HasBinary interface works fine for now, it would be nice if the Compare operation could do the same.

a = Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# I've implemented binary operations
b = a * 2
print(b) # Series([0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype='int')

# This is what I'm trying to achieve, I want to get the same effect as binary operations
c = a >= 5
print(c) # Series([False, False, False, False, False, True, True, True, True, True], dtype='bool')

In other words, this feature has no direct impact on the readability of the Starlark language itself. Because everything is provided by application-defined data types, when using application-defined, the user already knows exactly what kind of functionality it will provide and it's all compatible with starlark syntax.

Thanks.

adonovan commented 5 months ago

In other words, this feature has no direct impact on the readability of the Starlark language itself.

But that isn't really true. When a Starlark user sees an expression x < 5, they know they must have a boolean value. When they see seq[i], knowing that seq is an indexable sequence and not a dict, then they know that i must be an integer. These invariants provide vital clues when reading programs in untyped languages. They also provide prompt detection of mistakes: if you index a list with another list, you get a dynamic error, not some unexpected value. When every combination of operator and operand yields some value (as in some satires of the worst aspects of JavaScript), confusion reigns. No doubt these kinds of tricks allows you to make programs shorter, but shorter does not mean more readable.

Upon further consideration, I think this really is a change to essence of the language itself, so I will re-close this issue and recommend that you propose a spec change at https://github.com/bazelbuild/starlark.