modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/
Other
23.3k stars 2.59k forks source link

[stdlib] Use `normalize_index` throughout the standard library #2948

Open ConnorGray opened 5 months ago

ConnorGray commented 5 months ago

2677 added a new normalized_index[..](..) helper function to the standard library, used to perform bounds checking on indexes into collection types. This replaces boilerplate bounds checking and negative index normalization logic like:

        debug_assert(-size <= int(index) < size, "Index must be within bounds.")
        var normalized_idx = int(index)
        if normalized_idx < 0:
            normalized_idx += size

with a function call:

        var normalized_index = normalize_index["InlineArray"](index, self[])

That PR updated only InlineArray to use the new helper function.

We should update other collections in the standard library to use this function as well, including:

Please limit the number of call-sites updated to not more than 4-5 per PR, to make reviewing easier.

@gabrieldemarmiesse @laszlokindrat FYI

gabrieldemarmiesse commented 5 months ago

For anyone wanting to do this ticket, you can look at this diff for a good example of what to do: https://github.com/modularml/mojo/pull/2677/files (file stdlib/src/utils/static_tuple.mojo) and you can also use the Indexer trait which is the right trait to work with when using indexes.

Intable or Int are not recommended when indexing something because 1) Int is very restrictive, we might want to index with types like UInt32 or Int64 2) Intable is too broad and allows floats to be used for indexing which is obviously not correct since we can't do my_list[4.5]

StandinKP commented 5 months ago

I can take this up

laszlokindrat commented 5 months ago

For anyone wanting to do this ticket, you can look at this diff for a good example of what to do: https://github.com/modularml/mojo/pull/2677/files (file stdlib/src/utils/static_tuple.mojo) and you can also use the Indexer trait which is the right trait to work with when using indexes.

Intable or Int are not recommended when indexing something because

Actually, Int is implicitly convertible from Indexer, and it is currently the recommended way to define __getitem__ and other methods that take indices.

gabrieldemarmiesse commented 5 months ago

@laszlokindrat It may cause some confusion for readers of the code. One might assume that when implicitely converting to an Int, the method __int__ is used (because they have the same name, and it's more used in Python than __index__).

Using Indexer explicitely might also make it easier to explore our options when the compiler drops the auto-conversion of types when a default constructor is available.

@JoeLoser might have a different take on this one since he merged https://github.com/modularml/mojo/pull/2800 recently, which made use of the Indexer trait.

Anyway I don't want to get into a big debate here. If we don't agree on the subject, we can always talk about this again after the compiler drops the auto-implicit-conversion nonsense.

laszlokindrat commented 5 months ago

@laszlokindrat It may cause some confusion for readers of the code. One might assume that when implicitely converting to an Int, the method __int__ is used (because they have the same name, and it's more used in Python than __index__).

We document the implicit conversion behavior in Indexer. Also, in Python, the distinction between __int__ and __index__ is clear and the use of __index__ for slice creation is well known (which is the assumed semantics for __getitem__). So I don't see much room for confusion here; you can only pass types implicitly as Int that are either directly convertible using an Int constructor (few annointed types), or that implement Indexer (which implies that the type implementer understands the nuances). I do agree that implicit conversion can be confusing, but I don't think this Intable vs Indexer is the main source of that confusion.

gabrieldemarmiesse commented 5 months ago

Okay then, I won't push the matter further. So far I can only say that it confused me, and I had to double-check for bugs, and I was in this codebase for a while. So I wouldn't be surprised if the same thing happened for new contributors. But maybe it's just me.

We'll see with time if it really becomes an issue or not. Let's go with Int instead of Indexer for now.

StandinKP commented 5 months ago

Why did we remove the mojo-stdlib label?

JoeLoser commented 5 months ago

It got auto removed when the issue got moved internally in our Linear system. I'll add it back. FYI @ewa.

ematejska commented 5 months ago

I think this will not happen in the future anymore but flag me if you see anything strange.

vguerra commented 2 months ago

hello @ConnorGray , I submitted a PR to address the migration within List methods (#3400). As well, as of today, __setitem__ and __get_ref methods do not exist anymore so you can remove them from the todo list I think.

vguerra commented 2 months ago

hello @ConnorGray , I submitted a PR to address the migration within List methods (#3400). As well, as of today, __setitem__ and __get_ref methods do not exist anymore so you can remove them from the todo list I think.

There were already PRs addressing this (https://github.com/modularml/mojo/pull/3400#issuecomment-2306919174)

yinonburgansky commented 1 month ago

@laszlokindrat

Actually, Int is implicitly convertible from Indexer, and it is currently the recommended way to define __getitem__ and other methods that take indices.

It has performance implication, currently there is a branch in each call to normalize_index, checking if it is less than 0, which can be avoided in case of a UInt Indexer.

one possible solution to avoid the branch would be to add an overload to the __index__ method

fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
        """
        Return the normalized index value for a container of the given length.

        Parameters:
            insertion: A boolean indicating whether the index specifies an insertion location (True) or an element location (False).
                Example - Insertion location: list.insert(len(list), item)
                Example - Element location: list[len(list)] (raises out of bounds error).

        Args:
            length: The length of the container, used for bounds checking and index normalization.

        Returns:
            A `UInt` representing the index value.
        """
        ...

The Uint implementation can just debug assert the bounds without runtime performance implication of index normalization.

not sure about the name insertion, maybe slice index and element index is better.

laszlokindrat commented 1 month ago

It has performance implication, currently there is a branch in each call to normalize_index, checking if it is less than 0, which can be avoided in case of a UInt Indexer.

Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define __getitem__ with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType).

yinonburgansky commented 1 month ago

@laszlokindrat

Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define getitem with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType).

having the method being parameterized on the indexer type is safe, performant and shorter instead of defining multiple methods and APIs for many types.

def __getitem__[indexer: Indexer](self, idx: indexer) -> T:
    var uidx: UInt = idx.__index__(len(self))
    return self.data[uidx]

this is both safe and performant since for UInt the implementation will be something like:

fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
    @parameter
    if insertion:
        debug_assert(self <= length)
    else:
        debug_assert(self < length)
    return self

which will be optimized away when compiling without assertions. and for Int it will be:

fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
    @parameter
    if insertion:
        debug_assert(-length <= self <= length)
    else:
        debug_assert(-length <= self < length)
    if self < 0:
         return UInt(self + length)
    return UInt(self)

still having the normalization and bounds checking.

laszlokindrat commented 1 month ago

Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define getitem with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType).

having the method being parameterized on the indexer type is safe, performant and shorter instead of defining multiple methods and APIs for many types.

The recommendation is there because we expect __getitem__ to work with Int and support negative indexing safely (i.e. roughly Python semantics), while also keeping the implementation simple. For example, the method signature you provide above is not something familiar to a Python user:

def __getitem__[indexer: Indexer](self, idx: indexer) -> T:

and requires understanding traits. Also, I would consider it too permissive, and if you really wanted to do this, I would encourage

def __getitem__[indexer: Indexer, //](self, idx: indexer) -> T:

Clearly, def __getitem__(self, idx: Int) -> T: is simpler.

I also must note that this recommendation does precede UInt, and assumes the current definition and semantics of Indexer, i.e. a trait that enables lossless conversion to Int (because that's what the index function does). What you are suggesting is a change to Indexer, so we would have something like:

trait Indexer:
    fn __index__(self) -> Int: ...
    fn __index__[insertion: Bool = False](self, length: UInt) -> UInt: ...

First, I'm not even sure defaults can be declared in trait signatures yet. But assuming we have that (or omit the default), I would argue that we should not force everyone trying to opt into using index to implement another, more complicated method. If we had default implementations in traits, we could do:

trait Indexer:
    fn __index__(self) -> Int: ...
    fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
        return self.__index__().__index__[insertion](length)

with the implementation you had above for Int, which could be nice.

There is also the closely related problem of indexing with Int8, UInt8, and other SIMD types without conversion to Int (or maybe UInt). If we want to solve efficient indexing with UInt, maybe we should look at these as well.

If you are interested in this, a design doc would be most welcome! In particular, I would love to see an audit of which types would need to override the second __index__ method, and where the insertion parameter would be used. Also, I would love to hear ideas on what the semantics of Indexer would be with respect to convertibility to UInt. I.e. do we now need lossless conversion to both Int and UInt? Notably, UInt currently doesn't implement Indexer because such conversion is fallible (which makes me wonder if parametric raising is a prereq for the right solution here).

yinonburgansky commented 1 month ago

@laszlokindrat I am not familiar enough with mojo to write a full design document, but I do have some ideas, maybe not helpful, but might inspire someone else who is more knowledgeable.

With regards to the signature of __getitem__ for the stdlib types:
I obviously agree with you about being inferred.
with regards to signature complexity I don't think it matters too much in this case, because:

As far as I can tell, the only types that would need to implement the second method would be: SIMD, Int, UInt. other types will just return these ones.

With regards to lossy conversion to Int, I think this can be avoided by changing the signature, while still supporting Int and index normalization:


trait Indexer:
    alias BaseIndexer: Indexer  # automatically inferred 
    alias UnsignedType: UInt | UInt8 | UInt16 | UInt32 | UInt64  # not sure how in mojo

    fn __index__(self) -> BaseIndexer: ...
    fn __index__[insertion: Bool = False](self, length: UnsignedType) -> UnsignedType:
        return self.__index__().__index__[insertion](length)

The first index method will require to return any Indexer. When users implement the __index__ method they can use the normal signature returning an Int. since Int is another Indexer, it is allowed. The stdlib base indexers SIMD, Int, UInt will return self, they will be the base of "the indexer chain". This way UInt will return UInt and won't be lossy converted to Int. I think it's ok because for python users they are all int doesn't matter if signed/unsigned or what size.

With regards to supporting Indexes of different sizes: By providing to the second method a length of some UnsignedType you are expecting to get back an Unsigned Index of the same type and size. The length and the return Index have to be Unsigned since it represents size. At the end of the "Indexer chain" in the stdlib container types, this method will be called to convert the provided Indexer to Unsigned Index of the container size type.

Sending a length smaller than the Indexer size (e.g. Int64.__index__(UInt8)) could result in out of bounds and ideally should be a compile time error using conditional conformance, and users will be forced to downcast manually when they know it is safe. if not possible, the debug assert will catch the out of bounds error at runtime.

With regards to support for non scalar SIMD with multiple elements, either:

ATM the only two operation I can think of for the insertion parameter are slicing/insertion, but conceptually it should be familiar to anyone who understands slicing:

0 1 2 3  - slice index
|0|1|2|  - item index

I think the second method will only be used by the stdlib container types and library authors which are advanced enough to understand the parameter meaning. But as you mentioned the insertion parameter is not mandatory for the implementation.