Open ConnorGray opened 5 months ago
For anyone wanting to do this ticket, you can look at this diff for a good example of what to do: https://github.com/modularml/mojo/pull/2677/files (file stdlib/src/utils/static_tuple.mojo) and you can also use the Indexer
trait which is the right trait to work with when using indexes.
Intable
or Int
are not recommended when indexing something because
1) Int is very restrictive, we might want to index with types like UInt32
or Int64
2) Intable
is too broad and allows floats to be used for indexing which is obviously not correct since we can't do my_list[4.5]
I can take this up
For anyone wanting to do this ticket, you can look at this diff for a good example of what to do: https://github.com/modularml/mojo/pull/2677/files (file stdlib/src/utils/static_tuple.mojo) and you can also use the
Indexer
trait which is the right trait to work with when using indexes.
Intable
orInt
are not recommended when indexing something because
Actually, Int
is implicitly convertible from Indexer
, and it is currently the recommended way to define __getitem__
and other methods that take indices.
@laszlokindrat It may cause some confusion for readers of the code. One might assume that when implicitely converting to an Int
, the method __int__
is used (because they have the same name, and it's more used in Python than __index__
).
Using Indexer
explicitely might also make it easier to explore our options when the compiler drops the auto-conversion of types when a default constructor is available.
@JoeLoser might have a different take on this one since he merged https://github.com/modularml/mojo/pull/2800 recently, which made use of the Indexer
trait.
Anyway I don't want to get into a big debate here. If we don't agree on the subject, we can always talk about this again after the compiler drops the auto-implicit-conversion nonsense.
@laszlokindrat It may cause some confusion for readers of the code. One might assume that when implicitely converting to an
Int
, the method__int__
is used (because they have the same name, and it's more used in Python than__index__
).
We document the implicit conversion behavior in Indexer
. Also, in Python, the distinction between __int__
and __index__
is clear and the use of __index__
for slice creation is well known (which is the assumed semantics for __getitem__
). So I don't see much room for confusion here; you can only pass types implicitly as Int
that are either directly convertible using an Int
constructor (few annointed types), or that implement Indexer
(which implies that the type implementer understands the nuances). I do agree that implicit conversion can be confusing, but I don't think this Intable
vs Indexer
is the main source of that confusion.
Okay then, I won't push the matter further. So far I can only say that it confused me, and I had to double-check for bugs, and I was in this codebase for a while. So I wouldn't be surprised if the same thing happened for new contributors. But maybe it's just me.
We'll see with time if it really becomes an issue or not. Let's go with Int
instead of Indexer
for now.
Why did we remove the mojo-stdlib label?
It got auto removed when the issue got moved internally in our Linear system. I'll add it back. FYI @ewa.
I think this will not happen in the future anymore but flag me if you see anything strange.
hello @ConnorGray , I submitted a PR to address the migration within List
methods (#3400). As well, as of today, __setitem__
and __get_ref
methods do not exist anymore so you can remove them from the todo list I think.
hello @ConnorGray , I submitted a PR to address the migration within
List
methods (#3400). As well, as of today,__setitem__
and__get_ref
methods do not exist anymore so you can remove them from the todo list I think.
There were already PRs addressing this (https://github.com/modularml/mojo/pull/3400#issuecomment-2306919174)
@laszlokindrat
Actually,
Int
is implicitly convertible fromIndexer
, and it is currently the recommended way to define__getitem__
and other methods that take indices.
It has performance implication, currently there is a branch in each call to normalize_index
, checking if it is less than 0, which can be avoided in case of a UInt
Indexer.
one possible solution to avoid the branch would be to add an overload to the __index__
method
fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
"""
Return the normalized index value for a container of the given length.
Parameters:
insertion: A boolean indicating whether the index specifies an insertion location (True) or an element location (False).
Example - Insertion location: list.insert(len(list), item)
Example - Element location: list[len(list)] (raises out of bounds error).
Args:
length: The length of the container, used for bounds checking and index normalization.
Returns:
A `UInt` representing the index value.
"""
...
The Uint
implementation can just debug assert the bounds
without runtime performance implication of index normalization.
not sure about the name insertion
, maybe slice index
and element index
is better.
It has performance implication, currently there is a branch in each call to
normalize_index
, checking if it is less than 0, which can be avoided in case of aUInt
Indexer.
Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define __getitem__
with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType)
.
@laszlokindrat
Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define getitem with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType).
having the method being parameterized on the indexer type is safe, performant and shorter instead of defining multiple methods and APIs for many types.
def __getitem__[indexer: Indexer](self, idx: indexer) -> T:
var uidx: UInt = idx.__index__(len(self))
return self.data[uidx]
this is both safe and performant since for UInt the implementation will be something like:
fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
@parameter
if insertion:
debug_assert(self <= length)
else:
debug_assert(self < length)
return self
which will be optimized away when compiling without assertions. and for Int it will be:
fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
@parameter
if insertion:
debug_assert(-length <= self <= length)
else:
debug_assert(-length <= self < length)
if self < 0:
return UInt(self + length)
return UInt(self)
still having the normalization and bounds checking.
Yeah, it's because we want the default behavior to be safe. If you need extra perf, you can always define getitem with a different signature, or have an explicitly unsafe (but more performant) get_unsafe(idx: YourIndexType).
having the method being parameterized on the indexer type is safe, performant and shorter instead of defining multiple methods and APIs for many types.
The recommendation is there because we expect __getitem__
to work with Int
and support negative indexing safely (i.e. roughly Python semantics), while also keeping the implementation simple. For example, the method signature you provide above is not something familiar to a Python user:
def __getitem__[indexer: Indexer](self, idx: indexer) -> T:
and requires understanding traits. Also, I would consider it too permissive, and if you really wanted to do this, I would encourage
def __getitem__[indexer: Indexer, //](self, idx: indexer) -> T:
Clearly, def __getitem__(self, idx: Int) -> T:
is simpler.
I also must note that this recommendation does precede UInt
, and assumes the current definition and semantics of Indexer
, i.e. a trait that enables lossless conversion to Int
(because that's what the index
function does). What you are suggesting is a change to Indexer
, so we would have something like:
trait Indexer:
fn __index__(self) -> Int: ...
fn __index__[insertion: Bool = False](self, length: UInt) -> UInt: ...
First, I'm not even sure defaults can be declared in trait signatures yet. But assuming we have that (or omit the default), I would argue that we should not force everyone trying to opt into using index
to implement another, more complicated method. If we had default implementations in traits, we could do:
trait Indexer:
fn __index__(self) -> Int: ...
fn __index__[insertion: Bool = False](self, length: UInt) -> UInt:
return self.__index__().__index__[insertion](length)
with the implementation you had above for Int
, which could be nice.
There is also the closely related problem of indexing with Int8
, UInt8
, and other SIMD types without conversion to Int
(or maybe UInt
). If we want to solve efficient indexing with UInt
, maybe we should look at these as well.
If you are interested in this, a design doc would be most welcome! In particular, I would love to see an audit of which types would need to override the second __index__
method, and where the insertion
parameter would be used. Also, I would love to hear ideas on what the semantics of Indexer
would be with respect to convertibility to UInt
. I.e. do we now need lossless conversion to both Int
and UInt
? Notably, UInt
currently doesn't implement Indexer
because such conversion is fallible (which makes me wonder if parametric raising is a prereq for the right solution here).
@laszlokindrat I am not familiar enough with mojo to write a full design document, but I do have some ideas, maybe not helpful, but might inspire someone else who is more knowledgeable.
With regards to the signature of __getitem__
for the stdlib
types:
I obviously agree with you about being inferred.
with regards to signature complexity I don't think it matters too much in this case, because:
getitem
brackets, and usually users only look at function signatures they don't know how to use.As far as I can tell, the only types that would need to implement the second method would be: SIMD
, Int
, UInt
. other types will just return these ones.
With regards to lossy conversion to Int, I think this can be avoided by changing the signature, while still supporting Int
and index normalization:
trait Indexer:
alias BaseIndexer: Indexer # automatically inferred
alias UnsignedType: UInt | UInt8 | UInt16 | UInt32 | UInt64 # not sure how in mojo
fn __index__(self) -> BaseIndexer: ...
fn __index__[insertion: Bool = False](self, length: UnsignedType) -> UnsignedType:
return self.__index__().__index__[insertion](length)
The first index method will require to return any Indexer
.
When users implement the __index__
method they can use the normal signature returning an Int
. since Int
is another Indexer
, it is allowed.
The stdlib base indexers SIMD
, Int
, UInt
will return self
, they will be the base of "the indexer chain".
This way UInt
will return UInt
and won't be lossy converted to Int
. I think it's ok because for python users they are all int
doesn't matter if signed/unsigned or what size.
With regards to supporting Indexes of different sizes:
By providing to the second method a length of some UnsignedType
you are expecting to get back an Unsigned Index of the same type and size.
The length and the return Index have to be Unsigned
since it represents size. At the end of the "Indexer chain" in the stdlib container types, this method will be called to convert the provided Indexer to Unsigned Index of the container size type.
Sending a length smaller than the Indexer size (e.g. Int64.__index__(UInt8)
) could result in out of bounds and ideally should be a compile time error using conditional conformance, and users will be forced to downcast manually when they know it is safe. if not possible, the debug assert will catch the out of bounds error at runtime.
With regards to support for non scalar SIMD with multiple elements, either:
ATM the only two operation I can think of for the insertion parameter are slicing/insertion, but conceptually it should be familiar to anyone who understands slicing:
0 1 2 3 - slice index
|0|1|2| - item index
I think the second method will only be used by the stdlib container types and library authors which are advanced enough to understand the parameter meaning. But as you mentioned the insertion parameter is not mandatory for the implementation.
2677 added a new
normalized_index[..](..)
helper function to the standard library, used to perform bounds checking on indexes into collection types. This replaces boilerplate bounds checking and negative index normalization logic like:with a function call:
That PR updated only
InlineArray
to use the new helper function.We should update other collections in the standard library to use this function as well, including:
List.__setitem__
List.insert
List.pop
List.__getitem__
List.__get_ref
InlinedFixedVector.__getitem__
InlinedFixedVector.__setitem__
InlineArray.__getitem__[index: ..]
— will require callingnormalize_index
at compile time (e.g.alias normalize_idx = normalize_index["InlineArray"](i, size=size)
Please limit the number of call-sites updated to not more than 4-5 per PR, to make reviewing easier.
@gabrieldemarmiesse @laszlokindrat FYI