modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/
Other
23.34k stars 2.6k forks source link

[Feature Request] [proposal] A new `Buffer` type #3797

Open martinvuyk opened 1 day ago

martinvuyk commented 1 day ago

Review Mojo's priorities

What is your request?

This proposal will be very constrained and involve minor changes. It was made as an issue because of that.

Creating a Span-like type as Buffer

The current Span is a struct which is defined as "A non owning view of contiguous data".

Span is slowly becoming the de-facto way to exchange fat pointers over binary and any other collectable types. Span[Byte] is converging into Python's bytes.

Span currently allocates in its __getitem__(self, slc: Slice) -> Self if the step is negative, however if the programmer doesn't read the docstring where it warns that and the underlying UnsafePointer isn't taken ownership of, it will lead to a memory leak. I see more of these cases eventually coming up if we keep developing Span's API further. This opens the door to massive leaks.

Thusly, I propose we create a new type Buffer, which can also own its data.

How?

This part of the proposal can change and morph into something completely different.

Creating a new Pointer

One way would be to change the underlying Pointer to one that can be or not be owned.

Changing the _len field

Another way is to store the information of ownership in the upper bit of the Buffer's _len field, and thus avoiding adding struct padding overhead. The current proposal will be limited to storing a single bit: self_is_owner. Which in the Buffer's destructor would fork it into self._data.free() or doing nothing in case the Buffer doesn't own its data.

Disadvantages

Small issues:

Medium issues:

Changing the _data field

Using a Variant[List[T], Span[T]] as the underlying datatypes and branching on every function implementation for each case.

Future ideas

What we can build on top of this is a way to safely deal with pointer allocations and make people feel comfortable using them directly. We could make Buffer.alloc() a function that does memset_zero() to safely initialize the memory.

We could develop Buffer's API further:

fn main():
    buf = Buffer[Byte].alloc(3)
    buf[0] = ord("h")
    buf[1] = ord("i")
    buf[2] = 0
    print(String(Buffer(other=buf))) # copies
    print(String(buf^)) # zero copy
    # buf does not do self._data.free() inside the String constructor because
    # .steal_data() sets the self_is_owner bit to False.

CC: @owenhilyard this is a bit of an evolved version of what was proposed in #3728 CC: @JoeLoser this is enables #3239 (comment)

lsh commented 1 day ago

Span currently allocates in its __getitem__(self, slc: Slice) -> Self if the step is negative

This is incorrect as far as I know. Can you point out where that happens?

edit: seems like it does. I'm gonna look into this since it seems wrong.

lsh commented 1 day ago

I'm currently -1 on Span owning its data. I don't see a compelling reason articulated here of why Span should fuse with List, which is what you seem to be proposing. You mention the disadvantages, but I'm not clear on the advantages. People familiar with non owning data types in Zig, Go, Rust, or C++ will be familiar with this API, where what you're proposing diverges from what those users will be used to.

martinvuyk commented 1 day ago

Span currently allocates in its __getitem__(self, slc: Slice) -> Self if the step is negative

This is incorrect as far as I know. Can you point out where that happens?

nightly branch /utils/span.mojo

I don't see a compelling reason articulated here of why Span should fuse with List, which is what you seem to be proposing.

This is one of my motivations. Many other APIs would benefit from receiving a pointer which can be owned or not dynamically. You might want mutable references for which you don't own the data or sometimes you do, or APIs that can sometimes return owned or non owned data depending on some dynamic logic.

People familiar with non owning data types in Zig, Go, Rust, or C++ will be familiar with this API, where what you're proposing diverges from what those users will be used to.

They aren't trying to implement very dynamic Python-like logic at runtime. And that's why I also want to change the name (or create a new type with a different name and take the logic there), it's best to distance the concepts as much as possible.

lsh commented 1 day ago

I think I take issue with trying to force the idea of bytes top down onto a type rather than have it built bottom up from more natural systems idioms. Having owned bytes be List[Byte] and non owning be Span[Byte] works just fine, especially since List[Byte] can coerce into Span[Byte]

lsh commented 1 day ago

As for the issue of accidental memory leak: the solution in my opinion is to just not allocate at all.

owenhilyard commented 1 day ago

Strong -1 from me on making span owning. We want to have something which lets us do zero-copy slicing of List, InlineArray, etc. I think that making a DynamicArray[T] type which is owning but doesn't have the extra machinery which List has for resizing is a good idea.

struct DynamicArray[T]:
    var length: UInt
    var data: UnsafePointer[T]

    ...
lsh commented 1 day ago

For what it's worth, this idiom in Rust is a Box<[u8]>. I would be interested in making our OwnedPointer have a Span constructor, but that's a separate issue

owenhilyard commented 1 day ago

As for the issue of accidental memory leak: the solution in my opinion is to just not allocate at all.

+1, span is a zero-copy type, negative slices should probably error since there is no good way to handle them for a zero-copy type.

lsh commented 1 day ago

Span can implement __rev__, so if you want to iterate backwards the solution is probably for i in rev(my_span)

owenhilyard commented 1 day ago

For what it's worth, this idiom in Rust is a Box<[u8]>. I would be interested in making our OwnedPointer have a Span constructor, but that's a separate issue

I made it a separate type because Span is non-owning and has associated origin information, and I consider it closer to &[T] or &mut [T], so I would expect OwnedPointer(span) to produce a type like Box<&[T]>. I am fine with a span constructor which takes a ref to a DynamicArray, which is what I think the [T] type is represented by.

owenhilyard commented 1 day ago

Span can implement __rev__, so if you want to iterate backwards the solution is probably for i in rev(my_span)

+1

owenhilyard commented 1 day ago

In the case of changing to UInt for 32 bit computers, when the datatype is UInt8/Int8 the user will only be able to create a Buffer of up to 2 GiB instead of 4.

This is a fairly large issue considering that many accelerators are 32 bit.

martinvuyk commented 1 day ago

I repeatedly wrote that if we want to keep Span I have no problem with that. I'm not arguing against a non-owning view over data.

My main point here is we need a dynamic type that can sometimes own or not own its data to build Python-like APIs, and that type needs to be central in a lot of APIs for exchanging data because it is needed to build some logic that would horrify anyone with a desire for compile time safety guarantees.

If we go straight for only letting List own its data, we will leave a lot of performance on the table because a lot more allocations will need to be made that we could spare with a dynamic type like with what I'm proposing (which is the main focus here, not getting rid of Span).

lsh commented 1 day ago

In that case it seems like what you're arguing for is a Copy On Write data structure. That's already on the roadmap. See Rust's Cow type for reference.

lsh commented 1 day ago

This is actually possible now that we have associated aliases by the way!! :)

martinvuyk commented 1 day ago

In the case of changing to UInt for 32 bit computers, when the datatype is UInt8/Int8 the user will only be able to create a Buffer of up to 2 GiB instead of 4.

This is a fairly large issue considering that many accelerators are 32 bit.

Yes that is a big limitation (but it's something you can make a workaround for IMO)

In that case it seems like what you're arguing for is a Copy On Write data structure. That's already on the roadmap. See Rust's Cow type for reference.

No, I'm mostly aiming to be able to build a more generic version of Python's bytearray/bytes folded into one type (which works for all collection types and has functions constrained for scalar types). Its API has some very dynamic things that would benefit from not allocating in certain operations when possible and sometimes allocating.

owenhilyard commented 1 day ago

I agree that having List be the only owning type is not a good idea, which is why I pitched DynamicArray as an owning but non-resizable type to fill in that hole.

I agree with Lukas that you seem to be looking for Rust's Cow type.

Most python-like APIs, as soon as you make them multithreaded, fall apart from a safety standpoint because Python has had a giant mutex preventing any issues for 20 years. The entire school of python API design has essentially used !Send types, which we can do, but I think that designing thread-safe APIs is better even if pure python devs need to deal with some learning pains.

martinvuyk commented 1 day ago

I will strongly disagree here, I will not force Python devs to learn Rust idioms or C++ data structures. We need to meet people where they are. We can offer faster and/or safer alternatives for those willing to dive deeper. But we need to make Python-like logic possible and faster than in Python.

lsh commented 1 day ago

I'm not sure how you can write the type you're asking for without making a less generic Cow type. I'm not outright disagreeing with the idea of something more familiar to Python programmers, but if you want to create that API then it will probably use a Cow internally.

owenhilyard commented 1 day ago

I agree with Lukas that the api of python's bytes seems to fit Cow fairly well.

owenhilyard commented 1 day ago

You can make a unified bytes/bytearray type that uses copy on write to handle all of the places where the bytes API forces copies but still do in-place mutation where you are the actual owner of the bytes.

martinvuyk commented 1 day ago

I'm not so sure it will necessarily be CoW internally because you might pass someone a mutable reference with the intent of letting them mutate the data

lsh commented 1 day ago

That's complete fine with CoW. I'm not sure I see the issue?

owenhilyard commented 1 day ago

I will not force Python devs to learn Rust idioms or C++ data structures

Although Nick is working on a way to not fully require alias xor mutable in all cases, we don't want to design swaths of the standard library to not be thread safe. bytearray is, as written for python, inherently thread unsafe. Cow fixes that. If someone wants their code to be thread safe and faster, they are not going to be able to avoid dealing with the consequences of doing multiple things at a time. Copy on write is a computer science concept, not a Rust or C++ concept.

martinvuyk commented 1 day ago

Well, with copy on write I assume it means the data is copied and you get you own version every time you write? I've never used the Rust version, does it let you mutate inplace with a flag?

owenhilyard commented 1 day ago

I'm not so sure it will necessarily be CoW internally because you might pass someone a mutable reference with the intent of letting them mutate the data

In that case we add a third state to the enum with a mutable reference which can be downgraded to an immutable one if you start handing out more references.

Well, with copy on write I assume it means the data is copied and you get you own version every time you write? I've never used the Rust version, does it let you mutate inplace with a flag?

If you own the data, you can mutate in-place. If we take advantage of being able to more easily abstract over mutability in Mojo, we can probably find a way to have an exclusive mutable borrow which can do the same. However, if you try to write to something you have an immutable ref to, it will copy it.

lsh commented 1 day ago

A Mojo version of Cow would look something like:


struct Span[T: CollectionElement, ... ]:
    alias ToOwnedType = List[T]

trait ToOwned:
   alias ToOwnedType
   fn to_owned(self) -> Self.ToOwnedType

struct Cow[T: ToOwned]:
    var value: Variant[T, T.ToOwnedType]

struct Bytes[is_mutable: Bool, origin: Origin[is_mutable].type]:
    var data: Cow[Span[Byte, origin]]

so it would only do a copy the first time you write to it and will only allocate when necessary after that point.

lsh commented 1 day ago

Also unrelated: Mojo already has a type named Buffer in from buffer import Buffer. It's just not open sourced.

martinvuyk commented 1 day ago

If you own the data, you can mutate in-place [...] However, if you try to write to something you have an immutable ref to, it will copy it

Well that's my whole point, we need something that can be mutated without owning the data (or maybe owning it) dynamically. And yes I also like comp time guarantees, but I think this is needed.

so it would only do a copy the first time you write to it and will only allocate when necessary after that point.

It's the copy that I want to avoid, mutation in-place for something you don't own.

In that case we add a third state to the enum with a mutable reference which can be downgraded to an immutable one if you start handing out more references.

Won't that be essentially a Mutex? I'm trying to picture what you mean

Essentially I'm arguing for Pointer-like freedom with some extra safety on top, and trusting the implementation and tests to catch ownership bugs instead of the compiler.

Also unrelated: Mojo already has a type named Buffer in from buffer import Buffer. It's just not open sourced.

Damn, something like DynamicBuffer then lol

owenhilyard commented 1 day ago

A Mojo version of Cow would look something like:

struct Span[T: CollectionElement, ... ]:
    alias ToOwnedType = List[T]

trait ToOwned:
   alias ToOwnedType
   fn to_owned(self) -> Self.ToOwnedType

struct Cow[T: ToOwned]:
    var value: Variant[T, T.ToOwnedType]

struct Bytes[is_mutable: Bool, origin: Origin[is_mutable].type]:
    var data: Cow[Span[Byte, origin]]

so it would only do a copy the first time you write to it and will only allocate when necessary after that point.

We might want a ToMutable to let us have a mutable variant.

lsh commented 1 day ago

I think we won't need a to_mutable. If the lifetime of T is mutable, then you can mutate it in place without a copy. It could look like (since we don't have enums):

if self.is_borrowed():
    self.get_borrowed()[0] = 1
else:
    self.get_owned()[0] = 1
owenhilyard commented 1 day ago

I think we won't need a to_mutable. If the lifetime of T is mutable, then you can mutate it in place without a copy. It could look like (since we don't have enums):

if self.is_borrowed():
    self.get_borrowed()[0] = 1
else:
    self.get_owned()[0] = 1

I think that would work.

owenhilyard commented 1 day ago

In that case we add a third state to the enum with a mutable reference which can be downgraded to an immutable one if you start handing out more references.

Won't that be essentially a Mutex? I'm trying to picture what you mean

Lukas has a better idea.

owenhilyard commented 1 day ago

Essentially I'm arguing for Pointer-like freedom with some extra safety on top, and trusting the implementation and tests to catch ownership bugs instead of the compiler.

That's what a slice type is, it's just that you will need to do tons of copies to meet the API that python has created and some way to figure out who is responsible for freeing the data. Cow lets us cut those copies down and handle the ownership issue.

martinvuyk commented 1 day ago

value: Variant[T, T.ToOwnedType]

Won't that lead to a runtime if check every time you want to do something to it? or is that easily optimizable by the compiler?


Anyway, I'm not sure we have the features to pull it off yet (that internal trait alias thingy), and we can also change the underlying machinery out at any point since it's an implementation detail of the proposed new type. I guess we could just go with Variant[List[T], Span[T]] for now :man_shrugging: WDYT? (again, my main goal is about creating this new type to allow very dynamic code, not defining the exact implementation here)

owenhilyard commented 1 day ago

Won't that lead to a runtime if check every time you want to do something to it? or is that easily optimizable by the compiler?

The other option is the make a very messy API in the type system and use the typestate pattern to manage this. It turns into an if statement at runtime, so all we can do is hope the use has mostly similar things, otherwise the branch predictor gets mad.

martinvuyk commented 1 day ago

The other option is the make a very messy API in the type system and use the typestate pattern to manage this

hmm, I think that will constraint exactly what I want to set free. Or maybe I'm just not used to programming that way.

otherwise the branch predictor gets mad

I think that with Variant the compiler might declare that the variable doesn't get assigned any different type, and can signal that the function returns the same value on every call (?). We might actually get a perf boost here by using const if we get the keyword back.

I guess we could just go with Variant[List[T], Span[T]] for now

I've updated the proposal adding that alternative. I'll work a bit on the idea this weekend/next week (I got a lot of other pending stuff) and maybe even open a PR for a more thorough discussion of the implementation details.

owenhilyard commented 1 day ago

I think that with Variant the compiler might declare that the variable doesn't get assigned any different type, and can signal that the function returns the same value on every call (?). We might actually get a perf boost here by using const if we get the keyword back.

There will still be a runtime branch in unwrapping the variant. If you mix types you store in the variant, it will be poorly predicted.