python-trio / trio

Trio – a friendly Python library for async concurrency and I/O
https://trio.readthedocs.io
Other
6.23k stars 343 forks source link

Names for communication ABCs #1208

Open njsmith opened 5 years ago

njsmith commented 5 years ago

One of the major blockers for stabilizing Trio is that we need to stabilize our ABCs for channels/streams/pipes/whatever-they're-called. We've been iterating for a while, and I feel like we're pretty close on the core semantics, and I really want this to be done, but the fact is that I'm just not happy with the names, and those are kind of crucial. So here's an issue to try to sort it out once and for all.

Along the way, I've realized that there isn't really anything Trio-specific about these ABCs, and there are a bunch of advantages to making these ABCs something more widely used across the async Python ecosystem. So also CC'ing for feedback: @asvetlov @1st1 @agronholm @glyph (and feel free to add others). And I'll briefly review what we're trying to do for context before digging into the naming issue.

What problem are we trying to solve?

There are two main concepts that I think we need ABCs for:

There are many many concrete implementations of each. Trio currently ships with ~7 different implementations of its byte-oriented communication ABC (sockets, TLS, Unix pipes, Windows named pipes, ...) and we expect to add more; it's also an interface that's commonly exposed and consumed by third-party libraries (think of SSH channels, HTTP streaming response bodies, QUIC connections, ...).

Object-wise communication is maybe a bit less familiar because most frameworks don't call it out as a single category, but it's also something you see all over the place. Some examples:

Having standard ABCs for these has a lot of benefits:

If this is such a generic problem, then why should Trio be the one to solve it?

Well, no-one else seems to be working on it, and we need it :-). And of course we'd be happy if our work is useful to more people.

Asyncio will be adding an asyncio.Stream class in 3.8, but it doesn't seem to be intended as a generic abstraction with multiple implementations. It's a specific concrete class that's tightly tied to the asyncio's transport/protocols layer, and it exposes a rich public interface that includes buffering, line-splitting, fixed-length reads, and TLS.

This means that whenever someone needs a new kind of Stream object, they have two options: they can either define a new type that quacks like a Stream, which means they have to re-implement all this functionality from scratch. Or else, they have to implement their new functionality using the transport/protocols layer and then wrap an asyncio.Stream around it; but using the asyncio transport/protocol layer is awkward, adds overhead, and makes it very difficult to support other async libraries. Neither option is very appealing. IMO what we need is a minimal core interface, so that it's easy to implement, and then higher-level tools like buffering, line-splitting, TLS, etc. can be written once and re-used on any object that implements the byte-wise ABC.

And asyncio doesn't have an object-based communication abstraction at all, which IMO is a major missed opportunity.

Twisted OTOH does have standard abstractions for all these things, but they were designed 15+ years ago, long before async/await existed, and I think we can do better now. Heck, even the replacement @glyph's been working on for half a decade now predates async/await.

So the field seems wide open here.

What's already working?

We've been iterating on our APIs for these for a year+ now, and I think we're converging on a pretty solid design. You can see the current version in our docs: https://trio.readthedocs.io/en/stable/reference-io.html#abstract-base-classes

There are some details we're still sorting out – if you're curious see #1125, #823 / #1181, #371, #636 – but I don't want to get into the details too much because they're not really on-topic for this issue and I don't think they'll affect the overall adoption of the ABCs.

OK so what's this issue about then?

Like I said above, Trio currently uses trio.abc.Stream for the byte-wise interface, and trio.abc.Channel[T] for the object-wise interface. These names have two major problems:

So I think the right way to do it is to present the object-wise interface as the more fundamental one, and the byte-wise interface as a specialized variant. And we can communicate that right in the names, by calling the object-wise interface X[T], and the byte-wise interface a ByteX.

And then when someone asks what the difference is between a ByteX and an X[bytes], we can say: This is an X[bytes]. This is a ByteX. (click the images to see the animations)

But what should X be?

Can we steal from another system?

As mentioned above, AsyncIO has a concrete class Stream for byte-wise communication and a concrete class Queue for object-wise communication, but no relevant ABCs.

Go has a concrete type chan for communicating objects within a single process, and abstract Reader and Writer interfaces for byte-wise communication.

Nodejs has an abstract Stream interface, that they use for both byte-wise and object-wise communication. There's a mode argument you can set when creating the stream, that determines whether bytestrings can be rechunked or not. (It's pretty similar to some of the approaches we considered and rejected in #959.) The byte-wise mode is the default, and the object-wise mode seems like an afterthought (e.g. the docs say that nodejs itself never uses the object-wise mode).

Rust's Tokio module has an abstract Stream interface, which is basically equivalent to an async iterator in Python, or what Trio calls a trio.abc.ReceiveChannel[T]. And they also have an abstract Sink interface, which is equivalent to what Trio currently calls a trio.abc.SendChannel[T]. For bytes, they have abstract interfaces called AsyncRead and AsyncWrite. And they have generic framing tools to convert an AsyncRead into a Stream, or convert an AsyncWrite into a Sink.

Java's java.nio library uses Channel to mean, basically, anything with a close method (like Trio's AsyncResource). And then it has sub-interfaces like ByteChannel for byte-oriented communication. I don't think there's any abstract interface for framed/object-wise communication. Java's java.io library uses Stream to refer to byte-wise communication, and Reader and Writer for character-wise communication.

Swift NIO has an abstract Channel interface for byte-wise communication, and I don't see any interfaces for framed/object-wise communication.

My takeaways:

There's no general consensus on terminology. The words "stream" and "channel" show up a lot, but the meanings aren't consistent. "Read" and "write" are also popular, and are used consistently for byte-oriented interfaces.

There isn't any consensus on the basic concepts either! A major part of our ABC design is the insight that object-wise and byte-wise communication are both fundamental concepts, and that it's valuable to have standard interfaces to express both of them, and how they relate. But most of these frameworks only think seriously about byte-wise communication, and treat object-wise communication as an unrelated problem if they consider it at all. Tokio is the main exception, but their terminology is either ad hoc or motivated by other Rust-specific stuff. So we can't just steal an existing solution.

OK so what are our options?

I tried to brainstorm all the potential names I could think of, assuming we go with the X + ByteX pattern described above:

Criteria: I think ideally our core name should be a common, concrete English word that's short to say and to type, because those make the best names for fundamental concepts. It shouldn't be too "weird" or controversial, because people dislike weird names and will refuse to adopt them even if everything else is good. And of course we want it to be unambiguous, so we need to avoid name clashes.

Unfortunately it's really hard to get all of these at once, which is why I got stuck :-)

Stream is really solid on the first two criteria: it's one syllable, common, uncontroversial. But! It clashes with asyncio.Stream. That's not a problem for adoption in the Trio ecosystem. But it might be a major problem if we want this to get uptake across Python more broadly.

The other obvious option is Channel, but I'm really hesitant because it's more cumbersome: 2 syllables on their own aren't too bad, but by the time you start talking about a SocketByteChannel it feels extremely Java, not Python.

For some reason Transport doesn't bother me as much, even though it's two syllables as well, but then you have the clash with the whole protocols/transport terminology, and that also seems like it could be pretty confusing. And we're generally trying to move away from protocols/transports, but I don't want to have to tell beginners "we recommend you use Transports instead of protocols/transports". That's just going to make them more confused.

Among the rest, Tube is tempting as a simple, short, concrete word that doesn't conflict with anything in Trio or asyncio, and fits very nicely with illustrations like the ones I linked above, where there's a literal tube with objects moving through it. But... @glyph has been trying to make his tubes a thing for half a decade now. I think the proposal here is totally compatible with the goals and vision behind his version of tubes, and much more likely to get wider traction going forward. But OTOH the details are very different. So I can't tell whether using the name here would be the sincerest form of flattery, or super-rude, or both at once.

What do y'all think?

smurfix commented 5 years ago

IMO what we need is a minimal core interface, so that it's easy to implement, and then higher-level tools like buffering, line-splitting, TLS, etc. can be written once and re-used on any object that implements the byte-wise ABC

Strongly seconded. In an ideal world we would rework asyncio.Stream to do exactly this, instead of releasing it in 3.8 with all these bells and whistles … but I'm afraid it's too late for that.

I'd go with Flow.

1st1 commented 5 years ago

In an ideal world we would rework asyncio.Stream to do exactly this, instead of releasing it in 3.8 with all these bells and whistles …

Well, in an ideal world we would just have Trio to be the default framework of choice for async in all languages. The reality is different though: asyncio.Stream is designed the way it is designed to preserve backwards compatibility, and so unfortunately there's not a lot of room left to change it. Not including it in 3.8 wouldn't change any of that.

That said, if there's something we can do to make it easier for people to write adapters from Trio streams to asyncio Streams (if that's even a thing) we'll gladly consider that!

sorcio commented 5 years ago
  • There actually is an important conceptual relationship between them that IMO we should emphasize. Conceptually, byte-streams are basically object-streams where the object type is "a single byte". But if you try to use an interface designed for sending/receiving objects to send/receive individual bytes, then it'll be ridiculously inefficient, so instead you need a vectorized interface that works on whole bytestrings at once.

Can this be made part of the same abstract interface? What receive_some() is saying is "I want some sequential items of type T, if you're able to provide them" and returns a Sequence[T]ish. The way the underlying buffer is managed is left to the individual implementation. A default implementation can just basically return [await self.receive()]. Something backed by a file object might translate it into a read(). On the other hand, I believe that a bytestream kind of object can always have a pretty inefficient receive() method that takes a single byte off the underlying buffer, e.g. read(1). I'm not sure this would ever be useful for anything other than bytes, but maybe the mismatch is smaller if we think of stream operations as just "arbitrary sequential chunks of the basic type"?

This would be more problematic for send_all() because it cannot guarantee in general that the sequence is kept intact, i.e. if you translate it into a series of send() you might get interleaved sends from other tasks. But we already have the same problem in Trio Stream right now, and maybe it becomes more obvious.

asvetlov commented 5 years ago

Agree with @1st1 asyncio.Stream was not designed from scratch but just a merge of already existing StreamReader + StreamWriter APIs plus some backward-compatible improvements. Sorry, in asyncio we should provide very conservative backward compatibility politics for any change. Sometimes it prevents us from making code as beautiful as possible but this is the CPython game rule.

smurfix commented 5 years ago

That said, if there's something we can do to make it easier for people to write adapters from Trio streams to asyncio Streams (if that's even a thing) we'll gladly consider that!

Umm, yeah. My take would be to make the whole thing modular.

Details: Add a BaseStream abstract interface that requires just the three basic methods (read_some, write_all, close). Create a TransportStream class that implements this interface in terms of protocol+transport compatibility, ideally with the aim of someday deprecating protocols and transports altogether. (Sync callback hell.) Let Stream be a shallow class that implements all the complicated read and write operations in terms of calling read and write of an underlying BaseStream object. Create a SSLFilter class that implements SSL on top of that. Stream.start_tls can then simply push a SSLFilter below itself. Implement aBufferFilter` that buffers data for you.

Most of these filters would even be framework agnostic; they would work with trio, asyncio, or whateverio. (As long as nobody needs sub-tasks or timeouts, of course.) You don't even need to write the SSLFilter, as Trio already has one. Steal it. Everything else seems reasonably trivial at first glance.

NB: IMHO: Whoever invented a write method which one may or may not await deserves [removed]. Ugh.

1st1 commented 5 years ago

NB: IMHO: Whoever invented a write method which one may or may not await deserves [removed]. Ugh.

Oh, hm, that was me. We're still debating about that though. One of the solutions is to add await Stream.send() and keep Stream.write() as is (i.e. not returning a `Future). Do you think that would be a better idea? (Because we still can change it!)

smurfix commented 5 years ago

Yes, please. Or call it write_all like in Trio.

Arguments against this idea: (a) this prevents automated testing (with mypy et al.) whether somebody forgot an await, (b) prevents automatic warnings when one of these awaitables is garbage collected without having been awaited on (these are very helpful IME), (c) requires everybody who implements this interface to add an unlimited send buffer and to copy/reimplement the special dance you'd need to return such an awaitable.

1st1 commented 5 years ago

Details: Add a BaseStream abstract interface that requires just the three basic methods (read_some, write_all, close). Create a TransportStream class that implements this interface in terms of protocol+transport compatibility, ideally with the aim of someday deprecating protocols and transports altogether. (Sync callback hell.) Let Stream be a shallow class that implements all the complicated read and write operations in terms of calling read and write of an underlying BaseStream object. Create a SSLFilter class that implements SSL on top of that. Stream.start_tls can then simply push a SSLFilter below itself. Implement aBufferFilter` that buffers data for you.

I actually like all of this.

BUT asyncio.Stream.write() is this weird method that's currently can be awaited or not really. asyncio.Stream.close() isn't awaitable and just broken in my opinion. Our goal is to make the API nicer and less confusing, and returning an optional Future did look like an acceptable solution.

1st1 commented 5 years ago

How about Trio renames read_some() to read() as is makes it easier for us. I'm against having read_some() as a simple alias for read().

We can maybe work around the weird nature of write(). Options: revert the change, add send() or writeall(); deprecate write() and drain().

And lastly we'll have a problem with asyncio.Stream.close(), which should be awaitable but isn't quite.

smurfix commented 5 years ago

asyncio.Stream.close() isn't awaitable and just broken

Well, sometimes you really do need to kill off a connection Right Now, no matter what. I'd rename it to abort and deprecate close, though.

The nice, everybody-should-use-it async version might be named aclose

1st1 commented 5 years ago

I've discussed this with @ambv and here's the plan I like for 3.8:


Well, sometimes you really do need to kill off a connection Right Now, no matter what. I'd rename it to abort and deprecate close, though.

I'm fine with adding Stream.abort(), not super excited about your ideas about the close()/aclose() method.

The key question I have now is this: If we do the above, would that be better for Trio in any way?

smurfix commented 5 years ago

The key question I have now is this: If we do the above, would that be better for Trio in any way?

Well, I'm not @njsmith but from my PoV, yes sure. Splitting up Stream would also help a lot, not just for Trio interoperability; I should be able to help out if there's a chance to get that into 3.8 even though my free time is annyoingly limited.

The other problem that we still have is the one this topic is supposed to be about, i.e. naming things. If asyncio is set on keeping plain Stream for what Nathaniel tried to name ByteStream, which I personally can live with, that requires us to come up with a reasonable name for pipes/flows/whatever that happen to not consist of chunks of bytes.

1st1 commented 5 years ago

I should be able to help out if there's a chance to get that into 3.8 even though my free time is annyoingly limited.

Sure, we'd accept help. Splitting up sounds interesting, as I'm a big fan of composable component-oriented APIs such as Twitter's Finagle.

cc @asvetlov

agronholm commented 5 years ago

I'm all for standardization across the board, especially if it ends up allowing me to reduce the code base of AnyIO. Because the semantics of network functionality of the various I/O libraries were so wildly different, I had to invent my own. I would jump at the opportunity to throw out that code and piggyback on the individual network layers of each framework, provided that the semantics were in harmony with each other. But, it seems like there will be a long way to go until that happens.

I have no real preferences about the naming of the ABCs, so long as they're descriptive and everybody agrees on them.

@1st1 Are you still planning to make that asyncio-next library?

njsmith commented 5 years ago

@smurfix

Whoever invented a write method which one may or may not await deserves

Hey man, I know this was humorous exaggeration, but let's criticize tech not people, and skip the violent imagery. I edited your post to remove that bit.

@1st1

How about Trio renames read_some() to read() as is makes it easier for us. I'm against having read_some() as a simple alias for read().

Trio currently calls the method receive_some, and it has different semantics from your read: with no arguments, receive_some returns an arbitrary sized chunk of data, while read with no arguments consumes the whole stream until EOF. This kind of issue is why we've avoided using read as our verb. We also have a dedicated issue for method names with more discussion: #1125

The different names also have a potential advantage here: they mean the same object could potentially expose both a backwards-compatible set of operations for legacy asyncio code, and also implement this hypothetical ABC we're talking about with the improved semantics. (And the new methods could skip straight to the desired semantics while the old ones are working their way through their deprecation periods.)

There is one big conflict though: iteration. Trio's byte stream interface allows async iteration and it returns arbitrary byte chunks. asyncio.Stream also supports async iteration, but it returns lines. IMO the Trio version is substantially more useful – we only added this recently and it made a lot of code simpler. But you can't support both on the same object at the same time.

We don't have to sort out all the details right now, but maybe for 3.8 you might consider replacing asyncio.Stream.__aiter__ with a generator method like asyncio.Stream.iterlines(), to keep your options open in the future?

@smurfix

Well, sometimes you really do need to kill off a connection Right Now, no matter what. I'd rename it to abort and deprecate close, though.

In Trio this is actually handled as part of the trio.abc.AsyncResource interface. It doesn't just tell you how to spell the close method name (aclose) but also specifies that it should normally do a graceful close, but if cancelled it must do a forceful close. This is necessary for general correctness, but it also means we don't need a separate method for abort – instead we have a helper trio.forceful_close that can do an abort on any kind of AsyncResource, by calling aclose and immediately cancelling it. The downside compared to your proposal is that it means forceful close is an async-colored operation, not sync-colored... but it turns out this actually has benefits, because for e.g. disk files, the OS doesn't provide any way to close it without blocking, so we actually need forceful close to be async-colored.

jtrakk commented 5 years ago

I had a similar reaction to @sorcio's about this bit:

Conceptually, byte-streams are basically object-streams where the object type is "a single byte". But if you try to use an interface designed for sending/receiving objects to send/receive individual bytes, then it'll be ridiculously inefficient, so instead you need a vectorized interface that works on whole bytestrings at once.

It seems like there's a generalization of BytesTube, which is

BytesTube = ChunkTube[Byte]

or

Bytes = Chunk[Byte]
BytesTube = Tube[Bytes]
njsmith commented 5 years ago

Capturing some comments from chat:

@dhirschfeld

Maybe Channel[T] is fine and Stream just needs to be renamed to ByteStream to prevent any confusion. It breaks the naming symmetry you have in the post but that does highlight that they're different things so ¯\(ツ)/¯?

@oakkitten

if there's a Stream[bytes] and ByteStream, that's even more confusing one day someone will want to subclass Stream[bytes] that makes bytes uppercase and they will name it UppercasingByteStream

My response: Surely they'd at least call it UppercasingBytesStream, emphasis on the "s" :-). Though it would probably make more sense to call it UppercasingStream?

@Fuyukai

I mean the stdlib isn't any better List[bytes] and bytearray comes to mind

My response: Huh, that's a great analogy. We should steal that for the docs if nothing else. Something like:

...well that leaves out the part where streams are bidirectional, not sure how to work that in. But this is obviously one of those things that's just intrinsically confusing the first time you encounter it, so we'll want to take the time and explain it 3 different ways and hope that at least one of them works. And this is a great addition to our quiver.

jtrakk commented 5 years ago

A few of these are interesting.

Plus,

glyph commented 5 years ago

For what it's worth, I don't love the idea of using Tube here, since it's hard enough to deal with the names of those abstractions without bumping into name conflicts all over the place :-).

glyph commented 5 years ago

(In tubes, for example, when Tube gets its mypy-ification, it'll be Tube[InT, OutT] because tubes do transformation in that context; the abstraction in tubes that looks like what I think Tube[T] here does is Fount[T].)

njsmith commented 5 years ago

@glyph The thing we're talking about here is what tubes calls a Flow. Though, in this architecture, there's no distinction between Flow and Tube – if you want to layer an OuterT flow on top of an InnerT flow, you just do:

class TransformedFlow(Flow[OuterT]):
    def __init__(self, inner: Flow[InnerT]):
       ...

outer_flow = TransformedFlow(inner_flow)

(There are unidirectional versions too of course, I'm just illustrating the general point.) It sort of reminds me of that famous line, "have you considered calling a function with arguments?". Async functions/methods enable a lot of simple patterns that didn't use to be possible.

Help us out though :-). I'm know you'd love to see the Python ecosystem develop standard, composable abstractions for this stuff that become popular and ubiquitous, way or another. The goals here are identical to tubes's goals. And even if you prefer the tubes approach, hopefully we can agree that it's possible that this approach is the one that will catch on, and we all want whatever does catch on to have some nice friendly naming. So what would you use?

1st1 commented 5 years ago

We don't have to sort out all the details right now, but maybe for 3.8 you might consider replacing asyncio.Stream.__aiter__ with a generator method like asyncio.Stream.iterlines(), to keep your options open in the future?

That's a bit too much work for us -- changing the semantics of .write()/.close() is already bad enough, as it requires users of asyncio subprocess API to update their code. Asking them to deal with yet another significant API change is not an option. :(

I have a counter proposal though -- why doesn't Trio implement the default __aiter__ to iterate over lines, and have a .iterbytes() or .iterchunks() to iterate over byte chunks? Or, perhaps, have no __aiter__ at all, and have .iterlines() and .iterchunks() -- that we can do in asyncio too.

njsmith commented 5 years ago

@1st1 I'm only talking about the new asyncio.Stream type. People already have to update their code to use it :-). (For that matter, can you fix write/close immediately on Stream?)

why doesn't Trio implement the default __aiter__ to iterate over lines

Because we don't have line-splitting functionality at all in these types – adding it would be a major change because it requires adding a buffer and an end-of-line scanning algorithm, and both of those are quite tricky because the naive algorithm is O(N**2). And the whole idea here is to separate those kinds of algorithms off into a single robust implementation instead of duplicating them in every class.

Or, perhaps, have no __aiter__ at all, and have .iterlines() and .iterchunks() -- that we can do in asyncio too.

That's technically feasible, and how it used to work in Trio, but it's unattractive because the __aiter__ saves so much boilerplate:

# what people used to have to write every time they used a Trio stream
while True:
    chunk = await stream.receive_some()
    if not chunk:
        # EOF
        break
    # ... handle chunk ...

# What they write now
async for chunk in stream:
    # ... handle chunk ...

It's not just the extra typing; it's that this particular boilerplate is especially error-prone and confusing to new users. It used to be that basically every time someone posted their first program in the Trio chat to ask for feedback, they had a bug in this loop. Also all our tutorial examples got substantially simpler.

gvanrossum commented 5 years ago

I guess I'm too old for this -- I don't understand why "stream of objects" would be more fundamental than "stream of bytes", and I thought that the original terms Stream and Channel[T] were exactly right.

njsmith commented 5 years ago

Hi Guido!

When I say "more fundamental", I mean it's a more general concept – "stream of bytes" is a special case of "stream of things". In an idealized theoretical world, what trio currently calls Stream wouldn't exist, and instead we'd represent a TCP connection as a Channel[Byte]. But doing a separate method call for every byte is way too inefficient, plus Python doesn't even have a Byte type, so Channel[Byte] is totally impractical. Stream exists as an optimized, special-case type to fill the gap left by Channel[Byte]. But it would still be nice if people could take their knowledge of "stream of things", and re-use that knowledge to understand "stream of bytes". I'm not 100% happy with the ByteChannel idea, but it feels better to be able to tell people "ByteChannel is the same as Channel[Byte], just more split off to be more efficient" versus giving them totally unrelated names.

This is a super difficult set of concepts to pin down, and there are at least 3 ways of splitting them up that are natural in some situations, and run into problems in others. (@glyph's tubes actually uses a different one again from what either of us are saying...) I've been struggling with this for the last years, and I think the approach I'm advocating is the one that leads to the least awkwardness overall, but I also totally see why it's not obvious which one is best!

But even if we kept Stream as the name for abstract byte-streams, we'd still have a problem, since the details of the asyncio.Stream interface aren't a good fit for a generic abstract interface, and it would be awkward if asyncio.Stream wasn't a Stream! I wasn't worrying about this when I came up with the original terms, but it turns out to be more important than I thought.

smurfix commented 5 years ago

but it feels better to be able to tell people "ByteChannel is the same as Channel[Byte], just more split off to be more efficient"

In other words, possibly-different-enough to get a separate name, and mayyybe different accessors (read_some for random-chunks-of-byte streams vs. get for discrete-object channels?). We already debated this in different issues which I'm not going to link to again, without arriving at a consensus, but the problem now comes to a head, with the 3.8 release and its accompanying new asyncio interfaces looming quite large (wait, what, we're at beta 4 already, when did that happen?? :-/ ).

I think the approach I'm advocating is the one that leads to the least awkwardness overall, but I also totally see why it's not obvious which one is best!

Seconded, for what that's worth, and with the above naming caveat.

But even if we kept Stream as the name for abstract byte-streams, we'd still have a problem, since the details of the asyncio.Stream interface aren't a good fit for a generic abstract interface

Exactly. If asyncio keeps this Stream class with all its non-basic-concept methods (SSL, line split, …) we need to name our "Stream" concept something else. And frankly, while I used BaseStream in my split-asyncio.Stream-up draft, that name is kindof too awkward for a fundamental concept.

njsmith commented 5 years ago

In other words, possibly-different-enough to get a separate name

I could be convinced to use a separate name. My big problem with Stream and Channel is that we could have just as easily named them Channel and Stream. If you have two related-but-different concepts, then the names should give you a clue about which is which!

asvetlov commented 5 years ago

The new asyncio.Stream is used in old subprocesses API. await asyncio.create_subprocess_shell() and await asyncio.create_subprocess_exec() returns a Process object that has stdin, stdout and stderr properties. These properties are asyncio.Stream instances.

That's why in the new stream design we carefully reproduce the existing old API methods.

smurfix commented 5 years ago

@njsmith Well, that's easy. A channel transports distinct sequential entities. Ships, in the real world. A stream (or maybe a flow) of water, on the other hand, doesn't have hard boundaries, you cut it off wherever convenient (e.g. when the bucket is full).

@asvetlov These properties could equally well be asyncio.Stream (or even asyncio.abc.Stream) subclasses. We could call the actual class of these attributes ProcessStream or whatever. There's no reason the method I outlined in https://github.com/python-trio/trio/issues/1208#issuecomment-531289966 won't work (or at least I don't see any at first+second glance).

gvanrossum commented 5 years ago

+1 on the naturalness of Stream and Channel. Naming ain’t easy, but I could not imagining swapping these two. I don’t think that saying a Stream is a Channel[byte] helps with understanding. Unless you are trying to prove something purely mathematical and abstract. But that’s not common in programming. -- --Guido (mobile)

asvetlov commented 5 years ago

@smurfix if you wnt to improve asyncio -- you are welcome! Please file an issue on https://bugs.python.org and prepare a Pull Request on https://github.com/python/cpython

smurfix commented 5 years ago

@asvetlov The challenge is not to code (or even document) this, that's easy, I can steal whole-sale from Trio.

The challenge is to convince my employer that this is worth enough $$$ to them to allow me to interrupt whatever else I am doing.

oremanj commented 5 years ago

I also think we should keep Stream and Channel with their current meanings. I admit that there's no ironclad confusion-free argument for them, but I think they're consistent with the preponderance of analogous uses that users are likely to be familiar with -- not "perfect", but "good enough". I'm worried we'd be focusing too much on the possibility for confusion and not enough on other relevant factors like

asvetlov commented 5 years ago

@smurfix thus please keep do what you are doing but don't prevent @1st1 and I do our (unpaid) work on CPython if you are unable to help but love to blame only.

njsmith commented 5 years ago

@smurfix

A channel transports distinct sequential entities. Ships, in the real world. A stream (or maybe a flow) of water, on the other hand, doesn't have hard boundaries, you cut it off wherever convenient (e.g. when the bucket is full).

@gvanrossum

I could not imagining swapping these two

@oremanj

I think they're consistent with the preponderance of analogous uses that users are likely to be familiar with

A tricky thing about names is that once you get used to them, they seem natural and obvious; but this doesn't necessarily make them natural and obvious to newcomers. I don't want to pick on @smurfix's comment because I've done the same thing many times, but it's a good example of how our minds come up with post-hoc rationalizations: channels also transport water, and streams are used to transport items on boats. I've also come to find the stream/channel names intuitively obvious, but I don't think intuition is a trustworthy guide here :-). So we should look for outside evidence, beyond intuition. I can think of two sources, and they both contradict the intuition:

First, while the stream/channel split is obvious to me now, it took me a long time to get there. I was talking about them regularly for months, and I kept mixing them up and saying the wrong thing. Maybe this is just because I'm a weirdo, and no-one else will have this problem :-). But in the absence of more evidence, I think the most likely conclusion is that probably some non-trivial percentage of people will find the two names obvious and natural, and some non-trivial percentage will struggle and be confused.

And then we have one other source of outside evidence, which is what other projects came up with when they're designing a system from scratch, before they've had a chance to train their intuition. Using my little survey above, and looking just at "stream" and "channel", we have:

Names for byte-streams:

Stream Channel
asyncio ✔️
Java ✔️ ✔️
Swift ✔️
Go
Rust
Nodejs ✔️

Names for object-streams:

Stream Channel
asyncio
Java
Swift
Go ✔️
Rust ✔️ ✔️
Nodejs ✔️

So in this tiny survey, when people are picking between these words for byte-streams, they pick "Stream" 60% of the time and "Channel" 40% of the time, and when they're trying to refer to object-streams, they pick "Stream" 50% of the time and "Channel" 50% of the time.

My conclusion: if we pick a random developer who hasn't already memorized our naming scheme, then they might be a little bit more likely to associate stream->bytes and channel->objects, but if so it's a pretty weak effect. Definitely we can't say "oh yeah, everyone will understand which is which, it's obvious".

Neither of these data sources are very compelling BTW – if anyone has other ideas for how we can get evidence, then please share :-).

But it might be all academic anyway, since right now it's not clear if Stream is even a viable option...

@oremanj

I'm worried we'd be focusing too much on the possibility for confusion and not enough on other relevant factors like

  • switching costs: many of our high-level networking APIs have stream in the name, for example
  • weirdness budget: many people have seen "stream" and "channel" in network/communications related contexts before, which is not true of most other terms we might adopt

Yeah, this is why I hesitated so long before opening this issue. The big factor that tipped me over is the interoperability issue. Within Trio, we can deal with switching costs if we have to – it'll be annoying and kinda suck for everyone, but it's a one-time thing, our users know that things are still unstable, and we can get it done. What I'm more worried about is a scenario where Trio and asyncio end up with different competing versions of streams, and then projects like HTTP clients and servers, and anyio, and asyncssh, and so on are stuck in a position where they have to choose between them. Then Trio loses, because fewer people will support Trio, and asyncio loses, because they miss out on the better design, and it could become this source of friction and inefficiency and fighting for everyone that just goes on and on indefinitely. And if the ABC-based system is fighting with asyncio over the meaning of Stream, then it will be hard to convince anyone that the ABC-based system is really a neutral, asyncio-friendly thing.

So to me the big question is: can we find some names that are acceptable to everyone and won't be a barrier to cross-project adoption. And then within that set, we should try to optimize for minimal confusion and minimal weirdness.

(And the other advantage of the X+ByteX naming scheme is that you only have to come up with one good non-conflicting name, instead of two!)

gvanrossum commented 5 years ago

I propose not to give "intuitive" too much weight. As a clever person once said, "the only intuitive UI is the nipple; everything else is learned." [Citation needed] Perhaps more relevant, in ABC, the language that was Python's predecessor, the designers were keen on coming up with new names for existing concepts (e.g. "procedure", "boolean expression" became "HOW'TO", "test"). But it turned out this was mostly a mistake, as users were either coming from another programming language with more traditional terminology, or were going to learn such a language.

IOW my recommendation is to give in to tradition (even if it's not unanimous), and use Stream and Channel[T]. Stream is coming from a long tradition (e.g. the libc manual uses the word without explanation, instead focusing on the confusion between streams and files inherent in the naming choices of the early C stdio library). Channel may be even older, originating in CSP (according to Wikipedia).

It's totally fine to say that a stream is a channel of bytes with extra stuff added (and most people will use the extra stuff without caring much about the channel operations). But ByteStream is just ugly.

Regarding asyncio, I think we should fix the mistakes I made originally with various bad APIs -- but I don't think choosing the name Stream was one of them.

oremanj commented 5 years ago

As discussed above, I don't see any reason why a future revision of asyncio.Stream couldn't conform backwards-compatibly to Trio's current Stream ABC, except for __aiter__. And I think __aiter__ is more easily resolved than was stated above:

Or, perhaps, have no __aiter__ at all, and have .iterlines() and .iterchunks() -- that we can do in asyncio too.

That's technically feasible, and how it used to work in Trio, but it's unattractive because the __aiter__ saves so much boilerplate:

It seems to me that async for chunk in stream.iterchunks(): is almost as good boilerplate-wise as async for chunk in stream:; slightly more typing, but also more explicit. I don't see why distinguishing iterlines from iterchunks would require us to go back to the while-True-receive-if-not-break idiom.

energizah commented 5 years ago
oremanj commented 5 years ago

It's obvious when a Stream[T] Is needed vs a ByteStream. When I look up "how do I send data in Trio?" and StackOverflow says Channel[T], I'll set about making a Channel[bytes] (oops!). Whereas if it says Stream[T], I'll think, "oh and they've made a convenience wrapper called ByteStream already, how nice."

But see... a Channel[bytes] is in fact a meaningful concept that you might want to use: it models a transport that preserves message boundaries, like a datagram socket or websocket. A stream does not preserve message boundaries, but many stream implementations look like they do until you put them under high load. I personally think the potential for confusion between ByteChannel and Channel[bytes] (or ByteStream and Stream[bytes], for that matter) is far greater than the potential for confusion between Stream and Channel[T].

smurfix commented 5 years ago

@oremanj +1

iterchunks vs. iterlines vs. __aiter__

As soon as we use line breaks for the default iterator we're back to demanding that the basic Stream class provides scanning for them, thus requiring buffering and thus more complex implementations when reading. We should avoid that.

Also, as soon as you break your stream into lines it's no longer a Stream. It's a Channel of lines. Thus, while we should keep the basic compatibility code, going forward the interface should perhaps be along the lines of

async for line in LineChannel(some_stream):
    …

@asvetlov Admittedly, I was called out for snarky comments by @njsmith, and I do apologize for that, it wasn't constructive.

smurfix commented 5 years ago

In any case, after taking a day off to hack on the code, I've got an initial split-up of it all up and running. Rough code, heaps of compatibility hooks to get the tests working, no docs yet, but tests pass – and Trio's SSL code works with next to no changes.

Available at github.com/smurfix/cpython, "streams" branch.. Will try to work on the code some more later this week.

njsmith commented 5 years ago

IOW my recommendation is to give in to tradition (even if it's not unanimous), and use Stream and Channel[T]. Stream is coming from a long tradition (e.g. the libc manual uses the word without explanation, instead focusing on the confusion between streams and files inherent in the naming choices of the early C stdio library). Channel may be even older, originating in CSP (according to Wikipedia).

There's definitely a strong tradition of using both "stream" and "channel" to refer to all kinds of incremental point-to-point communication. In fact I just checked the original CSP paper, and Hoare actually uses both "channel" and "stream" about equally often, and with no explanation for either – basically he uses "channel" when he's talking about a connection between two processes, and "stream" when he's talking about the data flowing over that connection. Here's some examples – in both these cases he says "stream", but it's taken for granted that the stream is represented as a CSP channel:

image

In a system like traditional asyncio, a generic name like "stream" is a good choice! There's only one thing it could refer to, so there's no confusion. Ditto for using "channel" in CSP or Go.

But, when we put two different kinds of point-to-point communication in the same system, and try to distinguish them using two words that both refer generically to point-to-point communication, then that's where my problem starts, because now we have to change the meaning: we're redefining "stream" to mean "NOT objects", and "channel" to mean "NOT bytes" [1]. As far as I can tell, Trio is the only system that's ever done this – it's novel. And it contradicts a lot of common usage.

In Trio right now, Hoare's clear, ordinary-looking language is just... wrong. When we talk about SSH channels, we have to be careful to clarify that these channels aren't channels. And we can't say things like "WebSocket enables streams of messages on top of TCP", or "TCP is the protocol that guarantees we can have a reliable communication channel over an unreliable network" – Trio's use of stream/channel turns these ordinary sentences into something completely misleading, and this is already one of the places where new users get the most confused.

[1] This is a classic thing in human language – word meaning is all about contrasts, and when you add new words, the existing words shift their meanings around to make room.

A third way?

That said, I can also see that no-one's convinced by my idea of distinguishing ByteStream and Stream[bytes] :-). So let's step back. AFAIK there are basically three options for how to break up these concepts:

That last option is seen in nodejs, and in @glyph's current tubes work; in trio-land we had a whole thread on it in #959. We rejected it back then, but maybe it's time to revisit.

The major challenge is that framed and unframed streams have really different behavior, so you need some way to keep track of that distinction, communicate it to users, make sure you don't accidentally pass an unframed Stream[bytes] into code that expects a framed Stream[bytes], etc.

@glyph's way of handling this is to define two different subtypes of bytes, to represent framed and unframed data – so in his notation, an unframed stream is a Flow[Segment], and a framed stream is a Flow[Frame]. Currently this is accomplished via some kind of zope.interface-based magic, but in the future he's planning to switch to using PEP 484 NewType to make Segment and Frame into static-only subtypes of bytes.

This doesn't seem super appealing to me for two reasons. First, you don't get any runtime checking. Beginners are the ones most likely to mix these up, and also the least likely to have fancy stuff like mypy set up. And second, even if you do have mypy running, I think to make it work you'd have to write explicit casts everywhere whenever you wrote any networking code, which seems annoying.

Here's another idea, that's not fully baked but I think might be promising. What if we define a generic base class, and define two empty sub-interfaces to use as markers for the two confusable variants:

class Stream(Generic[T]):
    # ...

class FramedByteStream(Stream[bytes]):
    pass

class UnframedByteStream(Stream[bytes]):
    pass

We can bikeshed the names of course. These ones are kind of wordy... but that seems OK, because these are abstract types that only show up in documentation, type signatures, and isinstance checks – i.e., they're optional, and you only use them when you want to go out of your way to be explicit:

class LengthPrefixFramer(FramedByteStream):
    "Converts an `UnframedByteStream` into a `FramedByteStream`."  # <-- docs
    def __init__(self, transport: UnframedByteStream):  # <-- static type checks
        if not isinstance(transport, UnframedByteStream):  # <-- runtime type checks
            raise TypeError
        # ...

The rest of the time you get to use both types with regular bytes objects (or bytearray, etc.) and no extra ceremony or runtime cost. And asyncio streams get to stay as streams without creating any terminology conflicts. And it's even convenient for @glyph if he wants to hook up his tubes infrastructure to the new ABCs.

What do y'all think?

Other stray thoughts

An oddity is that according to the Liskov substitution principle, FramedByteStream ought to be a subtype of UnframedByteStream, because framed byte streams make strictly stronger guarantees than unframed ones. In practice, I don't think we need to care about this – if someone really wants to do something weird like run TLS on top of a Websocket, they can always force it through some casting, and the rest of the time treating them as separate types will help avoid silly mistakes.

In this approach, trio.StapledStream, which staples together two unidirectional streams to make a bidirectional one, automatically works for both framed and unframed streams, which is nice – otherwise we'd need some boilerplate code to define StapledStream and StapledChannel separately. But, there is some question of how to track the framed/unframed distinction – a StapledStream object is a FramedByteStream if both of the underlying streams are FramedByteStreams, and an UnframedByteStream if both of the underlying streams are UnframedByteStreams. I guess we could make isinstance work by adding a special case to __subclasscheck__, or playing some games with StapledStream.__new__. Propagating this statically isn't really doable with Python's current static types, AFAIK.

I also thought about having .framed attribute on Streams, that's only meaningful for byte-streams. But that doesn't let us express the framed/unframed distinction statically, and it feels like we want to be able to express it in type signatures.

njsmith commented 5 years ago

Hmm, here's another limitation of the Stream[T]/FramedByteStream/UnframedByteStream idea: it makes it difficult to express a generic message transport. E.g. for passing messages between tasks within a single process, Trio has a MemoryChannel[T]. Since it's within-process, you can pass objects directly, so T can be any Python type. And that means that you can have MemoryChannel[bytes]. Obviously this should count as a framed transport. But in the Stream[T]/FramedByteStream/UnframedByteStream approach, it doesn't, and I don't know how to fix that in a way that would make mypy happy. (Well, except for writing a plugin to special-case this, but that's not very satisfying.)

I guess the cause is that in this design, unframed streams are the odd-one-out – they don't quite behave like generic streams. One way to patch over this case would be to drop FramedByteStream, so we just have Stream[T] + UnframedByteStream. And at runtime, we could still catch the same errors, by replacing checks like if not isinstance(obj, FramedByteStream): raise TypeError into if isinstance(obj, UnframedByteStream): raise TypeError. But... there's no way for a type signature to say "I don't want an UnframedByteStream" – you can write somearg: Stream[bytes], but that allows UnframedByteStream through. And that's exactly the main mistake that we want to prevent, so we want mypy to be able to catch it, so this patch doesn't seem very attractive.

Another idea: we could push the framed/unframed distinction up into the generic layer, so instead of just Stream[T], there's UnframedStream[T] and FramedStream[T]. Then we could make MemoryChannel[T] a FramedStream[T]. But this is pretty weird, because in practice there is no generic UnframedStream[T], there's only UnframedStream[bytes], and now we've looped back around to the ByteStream/Stream[T] design that we were trying to find an alternative to in the first place.

What if we made it ByteStream and MessageStream[T]? MessageStream[bytes] makes it a lot more explicit that this is a framed transport than Stream[bytes], which I think addresses some of the concerns that folks have raised about my original proposal. And the names are a bit wordy, sure, but now we can say "stream" when it's obvious from context which kind of stream is meant, without risking giving the wrong idea, and you only need to get wordy when you want to be explicit about which kind of stream you mean. And Python even has a tradition for making protocol names a bit wordy – think of Mapping[K, V] versus dict.

smurfix commented 5 years ago

What if we made it ByteStream and MessageStream[T]?

You could also have a [Unicode]CharacterStream … which is even wordier. Or an IntegerStream or a FloatStream, which means that we're right back where we started. Well … except for the fact that ByteStream is special – because our bytes is really a Tuple[bytes].

Also, Mapping might be a long word, but at least it's just one word. Simple (or not-so-simple-but-somewhat-ubiquitous) concepts should be designated with simple (or at least single) words. When I read code I want names that can be recognized at one glance, not three. (Writing is easier: the editor has macros for that. :-P )

Let's face it, there do not seem to be any choices that are (a) intuitively obvious without explanation (b) used the same way in some sort of majority of other languages (c) short enough to not be too verbose. Thus I move that we go with two simple words, i.e. we follow Guido's recommendation and use Channel (single objects) vs Stream (multiple instances of a primitive like byte, without framing). At least for these we can come up with a mnemonic or two, to help people who are new to this, to distinguish them. They'll quickly become intuitive anyway.

Thus my hypothetical audio processor would have a bunch of Stream[float] instances for the filters to talk to each other. One or two of these streams would get encoded into a Channel[OpusFrame], which ends up in a Stream[byte] (or a ByteStream because, well, bytes are somewhat special) that's sent to the client.

Frankly, I'd rather not discuss the names of these concepts for another week or two. (Which is not to say that the discussion so far has been unhelpful. Quite the opposite. But there's diminishing returns …) I'd rather start write a bunch of helper classes to translate between different types of streams and/or channels which happen to work with both Trio and asyncio.

Before getting there, we also have another naming problem, which is read/write/close vs. receive_some/send_all/aclose (for streams) and possibly read/write vs. receive/send vs. get/put (for channels; also, is there a close, or do I send None to signal The End?). While wrapper classes to translate one to the other are somewhat trivial to write, having to deal with these wrappers at all, indefinitely, is really really annoying IMHO.

smurfix commented 5 years ago

… no takers? Everybody afraid to step on somebody's toes / strong opinions WRT these names? (I would be, if I were to be the one to decide on them … but somebody will have to.)

Alternately we could simply agree to disagree and use a translator.

As an example, the code that adapts Trio's SSL stream to an asyncio Stream which I used as a proof-of-concept in my asyncio.stream split-up exercise basically looks like this:

# common code in asyncio.streams or wherever
class _TrioWrap(AbstractStreamModifier):
    """a wrapper to translate trio calls to asyncio"""
    def __init__(self, wrapped): self._wrapped = wrapped
    async def send_all(self,data): await self._wrapped.write(data)
    async def receive_some(self,n=4096): return await self._wrapped.read(n)
    async def aclose(self): await self._wrapped.close()

class _TrioUnwrap:
    """A mixin to translate asyncio calls to trio names"""
    async def write(self, data): await self.send_all(data)
    async def read(self, n=None): return await self.receive_some(n)
    async def close(self): await self.aclose()

def wrap_for_asyncio(TrioStream):
    class WrappedStream(_TrioUnwrap, AbstractStreamModifier, TrioStream):
        """This class wraps a Trio SSLStream to run on asyncio"""

        def __init__(self, transport_stream, *a, **kw):
            super().__init__(lower_stream=transport_stream)
           TrioStream.__init__(self, _TrioWrap(transport_stream), *a, **kw)
    WrappedStream.__name__ = TrioStream.__name__ + "_asyncio"
    return WrappedStream

# asyncio application / library code
AsyncioSSLStream = wrap_for_asyncio(Trio_SSLStream)

async def communicate(…):
    async with connect(…) as my_stream:
        my_stream = AsyncioSSLStream(my_stream, SSLContext(…), …)
        await my_stream.write(b'Hello!\r\n')

… and while that's basically it – you now have an encrypted my_stream, you do not need to add any SSL-related options to each and every Streamish class out there – the overhead doesn't strike me as particularly attractive. Nor does the semantic mismatch for read(no_argument) (and there probbly are others), which this code doesn't yet(?) try to paper over.

1st1 commented 5 years ago

… no takers? Everybody afraid to step on somebody's toes / strong opinions WRT these names?

No, not really. Since we're reverting the new streams api from 3.8 we'll start a new discussion on what's the new api should look like for asyncio 3.9.

Future compatibility with Trio is desired so I suggest not to settle on any design before we have a chance to discuss it with asyncio devs.

I'll start a thread sometime next week, right now we're busy releasing RC1. Not saying this to discourage the discussion here, quite the contrary it should continue! Just a heads up explaining why Andrew and I are quiet for now.

njsmith commented 5 years ago

A bunch of discussion about asyncio.Stream happened over here: https://bugs.python.org/issue38242 To summarize the parts that are relevant to this thread:

@smurfix

Thus I move that we go with two simple words, i.e. we follow Guido's recommendation and use Channel (single objects) vs Stream (multiple instances of a primitive like byte, without framing). At least for these we can come up with a mnemonic or two, to help people who are new to this, to distinguish them. They'll quickly become intuitive anyway.

That's the problem though. Everything you're saying seemed plausible a priori, but then we actually did the experiment, and it didn't become intuitive for me. I'm still willing to hear arguments for this approach if folks have them, but it's a very inconvenient fact, and I'd appreciate it if folks arguing for Stream/Channel engage could with that somehow, instead of acting like it never happened.

That said: Having slept on it for a few weeks, I'm still very happy with the ByteStream/MessageStream[T] idea; much more so than my original proposal that I made at the start of the thread. Thanks to the discussion here, I feel like I finally managed to articulate why I have so much trouble with the Stream/Channel approach: the way that when someone says "stream" speaking casually, it becomes confusing whether they mean "specifically bytes, not objects" or "any kind of stream". And that also showed that my ByteStream/Stream[T] doesn't fully solve the problem, because with that when someone says "stream" they might mean "specifically objects, not bytes", and you can't tell.

So the advantages of ByteStream/MessageStream[T] are:

The one downside is that it's a little more verbose than other options, but given that these are abstract types that you only need to spell out in full in contexts where you want the extra precision, I think we can live with that. All the options we've come up with require some compromises, and I'm much more comfortable with that compromise than the alternatives.

If folks are comfortable moving forward with that, then I think the next steps will be:

gvanrossum commented 5 years ago

I really don't want to be the decider here, and I don't have any evidence beyond my own feelings. I also don't recall the "experiment" you have already done -- if it was so compelling, would you mind providing a reference? (I do recall reading about it before, but I don't recall where, and I haven't kept track of this discussion since the last time it flared up.)

njsmith commented 5 years ago

@gvanrossum No worries! We all value your thoughts, but there's definitely no obligation to make a pronouncement or anything. We'll figure it out :-).

And I was speaking a bit loosely with the "experiment". I don't mean we herded a bunch of undergrads into cubicles and made them answer questionnaires, but rather that we've actually been shipping the Stream/Channel names in Trio for a full year now. We went with it based on arguments like the ones we've seen here, and I've tried to train myself to use the terms consistently, but the fact is that even now I still find them awkward and unintuitive. So when someone argues "don't worry, it'll quickly become intuitive", well, if it was going to happen quickly it should have happened by now! So I just can't find that argument convincing, no matter how much I want to believe.

gvanrossum commented 5 years ago

So it's your word against mine, really. :-) -- --Guido (mobile)