swift-server / async-http-client

HTTP client library built on SwiftNIO
https://swiftpackageindex.com/swift-server/async-http-client/main/documentation/asynchttpclient
Apache License 2.0
927 stars 117 forks source link

`HTTPClient.execute(...) async throws` API violates Structured Concurrency #752

Open weissi opened 4 months ago

weissi commented 4 months ago

Structured Concurrency mandates that upon the return of a function any background work it may have started has completed.

With HTTPClient.execute(...) async throws that's clearly not the case as it returns a half-finished HTTPClientResponse where the response.body yet has to arrive.

FranzBusch commented 4 months ago

Yes, I brought this up as well. The async APIs are not structured concurrency "conform". We spawn an unstructured Task inside them and that task is "driving" the request. This leads to exactly this problem. We need to provide a HTTPClient.execute(request: body:) scoped method which we can have overloads for that just accumulate the whole body into a single buffer (with size limits for security)

weissi commented 4 months ago

Exactly, they should be called with... for consistency but httpClient.withExecutingRequest(...) { response in for try await chunk in response.body { ... } } sounds fine

fabianfett commented 4 months ago

I'm not sure I agree here. In structured concurrency multiple ways can lead to Rome! The question is how do we structure the concurrency:

In AHC the execute function aligns from request head to response head. If you cancel the execute the request is cancelled. If you drop the response body the request is cancelled and the body won't be loaded. The same is true if you drop the AsyncIterator eventually. So I think AHC is totally structured. It might be structured differently than what you expect but it is structured.

I can totally see that an additional with style API can make sense for streaming request and response bodies at the same time. But I'm not convinced we should make this the default.

FranzBusch commented 4 months ago

In AHC the execute function aligns from request head to response head. If you cancel the execute the request is cancelled. If you drop the response body the request is cancelled and the body won't be loaded. The same is true if you drop the AsyncIterator eventually. So I think AHC is totally structured. It might be structured differently than what you expect but it is structured.

So I think we have two orthogonal problems here. Currently AHC spawns an unstructured task in the async APIs and that leads to http request to continue running in the background even though you might assume it is done. The other problem is being able to strictly tell when an http request is done. When using a with style API you can 100% guarantee when that is done. When returning an AsyncSequence it might or might not be alive until the sequence is dropped or fully consumed.

Now we can make AsyncSequence returning APIs work by using types such as the NIOAsyncSequenceProducer to communicate between the sync/ELF world and the async world. In fact we have already in postgres-nio and NIO itself.

AsyncSequence has its own problems though it is modelled as being a blueprint and you instantiate the real production once you call makeAsyncIterator. This just isn't true for what we are doing here or in most other cases. When we create the AsyncSequence we are already executing the request in most cases so in the end we could live with just having an iterator and not the sequence. That sadly doesn't compose.

Fundamentally, AHC is offering a bi-directional streaming API. This itself requires a with style approach to couple the outbound and inbound lifetime together. Something like the code below shows a potential pseudocode fundamental API.

client.execute(request: HTTPRequest) { outboundWriter, inboundStream in
  try await outboundWriter.writeBody(data)
  try await outboundWriter.writeEnd(trailers)
  for try await inbound in inboundStream {
     ..
  }
}

Now we can offer convenience APIs on top of this. One that just takes a fixed outbound body in terms of some Sequence<UInt8> or some AsyncSequence<some Sequence<UInt8>, any Error>. Similar for the output. I would recommend taking a look at what gRPC did here https://github.com/grpc/grpc-swift/blob/main/Sources/GRPC/Docs.docc/Proposals/0001-stub-api.md

weissi commented 4 months ago

I'm not sure I agree here. In structured concurrency multiple ways can lead to Rome! The question is how do we structure the concurrency:

Not quite. Structured Concurrency doesn't mean you get to pick the structure, the structure of your code (the { and }s).

Quotes from

Wikipedia on Structured Concurrency

The core concept is the encapsulation of concurrent threads of execution (here encompassing kernel and userland threads and processes) by way of control flow constructs that have clear entry and exit points and that ensure all spawned threads have completed before exit.

SE-0304:

[...] child-tasks and respected the primary rule of structured concurrency: that a child task cannot live longer than the parent task (or scope) in which it was created.

So the one thing that mustn't happen is that you pass your scope's } and still have stuff running. Very clearly, that's exactly what execute is currently doing.

In AHC the execute function aligns from request head to response head. If you cancel the execute the request is cancelled. If you drop the response body the request is cancelled and the body won't be loaded. The same is true if you drop the AsyncIterator eventually. So I think AHC is totally structured. It might be structured differently than what you expect but it is structured.

It may have a structure but it's not the structure that Structured Programming and Structured Concurrency mandate.

All the bad stuff that's supposed to be impossible with Structured Concurrency can happen with execute.

For example:

For example if you do

var baddieGlobalResponseBag: NIOLockedValueBox<[HTTPClientResponse]> = .init([])

func myFunction() async throws {
    let response = try await HTTPClient.shared.execute(...)
    baddieGlobalResponseBag.withLockedValue { bag in
        bag.append(response)
    }
    return // This return not wait for the work, it becomes uncancellable too
}

And yes, my program is stupid and wrong. But the thing about structured concurrency is that this shouldn't matter. The compelling thing is that I'm meant to get a 'guarantee' that when I returned I will have cleaned up (or still waiting for it), HTTPClient.execute breaks this.

If you will, you can liken this to structured programming. If you write

while true {
   print("hello")
   return
}
print("bye")

then you know that it'll print hello once and then bye. You don't need to read some docs to see if maybe the loop continues after the function returned.

And this is despite the fact that the loop condition (true) never goes false and despite the fact that I never break the loop. The loop still stops. Structured Concurrency is meant to work the same way.

The loop continuing after you hit the enclosing } is just not possible because in Structured Programming the control flow follows the structure of your code (the { and }s).

FranzBusch commented 4 months ago

So the one thing that mustn't happen is that you pass your scope's } and still have stuff running. Very clearly, that's exactly what execute is currently doing.

I am not disagreeing just want to add to this. One of the intrinsic problems that we have here is the fact that we have the EL that is capable doing work outside of our task. This in combination with tying the request lifetime to the lifetime of the AsyncSequence leads us to this problem. I agree that if we would be using an explicit with style approach here we wouldn't have that and it becomes very clear when the request ends.

Now what I don't want to say is that it has to be fully structured all the way to the syscall. That's not achievable IMO and also not something we should aim for. The way I have been thinking about clients in Concurrency lately is the following. Each client has a pool of connections (might be 1). Each connection lives in a separate child task originating from a run method. This assumes that the client in question has connections that are reused for more than 1 request. Now if another task comes along and wants to issue a request on the client we have to communicate between the child task that owns a connection and the task that wants to execute the request. We can do this by passing messages between the two tasks using asynchronous sequences.

I really like this documentation from Tokio and think we should have something similar for Swift https://tokio.rs/tokio/tutorial/channels

weissi commented 4 months ago

I am not disagreeing just want to add to this. One of the intrinsic problems that we have here is the fact that we have the EL that is capable doing work outside of our task.

I don't see the problem. The EL, dispatch, any thread or any Swift Concurrency executor is able to enqueue work. The job of an async API is to turn these low-level tools into a structured API.

This in combination with tying the request lifetime to the lifetime of the AsyncSequence leads us to this problem. I agree that if we would be using an explicit with style approach here we wouldn't have that and it becomes very clear when the request ends.

We have a solution and it's with, I understand that AsyncSequence adds issues but here I think they're moot.

Now what I don't want to say is that it has to be fully structured all the way to the syscall. That's not achievable IMO and also not something we should aim for.

Exactly right. This is impossible and undesirable. We do also need escape hatches but much like Unsafe they should be implementation details and not leak into the API like with execute. It's not a problem that Array, Dictionary and ByteBuffer are all implemented with Unsafe. The key thing is that the API is safe.

Each client has a pool of connections (might be 1). Each connection lives in a separate child task originating from a run method. This assumes that the client in question has connections that are reused for more than 1 request. Now if another task comes along and wants to issue a request on the client we have to communicate between the child task that owns a connection and the task that wants to execute the request. We can do this by passing messages between the two tasks using asynchronous sequences.

I disagree. I dislike lateral communication between sibling tasks. Yes, technically it fulfils the requirements of structured concurrency because the run() is running. But really it's still a violation of the principle and it causes trouble with cancellation.

In my opinion, passing messages laterally is awkward, skating on the edge of what the Structured Concurrency model 'allows' and will be slower.

Of course, I understand that connection pools aren't implementable sensibly unless you keep stuff around for longer than strictly necessary but I think this is okay because it's unobservable that a sibling or parent tasks takes ownership of a pre-existing task.

Instead, I think the right model is an explicit ownership transfer from connection pool to child task when a connection is required. And from child task back to connection pool when the connection is no longer required and can be reused later. So yes, there is still lateral communication but strictly this way:

  1. Give me connection/stream
  2. Do work without lateral communication
  3. return connection/stream

I really like this documentation from Tokio and think we should have something similar for Swift https://tokio.rs/tokio/tutorial/channels

Yes, Tokio has pretty decent docs and it'd be lovely to have something comparable for all things Swift.

FranzBusch commented 4 months ago

Instead, I think the right model is an explicit ownership transfer from connection pool to child task when a connection is required. And from child task back to connection pool when the connection is no longer required and can be reused later. So yes, there is still lateral communication but strictly this way:

  1. Give me connection/stream
  2. Do work without lateral communication
  3. return connection/stream

I have been thinking about the lateral communication vs ownership transfer a lot recently and I am not sure if ownership transfer is going to be working in the future due to us wanting to adopt ~Escapable. Let's assume for a second that we have some API that gives you access to a resource like a connection in a scope. Now we really want to make that resource as ~Escapable since it should not be escaped outside the with closure.

struct Connection: ~Copyable, ~Escapable {}
func withConnection(_ body: (Connection) async -> Void) async {}

Now our connection pool would create those connections inside a child task originating from the run method

final class SomeClient {
  func run() async throws {
      await withTaskGroup() { group in
          group.addTask {
            await withConnection { connection in
               // The connection can only live inside this task and we cannot escape it
            }
          }
      }
  }
}

Now if we want to send or receive data we need to communicate into that child task and we cannot transfer the ownership as far as I understand.

weissi commented 4 months ago

I have been thinking about the lateral communication vs ownership transfer a lot recently and I am not sure if ownership transfer is going to be working in the future due to us wanting to adopt ~Escapable.

Wait, wait, who would want to adopt ~Escapable for connections? That doesn't make any sense to me. That makes them useless.

non-escaping (probably) makes sense for stuff like iterators. Stuff that's quick to create/discard, usually wrapping something else.

But doing lateral communication in order to save a few ARCs feels backwards. You'd be trading something super fast (ARC, maybe 2ns per retain/release pair) for something awkard, complex and much slower that's also borderline at odds with the structured concurrency model (lateral comms).

FranzBusch commented 4 months ago

Wait, wait, who would want to adopt ~Escapable for connections? That doesn't make any sense to me. That makes them useless.

non-escaping (probably) makes sense for stuff like iterators. Stuff that's quick to create/discard, usually wrapping something else.

IMO anything that we currently provide in a with style scope makes sense to adopt ~Escapable to get exactly those guarantees enforced by the compiler that we currently only enforce by API shape. ~Escapable will still allow to pass the types into structured child tasks just not outside of the current task. Of course this is mostly hypothetical at this point.

But doing lateral communication in order to save a few ARCs feels backwards. You'd be trading something super fast (ARC, maybe 2ns per retain/release pair) for something awkard, complex and much slower that's also borderline at odds with the structured concurrency model (lateral comms).

I agree that we can transfer ownership for the period of a request for performance reasons; however, lateral communication will always exist in any system. Different tasks have to communicate with each other in larger applications. Doesn't mean we have to use for a connection pool but I still expect it to be a common pattern between tasks and not violate structured concurrency.

weissi commented 4 months ago

IMO anything that we currently provide in a with style scope makes sense to adopt ~Escapable to get exactly those guarantees enforced by the compiler that we currently only enforce by API shape. ~Escapable will still allow to pass the types into structured child tasks just not outside of the current task. Of course this is mostly hypothetical at this point.

Exactly, superficially, this looks possible but is hypothetical. And it may be possible with little wrapper types around the actual, underlying types. As in, there might be APIs that wrap their regular InternalConnection with a struct Connection: ~Escapable. Maybe that enables some features, who knows. I doubt this will actually happen because it's just more awkward for the users. Regardless of if we are going to actually provide ~Escapable types to the users I'm really very sure that we won't use them internally in the implementation. So the transfer in and out of sibling tasks shouldn't and will not be an issue.

But doing lateral communication in order to save a few ARCs feels backwards. You'd be trading something super fast (ARC, maybe 2ns per retain/release pair) for something awkard, complex and much slower that's also borderline at odds with the structured concurrency model (lateral comms).

I agree that we can transfer ownership for the period of a request for performance reasons; however, lateral communication will always exist in any system.

Agree, it will exist but it's awkward, error-prone (and slow). So let's strive to use it as little as possible treating it almost like a half-violation of Structured Concurrency. To me a good litmus test is if it's observable that we did lateral comms. If yes: bad, if no: okay.

Take a connection pool: Yes, it needs lateral communication but a connection pool is an optimisation. We could (of course wouldn't) not have a connection pool and then we could manage connections without lateral comms. This to me is acceptable because the model itself is sound, we just bend the rules a little to get a massive perf boost.

Different tasks have to communicate with each other in larger applications. Doesn't mean we have to use for a connection pool but I still expect it to be a common pattern between tasks and not violate structured concurrency.

Yes, we have also made that observation. And in my experience that's where the bugs come in. I'm pretty sure that 9 out of 10 use cases for lateral comms can be dealt with with a handful of helpers. One of the examples that we commonly have: One task accepts a connection and handles another resource through the connection.

Where the problem comes in: What if the connection disappears? For many resources just tearing it down is the correct and easy to implement answer. But for other resources you want a client to be able to re-connect within a grace period (maybe 1 minute) such that network glitches don't immediately require you to redo everything. I'm thinking this can be handled the following way: The original task notices the client going away. It now moves the open connection into some special 'connection parking lot' type (this is lateral communication) which makes sure that any connection parked is either picked up within 1 minute or it'll tear it down. Once again, this is pretty much a performance optimisation and I think it's okay.

At least in my experience there are only so many (good) reasons to communicate laterally and I'm confident that we'll build a few helper types that take the awkwardness away. (very) Pseudo code with the parking lot could look like this:

func newConnectionHandlerFactory(...) async throws { ... }

ParkingLot(gracePeriod: 1min).withParkingLot { connectionParkingLog in
    for newConnection in try await server.connections {
        try await connectionParkingLog.withParkedOrNewConnectionHandler(newConnFactory: newConnectionHandlerFactory) { handler, connection in
            try await handler.attachAndHandle(connection)
        }
    }
}

Key thing here is that withParkedOrNewConnectionHandler either creates a new handler and makes that handle the new connection. Or it unparks an existing handler (from a previously lost connection) and 'attaches' the connection to that.

Of course, there's some lateral comms going on here but it's entirely handled by the "parking lot" thing. That is easily testable and a re-usable component.

And the first implementation of the parking lot could just be to always pop out a new handler. That also fulfils the litmus test of the model working without lateral comms too (just slower and more wasteful).

FranzBusch commented 3 months ago

I thought about this some more on the weekend and remembered why so far I decided against scoped ownership transfer. It was due to how cancellation works.

The problem that I encountered was that the connection's inbound and outbound both implemented cancellation by basically closing the connection or rather making the connection unusable. Especially the inbound async sequences are terminating when the consuming task got cancelled. Now that's something we can change but it is an important caveat when sharing ownership. It might also not always be the case that you can modify the underlying asynchronous interfaces that are shared in a way that cancellation of the task doesn't close the underlying resource.

weissi commented 3 months ago

I thought about this some more on the weekend and remembered why so far I decided against scoped ownership transfer. It was due to how cancellation works.

The problem that I encountered was that the connection's inbound and outbound both implemented cancellation by basically closing the connection or rather making the connection unusable. Especially the inbound async sequences are terminating when the consuming task got cancelled. Now that's something we can change but it is an important caveat when sharing ownership. It might also not always be the case that you can modify the underlying asynchronous interfaces that are shared in a way that cancellation of the task doesn't close the underlying resource.

That's a valid concern but that's an implementation issue that's easily fixed. Either by changing it or by wrapping the transferred thing with something that creates its own cancellation scope.

Remember, the connection pool is an optimisation, the model should be consistent with how it'd work if we created (and closed) a connection per scope. And ofc, lateral comms make things slower than necessary and we're already in a bit of a pickle (#756).

FranzBusch commented 3 months ago

I agree that this is something the underlying async sequence can change; however, I don't think you can wrap it. The problem is if the user's task gets cancelled and you are currently suspended on a next call even if you create your own cancellation scope you have to cancel that next call. We just need to make sure that the underlying building blocks do the right thing.

weissi commented 3 months ago

I agree that this is something the underlying async sequence can change; however, I don't think you can wrap it. The problem is if the user's task gets cancelled and you are currently suspended on a next call even if you create your own cancellation scope you have to cancel that next call. We just need to make sure that the underlying building blocks do the right thing.

I think it's possible by wrapping (such that you get to make the cancel in next call hand back a connection instead of closing it) but I agree that this must be a wholesome effort. It's actually quite complex, for example HTTP/1 connections will likely not be reusable and should be closed (or else we need to drain some remaining bytes which might take forever).

But don't get me wrong, I'm not saying it's easy. I think it's pretty difficult and we'll need a bunch of rounds of experimentation before getting it right

FranzBusch commented 3 months ago

I think it's possible by wrapping (such that you get to make the cancel in next call hand back a connection instead of closing it) but I agree that this must be a wholesome effort. It's actually quite complex, for example HTTP/1 connections will likely not be reusable and should be closed (or else we need to drain some remaining bytes which might take forever).

Yes I agree. I think generally if a task gets cancelled while using a borrowed resource such as an H1 connection. The default is that we have close and throw away that resource. There might be some optimisations where we can keep of the state to determine that even in the event of cancellation the resource was left in a reusable state.

weissi commented 3 months ago

I think it's possible by wrapping (such that you get to make the cancel in next call hand back a connection instead of closing it) but I agree that this must be a wholesome effort. It's actually quite complex, for example HTTP/1 connections will likely not be reusable and should be closed (or else we need to drain some remaining bytes which might take forever).

Yes I agree. I think generally if a task gets cancelled while using a borrowed resource such as an H1 connection. The default is that we have close and throw away that resource. There might be some optimisations where we can keep of the state to determine that even in the event of cancellation the resource was left in a reusable state.

100% yes, as an optimisation. And I believe all of this should be possible.

But I do acknowledge that if the framework adds zero extra code to compensate, we'll kill the resource. And that's a feature IMHO: If the framework didn't think about it we must not reuse. Imagine a stupid HTTP framework accidentally returning a HTTP connection that's currently request/response body streaming into the connection pool. That would potentially be a security vulnerability. tl;dr: default: kill all resources, optimisation: potentially return resources is good