lihaoyi / test

0 stars 0 forks source link

Replace `java.net.HTTPUrlConnection` with `java.net.http.HttpClient` (500USD Bounty) #655

Open lihaoyi opened 1 month ago

lihaoyi commented 1 month ago

This would fix https://github.com/com-lihaoyi/requests-scala/issues/123, and put us in a better state going forward.

HttpClient was standardized on Java 11, which is probably old enough we can rely on it already

HttpClient provides an async API, and we should expose it as requests.get.async, requests.post.async, etc.. Apart from that, the rest of the API and test suite should be unchanged. All existing tests should pass, with any unavoidable failures or necessary changes in the tests called out in the ticket

This should also resolve https://github.com/com-lihaoyi/requests-scala/issues/123

To incentivize contribution, I'm putting a 500USD bounty on resolving this ticket. This is payable via bank transfer, and at my discretion in case of ambiguity.

ID: 150 Original Author: lihaoyi

lihaoyi commented 1 month ago

I'll take a stab at this now Original Author:nafg

lihaoyi commented 1 month ago

HTTPUrlConnection has you write the request body by exposing an OutputStream, and RequestBlob is designed around this.

OTOH with HttpClient you pass a BodyPublisher, which is an abstraction analogous to RequestBlob other than the headers, that has helpers given you have a byte array or InputStream or other common cases. Otherwise it's based on java.util.concurrent.Flow.Publisher so you can wrap one of those or extend it.

Can I change the RequestBlob API, or do we have to work with it? In particular I'm not sure how to adapt it to a BodyPublisher without either loading it all into memory or starting another thread.

TBH I'm not sure why it has to take something as powerful as Publisher which can have multiple subscribers.

Original Author:nafg

lihaoyi commented 1 month ago

@nafg does HttpResponse.BodyHandlers.ofInputStream do what we want? At at glance it seems to do the right thing - streaming the data and making it available to whoever reads the stream as it downloads. Original Author:lihaoyi

lihaoyi commented 1 month ago

@nafg does HttpResponse.BodyHandlers.ofInputStream do what we want? At at glance it seems to do the right thing - streaming the data and making it available to whoever reads the stream as it downloads.

I guess you mean BodyPublisher (BodyHandler would be for the response body). It does "the right thing" as you say, but I don't know how to adapt a RequestBlob to it (without reading it into memory or starting another thread). Original Author:nafg

lihaoyi commented 1 month ago

As I said, the cleanest implementation would be to completely break the RequestBlob API. (Of course we would still support implicitly adapting the same types of values.) Original Author:nafg

lihaoyi commented 1 month ago

Maybe I could just create a connected PipedOutputStream and PipedInputStream and pass the former to data.write and the latter to BodyPublisher.ofInputStream. But where would write be called? Original Author:nafg

lihaoyi commented 1 month ago

Sorry, I mixed up BodyHandler and BodyPublisher.

If I understand the problem right, it is that RequestBlob exposes a push-based def write(out: java.io.OutputStream): Unit for people to implement, whereas HttpRequest.BodyPublisher.ofInputStream requires a pull-based Supplier[? extends InputStream], and the pull-based approach is fundamentally harder to implement that the original push-based approach and thus incompatible.

Making RequestBlob pull-based rather than push-based is a no-go. Pull-based APIs are much harder to implement than pull-based APIs, and it would mean the Writables coming from e.g. uPickle or Scalatags or other libraries would not be compatible with this API. Making uPickle, Scalatags, etc. support pull-based serialization would be a full rewrite and is unfeasible

Spawning a thread is an option, but an expensive one. What if we instead use the Java HTTPClient's sendAsync, and then use the calling thread to call RequestBlob#write and block on it? The calling thread isn't doing anything anyway when using sendAsync, so it costs us nothing to use it to call .write. That way we don't need to pay the cost of a new thread while the calling thread sits idle

I'm actually not sure exactly how the async concurrency story works with java.net.http.HttpClient. Does it need an Executor, ExecutionContext, or Threadpool of some sort? We need to sort that out as part of this ticket so we can understand what the consequences of this replacement will be Original Author:lihaoyi

lihaoyi commented 1 month ago

I'm actually not sure exactly how the async concurrency story works with java.net.http.HttpClient. Does it need an Executor, ExecutionContext, or Threadpool of some sort? We need to sort that out as part of this ticket so we can understand what the consequences of this replacement will be

Yeah, the builder takes an Executor IIRC Original Author:nafg

lihaoyi commented 1 month ago

Spawning a thread is an option, but an expensive one. What if we instead use the Java HTTPClient's sendAsync, and then use the calling thread to call RequestBlob#write and block on it? The calling thread isn't doing anything anyway when using sendAsync, so it costs us nothing to use it to call .write. That way we don't need to pay the cost of a new thread while the calling thread sits idle

Does that mean that we give it an InputStream that will block until we write to the OutputStream? Not sure I follow.

Original Author:nafg

lihaoyi commented 1 month ago

I'm not totally sure myself TBH. But if you said you can solve the problem by spawning a thread, it seems like you should be able to solve the problem by using sendAsync and taking advantage of the calling thread. Original Author:lihaoyi

lihaoyi commented 1 month ago

Do I need to handle chunkedUpload==false Original Author:nafg

lihaoyi commented 1 month ago

I don't see an equivalent of connection.getResponseMessage Original Author:nafg

lihaoyi commented 1 month ago

@nafg You'll have to dig through the code and tell me what the tradeoffs and necessary changes are Original Author:lihaoyi

lihaoyi commented 1 month ago

Re response message (reason phrase) see https://stackoverflow.com/a/63576759/333643

Some points from there: HTTP 2 and some HTTP 1 servers don't provide it, HttpClient has no API to get it, and it doesn't really have any purpose.

Two options:

  1. Preserve compatibility using a table of strings per status code
  2. Remove the field from case class Response Original Author:nafg
lihaoyi commented 1 month ago

I guess we'll need more breaking changes or fake compatibility: https://stackoverflow.com/q/53617574/333643 (can't control keep-alive) Original Author:nafg

lihaoyi commented 1 month ago

It doesn't let you set Content-Length manually; it should be determined by the BodyPublisher. Which means RequestBlob can't define it among an arbitrary list of headers.

If I can make breaking changes as desired the most natural replacement would be

trait RequestBlob {
  def bodyPublisher: BodyPublisher
  def contentType: String
}

There are less breaking ways to do it though.

I guess that's really the theme of all my questions. Is this supposed to be "preserve the interface, swap the implementation" or is this going to be a V2 designed around HttpClient? If it's somewhere in the middle then I have to consult on all the specifics I guess. Original Author:nafg

lihaoyi commented 1 month ago

I guess that's really the theme of all my questions. Is this supposed to be "preserve the interface, swap the implementation" or is this going to be a V2 designed around HttpClient? If it's somewhere in the middle then I have to consult on all the specifics I guess.

I think it's somewhere in the middle:

  1. Breaking changes are fine, especially ones that the Java 11 HTTP design explicitly decided were low priority enough to not support. Requests-Scala isn't 1.0.0 yet, so we are not promising people that we'll keep things strictly compatible, so it's just a best effort thing and if users need to tweak code in some edge cases to update that's fine.

  2. But integration with uPickle, Scalatags, etc. via geny.Writable should not break, if at all possible. Being able to stream stuff seamlessly between the different libraries is a big deal, and not a capability I would give up lightly. Swapping over to geny.Readable to require it read from an InputStream is not feasible, because Readable is tremendously harder to implement than Writable

At least for now, it seems we have "use sendAsync and use the calling thread to call .write" as a path forward for keeping (2) working. If that doesn't work, then we'll have to reconsider: maybe we don't want to do this at all, maybe we want to pull in some third-party HTTP library, maybe something else. But we can cross that bridge when we reach it Original Author:lihaoyi

lihaoyi commented 1 month ago

I have all tests passing except gzipError which does requests.head("https://api.github.com/users/lihaoyi").

For some reason it returns a 400. I don't know why, and I don't know what it has to do with gzip. Any thoughts? Original Author:nafg

lihaoyi commented 1 month ago

@nafg not sure, you'll have to dig into it to take a look. Maybe try setting up a proxy to intercept the requests and compare the difference between the old and new? Original Author:lihaoyi

lihaoyi commented 1 month ago

Just logging this here to remember it. I used webhook.site to see what the gzipError test was doing different. I saved each as HAR JSON and compared with https://jsoncompare.com/#!/diff/id=a5efdd49952712265bd06d4e8be477cb&fullscreen&sortAlphabetically/ and got this:

image

Original Author:nafg

lihaoyi commented 1 month ago

so:

  1. The old version includes cache-control: no-cache
  2. The old version includes content-length: while the new version does content-length: 0
  3. The old version includes pragma: no-cache
  4. The new version includes transfer-encoding: chunked

I'm guessing the issue (if any of the above and not something more subtle) is (4).

But I still don't know what the purpose of the test is.

Original Author:nafg

lihaoyi commented 1 month ago

Ok it's from c65a2beef8e9ac6792b4ae3433cfeffca45201fc and seems to be part of #95 trying to fix #37.

So basically it's trying to assert that HEAD requests that accept gzipped encoding don't throw an exception trying to read a non-existent body.

Original Author:nafg

lihaoyi commented 1 month ago

So using a BodyPublisher with a fixed length when contentLength is known (as would be correct anyway) fixes the 400.

But it creates another issue, which I think is a bug in the current behavior that now crashes: Content-Length should be the size after compression.

Since there's no way to know the length after compression without compressing it, that basically means that known-content length RequestBlobs must be written into memory, which means there's no point in them specifying their content length at all. They should just say if they should be streamed or not. Original Author:nafg

lihaoyi commented 1 month ago

The gzip + content length thing seems like it might need an API change to make it work.

What if we defaulted to yes streaming + no content length, and provided a Compress.GzipBuffered that did the compression in memory and thus was yes content length + no streaming?

I don't see an easy way to make the decision automatically in a principled fashion, so maybe the best we can do is let the user choose Original Author:lihaoyi

lihaoyi commented 1 month ago

The gzip + content length thing seems like it might need an API change to make it work.

What if we defaulted to yes streaming + no content length, and provided a Compress.GzipBuffered that did the compression in memory and thus was yes content length + no streaming?

I don't see an easy way to make the decision automatically in a principled fashion, so maybe the best we can do is let the user choose

I didn't fully understand that.

I thought we needed an API change anyway and that was okay. I'll implement my last idea in the near future and push my code up. I think that should help us communicate ;) Original Author:nafg