pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Line breaks to chunk up long PEP 691 responses #11919

Closed fungi closed 1 year ago

fungi commented 2 years ago

What's the problem this feature will solve? Some proxies default to limiting the maximum line length they're willing to process (Apache's mod_substitute, for example, refuses to process responses with >1M characters on a line, unless explicitly configured with a higher limit). Warehouse currently returns the JSON simple API response as a single line of JSON, and so can easily exceed this limit on its main index, as well as indices for projects with lots of packages like grpcio, pymongo and moto. This can lead to hard-to-diagnose proxy failures.

Describe the solution you'd like It would be nice if linebreaks were interspersed in very long JSON responses, in order to ease proxying and not require admins to work out additional configuration for covering these cases.

ewdurbin commented 2 years ago

In general, PyPI does not alter its behavior to support any proxy, caching layer, or network downstream of our CDN unless it is breaking a public specification or protocol.

As PEP 691 implementation is rather new, we'll keep this issue open in the event that someone can point to a specification/protocol being broken or if enough users begin to experience such issues with caches/proxies outside of their control.

cboylan commented 2 years ago

I think it is worth noting that the reason fungi and I run these proxies is to cache responses and reduce the number of requests made to PyPI. We are attempting to be good community members and reduce the load our CI system creates on upstream resources. Previously we had tried to manage a bandersnatch mirror instead, but that became untenable as the popularity of daily releases for very large ML/AI packages took off.

While I agree that changes should be made carefully to avoid unexpected incompatibilities, JSON doesn't care if you add the occasional newline between key: value pairs. The resulting data structures are equivalent, but one is far easier to process with tools used to reduce load on the system. We are not asking for any specs or protocols to be changed. Merely that an alternate but equivalent JSON representation be considered.

miketheman commented 2 years ago

Chiming in - I took a look at this briefly.

  1. Changing SubstituteMaxLineLength is a configuration change, and something I'd expect proxy maintainers to handle. Raising that value increases the maximum memory consumption during a request, so it's a proxy owner's decision, and something almost every Apache-backed WordPress installation has done. 😁
  2. PEP-691 doesn't assert anything specific about the shape of responses other than valid JSON. There is some consideration about the size of responses, but that's related to a filename. The current implementation also returns all_releases, which bandersnatch relies on, per an inline note. @dstufft did you want to follow up with bandersnatch to see about possibly removing it from the response?
  3. If we decide to change warehouse functionality, this could be accomplished by creating a custom pyramid renderer that adds indentation. An example implementation could be:

    # somewhere in config.py
    import pyramid.renderers

    config.add_renderer("indent_json", pyramid.renderers.JSON(indent=4))

    # then update (every|relevant) views:
    @view_config(
        ...,
        renderer="indent_json",
    )

    (We have 10 view_configs that use the json renderer today)

    This change would absolutely increase the overall size of the response: on a sample project with 13 releases and 18 files, with a current response size of 15,444 bytes, the added whitespace increases the response to 23,868 bytes. (Yes, it compresses well, but it's still considerably larger.)
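The size impact is easy to reproduce with the stdlib encoder (synthetic data below: the field names follow PEP 691, but the project name, URLs, and hash values are invented for illustration):

```python
import json

# A toy stand-in for a PEP 691 project response; field names follow the
# PEP, but the project name, URLs, and hashes are made up.
files = [
    {
        "filename": f"example-1.0.{i}.tar.gz",
        "url": f"https://files.example.invalid/example-1.0.{i}.tar.gz",
        "hashes": {"sha256": "0" * 64},
    }
    for i in range(18)
]
doc = {"meta": {"api-version": "1.0"}, "name": "example", "files": files}

compact = json.dumps(doc, separators=(",", ":"))
indented = json.dumps(doc, indent=4)
print(len(compact), len(indented))
```

The exact ratio on real responses will differ, but indentation reliably adds a substantial fraction of whitespace before compression.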

My take: given the size increase, I'd rather not change the JSON renderer this way at all. Instead, guide proxy owners to raise their limits where possible, and explore reducing our overall response sizes when we can, not increasing them.

fungi commented 2 years ago

Of course, configuring the proxy module to accept responses with longer lines was the first thing we did once we figured out the reason CI jobs using new versions of pip were arbitrarily failing to find specific packages. The hope was to spare other proxy operators, perhaps some who have less familiarity with Warehouse's behaviors and the new JSON simple API implementation, from needing to go through a similar experience.
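For other proxy operators hitting the same failure, the workaround amounts to a single directive. A rough sketch of the relevant Apache configuration (the mirror hostname and paths here are illustrative, not our actual setup):

```apache
<Location "/simple/">
    ProxyPass "https://pypi.org/simple/"
    # mod_substitute refuses lines longer than 1m by default; raise the
    # cap so single-line PEP 691 JSON responses aren't rejected outright.
    SubstituteMaxLineLength 10m
    AddOutputFilterByType SUBSTITUTE application/vnd.pypi.simple.v1+json
    Substitute "s|https://files.pythonhosted.org|https://mirror.example.invalid|n"
</Location>
```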

I know nothing about Pyramid, but adding the number of linebreaks you suggest seems like overkill, since having slightly over one linebreak every million characters would suffice for the case we identified, resulting in an approximate 0.0001% increase in uncompressed response size, and only for responses in the realm of a megabyte or more (which is an extremely small subset of projects on PyPI, setting aside the main project index).

Also, these responses are expected to continue growing over time as the lists of projects/releases/uploads increase, so simply setting a higher max line length in our proxy is merely a stop-gap. Whatever value we pick will eventually be reached again, indicating a need for ongoing monitoring and adjustment. I suppose another option would be a feature request to Apache to be able to turn off that line length check, though I do not know if it's there in order to prevent some other sort of misbehavior.

Please bear in mind this is not a bug report, and I'm not seeking suggestions on how to improve proxies. I'm simply hoping that some effort to not make Warehouse any more complicated for proxying than it already is will encourage CI system operators like us not to give up and point our many millions of additional requests a day at PyPI's CDN instead.

dstufft commented 2 years ago

I don't in theory have a problem with adding line breaks, but I'm not sure how to actually do it in a reasonable way. The JSON encoder doesn't offer us an option to emit extra newlines every X number of characters and we can't arbitrarily add them wherever we want because there are only certain places in a JSON response where you can add a newline and be fine.

I can think of a couple of ways of implementing them, but I don't particularly like either:

  1. Post-process the already-encoded JSON, scanning it and inserting line breaks only at the positions where JSON permits whitespace.
  2. Have the encoder itself emit the line breaks (pretty printing, or a streaming encoder we could hook into).

The first option seems like a nonstarter to me. I don't think anyone here wants to try to implement a mini JSON parser, plus it would have a performance impact because we'd essentially be scanning the content twice: once to encode it and once to add line breaks.

The second option seems the most doable, but one of the reasons we switched to orjson was that it was much faster at emitting JSON (up to 6x faster in benchmarks when I tested it against live data). As far as I can tell, orjson doesn't have a streaming encoder option or a way to add these newlines except by pretty printing the entire response. Unfortunately, that means this option would decrease the performance of some of our most-hit routes to work around a constraint in mod_substitute, which feels hard to justify to me.

Do you have other options that I'm unaware of?

Side note: I assume you're using mod_substitute to rewrite the content to point the file URLs back to your own mirror? I wonder if using something like Squid and setting an http_proxy instead of --index-url, so that the proxying is transparent, would work? Then you wouldn't need to change the responses at all. This kind of inline response modification probably isn't going to work long term anyway unless you disable TUF signatures locally, since modifying the response will cause the signatures to fail.

merwok commented 2 years ago

Is there no option to have orjson add a newline after each object? (of course it would have to be a top-level object, not every nested one — but not that hard to determine!)

dstufft commented 2 years ago

It has an option that is basically the same thing as indent=2 in the stdlib. I'm not sure I've ever seen a JSON library that offered an option for emitting a newline only after top-level objects; in most of them it's just a knob for pretty printing or not.
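For illustration with the stdlib (and, as I understand it, orjson's OPT_INDENT_2 flag behaves analogously): indent adds newlines and indentation at every nesting level, not only between the top-level objects merwok describes, so a newline-per-top-level-object mode would have to be built by hand.

```python
import json

# Minimal nested structure standing in for a simple-API response.
doc = {
    "meta": {"api-version": "1.0"},
    "files": [{"filename": "pkg-1.0.tar.gz", "hashes": {"sha256": "abc"}}],
}
# Every nesting level gets its own lines, not just the top level.
print(json.dumps(doc, indent=2))
```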

dstufft commented 1 year ago

I'm going to go ahead and close this issue. I believe the reporters have already worked around this in their own infrastructure, and in the last year no one else has come forward with the same issue, which suggests it isn't a widespread problem.