Use only 1 HEAD and 1 streaming GET request in `RangeStream`

lmmx commented 3 years ago

The existing interface for RangeStream is fine, but what it does internally should change. There should be only one streaming GET request, and all ranges added to this stream will not have their own RangeResponse but a simulated version derived from modifying the original, total range.

This is motivated by the discussion for PngStream in #28 in which I noticed the slow performance of the chunk populating method which relies on sending range requests for the next chunk after determining their location:

You can imagine rewriting it by simply changing what the add method does (split off a subdivision of the total range) and what happens upon initialisation if no range is passed in (a streaming request for the total range instead of a head request).

This was essentially a misunderstanding in the design phase of how to use these streams, and should be fairly simple to change over since it’s tested pretty thoroughly.

The only weirdness may be from having a single response… so maybe that should be ‘faked’ by changing the values and making them as copies…? Unclear if worth the memory toll though

Expect will need some mechanism to ‘gate’ ranges so reading past the end behaves as expected

For testing, perhaps switch over gradually with a flag in the init (singlestream=False) that gets toggled to True once stable and eventually removed in future versions?

lmmx commented 3 years ago

Oops… never mind, I did already consider it

As the motivation docs page says, I did already consider this and had forgotten the reason I decided against it:

If you use a single stream, then you’re required to load it linearly. You can’t skip on a single stream, seeking later than the cursor loads all bytes up to it.

The current design permits non-linear reading (which is costly for e.g. ZIP central directories but not for linear PNGs).

This is helpful to clarify and can be documented in the motivation page.

I will need to add this as a reading mode then, so all the previous code can stay as it is as long as I leave the current mode as the default.

This will entail some ‘hackyness’ (essentially having to fake the interface to add and active_range_response) but it’s worth not having to rewrite code that already supports this, and leave it up to the user to decide which way they want to use the library (which may depend on aspects as simple as file size)

lmmx commented 3 years ago

Rather than simply overwrite the add method wholesale when single_request is True in the params of RangeStream, I suggest adding a new method divide, which can be used in this case, and also be available for less simple use cases (such as "single request for the first N bytes of a file but new requests for any range after that")

Click to show humming and harring about what to call this method...

- `add` is a fitting name for a method that supplies a particular range on a stream, but `divide` does not accurately suggest the provision of a sub-range from a pre-existing range. - Better names for this method: - `partition` (but to me this suggests producing multiple sub-ranges rather than one) - `split` (same problem as above) - `subset` - "Subset" is too specific though (and more of a noun than a verb) - "Cut" may be neater, but sounds like it mutates its target (as does "trim"). - "Excise" has a clearer meaning than "cut" and does not suggest it mutates its target - From looking it up, apparently "excise" does sometimes mean mutating its target, but it's better than "extract" (as in 'take/draw out') which is a bit too generic of a term - I would choose either: - ~~`excise`~~, - ~~`snip`~~, - ~~`clip`~~, - ~~`snippet`~~, - ~~`withdraw`~~

window (as in sliding window)

The matter of whether the target is mutated is relevant here: the underlying RangeResponse will have a different range but the same httpx.Response (in RangeResponse.request.response, i.e. :attr:RangeRequest.response)

The RangeResponse gets its range (solely in the is_consumed method) through its RangeRequest.range

https://github.com/lmmx/range-streams/blob/e6277ece8dd07d5eff61a03461db4e01e5dbca42/src/range_streams/response.py#L146

This could therefore be ‘gated’…

to permit ‘marking’ (distinct from tail marking on the RangeResponse itself, which is to do with keeping ranges distinct in the ranges RangeDict)?

The idea of a "window" on the pre-stream makes clearer that it is not the same type of add, and can be facilitated with either a flag add(stream_window=False) or a new method add_stream_window

The name stream_window is equally concise as RangeStream and clear about the abstraction (a single streamed GET request) and the unit of computation (a mutable window onto an immutable underlying response)
- monostream_window would be clearer, but would also limit the extensibility of this method: in future I may remove the requirement for a window to be on a single stream over the entire file range.

Rather than storing the single stream separately, it should just be made clear in the documentation that a single stream should be used deliberately to avoid sending further requests, and therefore add should be called with stream_window=True

(I may abbreviate it to simply window=True in fact, the class itself is a stream so it doesn't need repeating)

lmmx commented 3 years ago

Implemented with use_window=True, it all seems to work perfectly.

lmmx / range-streams

Use only 1 HEAD and 1 streaming GET request in `RangeStream` #29