scosman / zipstreamer

Zip File Streaming Microservice - stream zip files on the fly
MIT License
119 stars 18 forks

Consider supporting streaming zip file descriptors #35

Closed: danlamanna closed this issue 1 year ago

danlamanna commented 1 year ago

For larger zip files it can be prohibitive to provide the entire descriptor in one shot. It would be nice if the descriptor could be paginated in some way. The use case here is running a traditional prefork app on Heroku where 30 second time limits apply.

scosman commented 1 year ago

How big of a descriptor are you generating? I imagine this would need to be a massive number of files to get anywhere near the 30s Heroku limit.

Also: correct me if I'm wrong, but I thought the Heroku 30s timeout resets when we send a byte? Once the descriptor is generated, we stream as fast as the client can handle. So it would only trigger a timeout if the descriptor generation phase took more than 30s, which doesn't seem possible: Go is pretty fast and could generate even massive descriptors in under a second. If anything, there's a small memory hit if a client downloads a huge descriptor very slowly.

Can you provide some more details about what you're doing and the issue you encountered? I understand the ask, but I don't quite follow how this could be causing issues.

scosman commented 1 year ago

Closing for now, since I don't know how to reproduce a non-streaming descriptor (aside from the period while we're generating it, but that's a few milliseconds). If I'm missing something about the issue, please share some more details and I can re-open!

danlamanna commented 11 months ago

Sorry, my original description is sorely lacking.

I have an API (not written in Go) on Heroku that serves the zip descriptor. This does take a while to generate: it searches records, filters permissions, signs URLs, and serializes the entire structure to JSON, ultimately returning a descriptor with ~10k-200k elements. The 30 second timeout is prohibitive in our case because of the JSON serialization step: we can't serve a single byte until the entire data structure is serialized.

I think the ideal scenario would be a streaming descriptor format like jsonlines, where zipstreamer could start fetching/serving bytes before it has received the full descriptor. But for now I've hacked a fork to use a paginated descriptor that iterates through the files in chunks of 1,000, and it works decently well.

Does that make more sense?

scosman commented 11 months ago

Got it. I think you can still do this as is.

Change your descriptor-serving code to serve valid JSON in a streaming fashion. Some JSON libraries might not handle this, but others do, and you can always hand-code it given how simple the format is. That way zipstreamer reads bytes as they become ready, and the Heroku connection is kept alive. Go might not process the descriptor until the end, but it will read it.

Pseudocode (sketched in Go, where w is any writer such as the HTTP response):

fmt.Fprint(w, `{"suggestedFilename": "tps_reports.zip", "files": [`)
for i := 0; i < 10000; i++ {
    if i != 0 {
        fmt.Fprint(w, ",")
    }
    fmt.Fprintf(w, `{"url": "https://server.com/image%d.jpg", "zipPath": "image%d.jpg"}`, i, i)
}
fmt.Fprint(w, "]}")

danlamanna commented 11 months ago

This does work, thanks!

Do you have any opinions on a non-atomic descriptor format in general? I see 2 primary benefits:

  1. Performance: the user and zipstreamer don't idly wait for the API server to generate the entire descriptor, resulting in faster downloads
  2. UX: the user sees bytes downloading even before the API and zipstreamer finish their complete dialog, obviating the "is my download working" problem

scosman commented 11 months ago

Performance-wise: idle waiting isn't really a perf concern. There's a bit of memory usage, but the same thing can happen with a slow client. The fix there would be to use the disk for descriptors, not to stream them from input.

UX: I'm not sure it improves much. Since we don't set the size of the download when it starts (we don't know it), browsers just show a spinner from the start rather than a progress bar, and a streamed descriptor would have the same UX. It would be a bit faster, though.

Biggest concern: it would add a lot of complexity. I'd have to deal with errors mid-stream, streaming JSON parsing, and keeping 3 buffers in sync (descriptor, downloading files, streaming out zips). It could be done, but it would be a lot of work. I think the real fix is a faster descriptor source (cache, faster generation, etc.).