Efficient range stream crawler

I only want the partial content (byte range [16,29), corresponding to the IHDR chunk). An asynchronous loop may be fast, but I’ve previously (in impscan) tried and failed to combine the two through a synchronous class. I now think the async aspect simply cannot compose well with a pre-existing synchronous implementation.
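To make the byte range concrete, here is a minimal sketch using only the standard library. It assumes the standard PNG layout (8-byte signature, then the IHDR chunk's 4-byte length and 4-byte type, then 13 bytes of IHDR data) and treats colour types 4 and 6 as carrying an alpha channel. The helper names `http_range_header` and `ihdr_has_alpha_channel` are hypothetical, and this is not how range_streams implements the check internally. Transparency declared in a separate tRNS chunk is not visible in these bytes.

```python
import struct

# Half-open range of the IHDR data: 8 signature bytes, 4 length bytes and
# 4 chunk-type bytes come before it; the IHDR payload itself is 13 bytes.
IHDR_DATA_RANGE = (16, 29)


def http_range_header(start: int, stop: int) -> dict[str, str]:
    """Turn a half-open [start, stop) range into an HTTP Range header.

    HTTP byte ranges are inclusive at both ends, hence the `stop - 1`.
    """
    return {"Range": f"bytes={start}-{stop - 1}"}


def ihdr_has_alpha_channel(ihdr_data: bytes) -> bool:
    """Return True if the 13-byte IHDR payload declares an alpha channel."""
    # Big-endian: width, height (4 bytes each), then five 1-byte fields.
    (_width, _height, _bit_depth, colour_type,
     _compression, _filter, _interlace) = struct.unpack(">IIBBBBB", ihdr_data)
    # 4 = greyscale with alpha, 6 = truecolour with alpha.
    return colour_type in (4, 6)
```

Calling `http_range_header(*IHDR_DATA_RANGE)` gives the header to send with the request, and the 13 bytes that come back can go straight into `ihdr_has_alpha_channel`.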

I suggest looking at impscan.conda_meta.async_utils for an idea of what this async loop would look like.
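For orientation, a generic asyncio loop of this kind might look like the sketch below. It is not the impscan code, which I have not reproduced here. The name `gather_checks`, the `check` coroutine parameter and the `max_concurrency` default are all assumptions made for illustration. The caller supplies the coroutine that actually fetches and inspects one URL.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def gather_checks(
    urls: Iterable[str],
    check: Callable[[str], Awaitable[bool]],
    max_concurrency: int = 20,  # assumed limit, tune to the server's tolerance
) -> dict[str, bool]:
    """Run `check` over all URLs concurrently, at most `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> tuple[str, bool]:
        async with semaphore:
            try:
                return url, await check(url)
            except Exception:
                # Mirror the synchronous check: any failure counts as False.
                return url, False

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

The whole loop is then driven with `asyncio.run(gather_checks(urls, check))`, where `check` has to be a genuinely asynchronous fetch rather than a wrapper around a blocking call.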

The check itself is very simple, since range_streams implements all of the PNG handling itself:

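from range_streams.codecs import PngStream


def check_png_has_alpha(url: str) -> bool:
    """Report PngStream's alpha_as_direct flag for the PNG at `url`."""
    try:
        p = PngStream(url=url)
        return p.alpha_as_direct
    except Exception:
        # Any failure (bad URL, network error, not a PNG) is reported as False
        return False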

Unless speed is a serious barrier, I suspect that just running it serially, perhaps with multiprocessing, would be best.
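A minimal sketch of that serial-first approach follows. The helper name `check_many` and its `pool` parameter are hypothetical. With no pool, it runs the check one URL at a time. With a pool, it hands the work to whatever object offers a `map(func, iterable)` method, which is assumed to return results in input order as `multiprocessing.Pool.map` does.

```python
from typing import Any, Callable, Iterable, Optional


def check_many(
    urls: Iterable[str],
    check: Callable[[str], bool],
    pool: Optional[Any] = None,  # anything with an order-preserving .map()
) -> dict[str, bool]:
    """Apply `check` to every URL, serially by default or through `pool`."""
    url_list = list(urls)
    # The builtin map runs serially; pool.map spreads the calls over workers.
    mapper = map if pool is None else pool.map
    return dict(zip(url_list, mapper(check, url_list)))
```

Passing a `multiprocessing.Pool` as `pool` requires `check` to be a picklable top-level function, which `check_png_has_alpha` is. Since the work is mostly waiting on the network, `multiprocessing.pool.ThreadPool` offers the same `map` interface without the pickling constraint.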

For a large number of requests, it may also be wise (or perhaps unwise, if there are limits on one client?) to reuse the same client for the whole session. The client is passed in as PngStream(url=url, client=client); otherwise nothing changes.
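One way to thread a shared client through the check is sketched below. The factory name `make_checker` is hypothetical, and the stream class is passed in as a parameter so the sketch stays independent of any installed library. The only thing taken from above is the `PngStream(url=url, client=client)` call shape and the `alpha_as_direct` attribute.

```python
from typing import Any, Callable


def make_checker(stream_cls: Callable[..., Any], client: Any) -> Callable[[str], bool]:
    """Build a check function that hands one shared client to every stream."""

    def check(url: str) -> bool:
        try:
            # Same call as before, with the shared client added.
            return bool(stream_cls(url=url, client=client).alpha_as_direct)
        except Exception:
            # Any failure is reported as False, as in the single-URL check.
            return False

    return check
```

In use this would be `check = make_checker(PngStream, client)`, with `client` created once before the loop and closed after it. I am assuming here that the client is an `httpx.Client`, which should be confirmed against the range_streams documentation.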

I thought I’d made a note on this but now can’t find it in the repo: I will use WIT, the Google Research dataset (which shares its name with an OpenAI dataset?).

1% data sample
