lmmx / wikitransp

Dataset of transparent images from Wikimedia
MIT License

Efficient range stream crawler #1

Closed lmmx closed 3 years ago

lmmx commented 3 years ago

I only want partial content: the byte range [16, 29), which corresponds to the data of the IHDR chunk. An asynchronous loop may be fast, but I’ve previously (in impscan) tried and failed to combine async fetching with a synchronous class. I now think the async aspect simply cannot compose well with a pre-existing synchronous implementation.
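To make that byte range concrete, here is a minimal standard-library sketch of what the 13 bytes hold. The name `has_direct_alpha` and the two offset constants are made up for illustration; this is not a claim about how range_streams does it internally, only a reading of the PNG header layout (8-byte signature, 4-byte chunk length, 4-byte chunk type, then the IHDR data).

```python
import struct

# File offsets of the IHDR chunk *data*: the 8-byte PNG signature is followed
# by a 4-byte chunk length and the 4-byte chunk type "IHDR", so data starts at 16.
IHDR_START, IHDR_END = 16, 29


def has_direct_alpha(ihdr: bytes) -> bool:
    """Return True if the 13 IHDR bytes declare a colour type with an alpha channel."""
    if len(ihdr) != IHDR_END - IHDR_START:
        raise ValueError(f"expected 13 bytes of IHDR data, got {len(ihdr)}")
    # Big-endian: width and height (4 bytes each), then five single-byte fields.
    (width, height, bit_depth, colour_type,
     compression, filter_method, interlace) = struct.unpack(">IIBBBBB", ihdr)
    # Colour type 4 is greyscale + alpha, colour type 6 is truecolour + alpha.
    return colour_type in (4, 6)
```

For example, `has_direct_alpha(struct.pack(">IIBBBBB", 1, 1, 8, 6, 0, 0, 0))` describes a 1x1 RGBA image. Note that transparency supplied through a tRNS chunk (for example on palette images) is not visible in the IHDR, so this only detects an alpha channel declared directly in the header.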

The check itself is very simple, since range_streams implements all of the PNG handling itself:

from range_streams.codecs import PngStream

def check_png_has_alpha(url: str) -> bool:
  try:
    p = PngStream(url=url)
    return p.alpha_as_direct
  except Exception:
    return False

Unless speed turns out to be a serious barrier, I suspect that just running it serially, perhaps with multiprocessing, would be best.
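A rough sketch of what "serially, perhaps with multiprocessing" could look like is below. The helper name `check_urls` and its `processes` parameter are hypothetical; the check function is passed in so the sketch does not depend on range_streams, and in practice it would be something like `check_png_has_alpha`.

```python
from multiprocessing import Pool
from typing import Callable, Iterable, List, Tuple


def check_urls(
    urls: Iterable[str],
    check: Callable[[str], bool],
    processes: int = 1,
) -> List[Tuple[str, bool]]:
    """Apply `check` to every URL, serially by default or in a process pool."""
    url_list = list(urls)
    if processes <= 1:
        # Plain serial loop: the simplest option and easiest to debug.
        results = [check(url) for url in url_list]
    else:
        # Pool.map preserves input order; `check` must be picklable,
        # i.e. a function defined at module level rather than a lambda.
        with Pool(processes=processes) as pool:
            results = pool.map(check, url_list)
    return list(zip(url_list, results))
```

Since each check is dominated by waiting on the network, the pool mainly buys concurrency rather than CPU time. On platforms that start workers by spawning, the call with `processes > 1` should sit under an `if __name__ == "__main__":` guard.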

For a large number of requests, it may also be wise (or perhaps unwise, if there are limits on one client?) to reuse the same client throughout the session. The client is passed in as PngStream(url=url, client=client); otherwise nothing changes.

I thought I’d made a note of this but now can’t find it in the repo: I will use the Google Research dataset WIT (which shares its name with an OpenAI dataset?).

lmmx commented 3 years ago
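import zlib

from range_streams import RangeStream
from range_streams.codecs import PngStream

def check_png_has_alpha(url: str) -> bool:
  # Treat any failure (network error, not a PNG, ...) as "no alpha"
  try:
    p = PngStream(url=url)
    return p.alpha_as_direct
  except Exception:
    return False

data_sample_url = "https://storage.googleapis.com/gresearch/wit/wit_v1.train.all-1percent_sample.tsv.gz"
s = RangeStream(url=data_sample_url)
s.add(s.total_range)  # request the whole file as a single range
b = s.active_range_response.read()
# The file is gzip-compressed, so zlib needs wbits=MAX_WBITS | 16 to accept the gzip header
d = zlib.decompress(b, wbits=zlib.MAX_WBITS | 16)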