Closed lmmx closed 3 years ago
from range_streams.codecs import PngStream
from range_streams import RangeStream


def check_png_has_alpha(url: str) -> bool:
    try:
        p = PngStream(url=url)
        return p.alpha_as_direct
    except Exception:
        return False

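import zlib

data_sample_url = "https://storage.googleapis.com/gresearch/wit/wit_v1.train.all-1percent_sample.tsv.gz"
s = RangeStream(url=data_sample_url)
s.add(s.total_range)  # request the entire file as a single range
b = s.active_range_response.read()
# The file is gzip-compressed, so zlib must be told to expect a gzip header
d = zlib.decompress(b, wbits=zlib.MAX_WBITS | 16)
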
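Once decompressed, the sample is a tab-separated table. As a rough sketch of pulling candidate PNG URLs out of it with the standard library only: the column name `image_url` and the helper `png_urls_from_tsv` are assumptions rather than anything confirmed in this issue, and the tiny in-memory sample stands in for the real download.

```python
import csv
import gzip
import io


def png_urls_from_tsv(gz_bytes: bytes, url_column: str = "image_url") -> list[str]:
    """Decompress gzipped TSV bytes and return the URLs that end in .png."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # QUOTE_NONE: treat quote characters as ordinary text inside fields
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[url_column] for row in reader if row[url_column].lower().endswith(".png")]


# Hypothetical two-row sample, compressed in memory instead of downloaded
sample = "language\timage_url\nen\thttps://example.org/a.png\nen\thttps://example.org/b.jpg\n"
print(png_urls_from_tsv(gzip.compress(sample.encode("utf-8"))))
```

Filtering on the file extension is only a cheap first pass; the actual check would still go through `check_png_has_alpha` for each URL.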
I will want only the partial content (byte range `[16,29)`, corresponding to the IHDR chunk). While an asynchronous loop may be fast, I've previously (in `impscan`) tried and failed to combine these through a synchronous class, when really I think the async aspect cannot compose well with a pre-existing synchronous implementation. See `impscan.conda_meta.async_utils` for an idea of what this async loop would look like.

The check itself is very simple, since `range_streams` implements all of the PNG handling itself. Unless the speed is a serious barrier, I suspect that just running it serially, perhaps with multiprocessing, would be best.

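To make the byte range concrete: the 13 bytes at `[16,29)` of a PNG file are the IHDR data fields, and the colour type byte says whether the image carries a direct alpha channel. The following is a minimal standard-library sketch, independent of `range_streams` and not a description of how `PngStream` works internally. The helper name `ihdr_has_alpha` is hypothetical, and it does not detect transparency supplied through a `tRNS` chunk.

```python
import struct

# PNG colour types that include an alpha channel (per the PNG specification):
# 4 = greyscale with alpha, 6 = truecolour (RGB) with alpha.
ALPHA_COLOUR_TYPES = {4, 6}


def ihdr_has_alpha(ihdr: bytes) -> bool:
    """Check the colour type in the 13 IHDR data bytes (file bytes [16,29))."""
    if len(ihdr) != 13:
        raise ValueError(f"expected 13 IHDR bytes, got {len(ihdr)}")
    # Big-endian layout: width and height (4 bytes each), then five 1-byte fields
    (width, height, bit_depth, colour_type,
     compression, filter_method, interlace) = struct.unpack(">IIBBBBB", ihdr)
    return colour_type in ALPHA_COLOUR_TYPES


# Example: a 1x1, 8-bit RGBA header (colour type 6), built in memory
rgba_header = struct.pack(">IIBBBBB", 1, 1, 8, 6, 0, 0, 0)
print(ihdr_has_alpha(rgba_header))
```

Only those 13 bytes are needed for this check, which is why a single small range request per image is enough.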
For a large number of requests, it may also be wise (or perhaps unwise, if there are limits on one client?) to reuse the same client during the session. This is passed in as `PngStream(url=url, client=client)`, but otherwise there is no change.

I thought I'd made a note on this but now can't find it in the repo: I will use the Google Research dataset WIT (which shares the name of an OpenAI dataset?).