Open Zeuh opened 6 months ago
Hi @Zeuh, you'll really want to take a closer look at how hibp-downloader handles the storage of files. In short, you very likely do not need to decompress the data.
Before I get into the details below: downloading and storing in gzip is a very deliberate choice, since nice command-line tools exist for gzip (eg zcat, zgrep) that do not exist for Brotli (yet).
Here's what happens under the hood -
The hibp-downloader tool pulls data in gzip compressed format and writes that compressed stream to disk together with some helpful metadata in case we want to review that at some later stage.
When you query if a password is contained in the dataset, the hibp-downloader tool then (1) takes a sha1 of the user-supplied password, (2) generates the filename of the file where the password entry should exist, and then (3) decompresses that file in-memory and finds the line-entry based on its sha1 if it exists. This process is surprisingly fast and you can handle many requests per second this way.
The compute-time/storage-space tradeoff favors keeping the datafiles natively compressed as-is directly from download and doing a small amount of compute (ie the decompression) when needed.
I can imagine that a simple wrapper function that makes it possible to do something like this might be a good thing for you and others:
query_hibp_datastore(data_path="/path/to/data", password="password123", hash_type="sha1")
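A minimal sketch of what such a wrapper could look like, following the three steps described above. The on-disk layout (`<data_path>/<PREFIX>.gz`, where PREFIX is the first five hex characters of the hash) and the `SUFFIX:COUNT` line format are assumptions borrowed from the Pwned Passwords range API, not confirmed details of how hibp-downloader names its files, so the path logic would need adjusting to match the real datastore.

```python
import gzip
import hashlib
from pathlib import Path
from typing import Optional


def query_hibp_datastore(data_path: str, password: str, hash_type: str = "sha1") -> Optional[int]:
    """Return the breach count for `password`, or None if it is not found."""
    if hash_type != "sha1":
        raise ValueError("this sketch only handles sha1")

    # (1) take a sha1 of the user-supplied password
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]

    # (2) derive the datafile name; ASSUMED layout: <data_path>/<PREFIX>.gz
    datafile = Path(data_path) / f"{prefix}.gz"
    if not datafile.exists():
        return None

    # (3) decompress while reading and look for the matching line entry
    with gzip.open(datafile, "rt", encoding="utf-8") as fh:
        for line in fh:
            entry, _, count = line.strip().partition(":")
            if entry.upper() == suffix:
                return int(count)
    return None
```

Only one small file is opened per lookup, which is consistent with the point above that decompressing on demand is cheap.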
Finally, if you still really, really want to store the data decompressed then you could use the hibp-downloader generate subcommand, which will generate one massive datafile similar to the one generated by the original hibp download tool created by Troy's team.
Thanks for this complete answer and for this tool, which is the most efficient and complete :)
I already knew how it works, but if I want to build a local clone of the HTTP Pwned Passwords API, having all files uncompressed is simpler: I get the data and give it to fastapi or whatever lib to build the HTTP response (which will be compressed or not depending on the HTTP client's capabilities).
My goal is to avoid decompressing the file and then passing the uncompressed string to the HTTP response builder, which will have to recompress it 99% of the time (in br or gzip probably). Storage volume is not a problem in my case.
Best Regards,
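The response-building step described in this comment can be sketched roughly as follows. This is only an illustration: `build_range_response` is a hypothetical helper, not part of fastapi or hibp-downloader, it only handles gzip (Brotli would need a third-party library), and it ignores `q=` weights in the `Accept-Encoding` header.

```python
import gzip
from typing import Dict, Tuple


def build_range_response(body: bytes, accept_encoding: str) -> Tuple[Dict[str, str], bytes]:
    """Compress an uncompressed range file only if the client advertises gzip."""
    # Parse "br, gzip;q=0.8" into {"br", "gzip"}; q-values are ignored in this sketch.
    offered = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}

    headers = {"Content-Type": "text/plain"}
    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body
```

Starting from an uncompressed file, the compression cost is paid on every response for any client that accepts gzip.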
brotlipy looks straightforward enough, happy to take a pull request - https://python-hyper.org/projects/brotlipy/en/latest/
@Zeuh take a look at ENCODING_TYPE in src/hibp_downloader/__init__.py, line 28:
# encoding_type
# The encoded response-content is stored as-is without trying to decode (ie decompress) it into a new encoding type
# for local storage; Because this content is stored as-is, it is more useful to use "gzip" because the command-line
# tools (eg zcat, zgrep) are more readily available than brotli enabled tools when examining the data-store files
ENCODING_TYPE = "gzip" # values: [ gzip | br | None ]
If you want --encoding as a command line parameter, take a look at my PR https://github.com/threatpatrols/hibp-downloader/pull/4 where I added parameters for --http-proxy and --http-verify that get passed all the way down to httpx_binary_response(url, etag=None, method="GET", encoding="gzip", timeout=10, max_retries=3, proxy="", verify="", __attempt=0, debug=False), which is where the encoding parameter also needs to be passed.
If someone decides to go ahead and implement .br, please do remember to implement br in the function load_datafile() so the command-line query sub-command works as expected too.
There is already a clear if/then block in load_datafile() to do this, and the brotlipy library looks like it will be fairly seamless to use too.
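As a rough illustration of the kind of branch being suggested, here is a sketch of decoding by encoding type. The function name `decode_datafile_bytes` is hypothetical and this is not the project's actual load_datafile() code; the `br` branch assumes a third-party `brotli` module (as provided by brotlipy) is installed and is only imported when needed.

```python
import gzip
from typing import Optional


def decode_datafile_bytes(raw: bytes, encoding_type: Optional[str]) -> bytes:
    """Decode stored datafile bytes according to the encoding they were saved with."""
    if encoding_type == "gzip":
        return gzip.decompress(raw)
    if encoding_type == "br":
        # Assumption: brotlipy (or another package exposing `brotli`) is installed.
        import brotli
        return brotli.decompress(raw)
    if encoding_type is None:
        # Stored without compression, nothing to do.
        return raw
    raise ValueError(f"unsupported encoding type: {encoding_type!r}")
```

The accepted values mirror the `[ gzip | br | None ]` options listed next to ENCODING_TYPE.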
Hi,
I would like to run a local clone of the HIBP Passwords API with data from hibp-downloader, and I would like to have files with no compression so I can serve them directly. Is it possible to have an option to choose between gzip|br|none?
Thx