threatpatrols / hibp-downloader

Efficiently download new pwned password hashes from api.pwnedpasswords.com fast
https://hibp-downloader.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Option to choose compression? #3

Open Zeuh opened 6 months ago

Zeuh commented 6 months ago

Hi,

I would like to run a local clone of the HIBP Passwords API with data from hibp-downloader, and I would like to have the files stored without compression so I can serve them directly. Is it possible to have an option to choose between gzip|br|none?

Thx

ndejong commented 6 months ago

Hi @Zeuh, you'll really want to take a closer look at how hibp-downloader handles the storage of files. In short, you very likely do not need to decompress the data.

Before I get into the details below: downloading and storing in gzip is a very deliberate choice, since nice command-line tools exist for gzip (eg zcat, zgrep) that do not exist for Brotli (yet).

Here's what happens under the hood:

The compute-time/storage-space tradeoff favors keeping the datafiles natively compressed as-is directly from download and doing a small amount of compute (ie the decompression) when needed.

I can imagine that a simple wrapper function might be a good thing for you and others, one that makes it possible to do something like this:

query_hibp_datastore(data_path="/path/to/data", password="password123", hash_type="sha1")
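A minimal sketch of what such a wrapper could look like, assuming each stored file holds the gzip-compressed range response for one 5-character hash prefix (lines of `SUFFIX:COUNT`). The function body and the `prefix_file` helper with its directory layout are hypothetical; they are not part of hibp-downloader and would need to be adapted to the layout the tool actually writes.

```python
import gzip
import hashlib
from pathlib import Path
from typing import Optional


def prefix_file(data_path: str, hash_type: str, prefix: str) -> Path:
    # ASSUMPTION: hypothetical layout <data_path>/<hash_type>/<PREFIX>.gz;
    # replace this with the layout hibp-downloader really uses.
    return Path(data_path) / hash_type / f"{prefix}.gz"


def query_hibp_datastore(data_path: str, password: str, hash_type: str = "sha1") -> Optional[int]:
    """Return the pwned count for the password, or None if it is not found."""
    if hash_type != "sha1":
        raise ValueError("this sketch only handles sha1")
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    path = prefix_file(data_path, hash_type, prefix)
    if not path.exists():
        return None
    # Only the one small file covering this prefix is decompressed.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line_suffix, _, count = line.strip().partition(":")
            if line_suffix.upper() == suffix:
                return int(count)
    return None
```

The point of the sketch is the tradeoff described above: the datastore stays compressed on disk and a single small file is decompressed per lookup.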

Finally, if you still really want to store the data decompressed, you could use the hibp-downloader generate subcommand. It generates one massive datafile, similar to the one produced by the original hibp download tool created by Troy's team.

Zeuh commented 6 months ago

Thanks for this complete answer, and for this tool, which is the most efficient and complete :)

I already know how it works, but if I want to build a local clone of the HTTP Pwned Passwords API, having all files uncompressed is simpler: I read the data and hand it to fastapi or whatever lib builds the HTTP response (which will be compressed or not depending on the HTTP client's capabilities).

My goal is to avoid decompressing the file and passing the uncompressed string to the HTTP response builder, which will then have to recompress it 99% of the time (probably in br or gzip). Storage volume is not a problem in my case.

Best Regards,
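One way to avoid the decompress-then-recompress round trip described in this comment, sketched under assumptions: if the files are stored as gzip, the stored bytes can be sent unchanged with `Content-Encoding: gzip` whenever the client's `Accept-Encoding` allows gzip, and decompressed only for clients that do not. The function names below are hypothetical and the header parsing is deliberately rough.

```python
import gzip
from typing import Dict, Tuple


def accepts_gzip(accept_encoding: str) -> bool:
    """Rough check: is there a gzip token that is not disabled with q=0?"""
    for item in accept_encoding.split(","):
        token, _, params = item.partition(";")
        if token.strip().lower() != "gzip":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def build_response(stored_gzip: bytes, accept_encoding: str) -> Tuple[bytes, Dict[str, str]]:
    """Return (body, headers) for one stored gzip datafile."""
    headers = {"Content-Type": "text/plain", "Vary": "Accept-Encoding"}
    if accepts_gzip(accept_encoding):
        # Pass the stored bytes through untouched; no recompression needed.
        headers["Content-Encoding"] = "gzip"
        return stored_gzip, headers
    # Fallback for clients that cannot handle gzip.
    return gzip.decompress(stored_gzip), headers
```

In a web framework, any compression middleware would also have to be kept from re-encoding a body that already carries a `Content-Encoding` header.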

ndejong commented 6 months ago

brotlipy looks straightforward enough, happy to take a pull request - https://python-hyper.org/projects/brotlipy/en/latest/

took commented 5 months ago

@Zeuh take a look at ENCODING_TYPE in src/hibp_downloader/__init__.py line 28:

# encoding_type
# The encoded response-content is stored as-is without trying to decode (ie decompress) it into a new encoding type
# for local storage; Because this content is stored as-is, it is more useful to use "gzip" because the command-line
# tools (eg zcat, zgrep) are more readily available than brotli enabled tools when examining the data-store files
ENCODING_TYPE = "gzip"  # values: [ gzip | br | None ]

If you want --encoding as a command-line parameter, take a look at my PR https://github.com/threatpatrols/hibp-downloader/pull/4, where I added the parameters --http-proxy and --http-verify. They get passed all the way down to httpx_binary_response(url, etag=None, method="GET", encoding="gzip", timeout=10, max_retries=3, proxy="", verify="", __attempt=0, debug=False), which is also where the encoding parameter needs to be passed.
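As a rough illustration of threading such an option from the command line down to the request, here is a sketch using the standard library's argparse. Everything in it is hypothetical: the thread does not show the project's CLI framework or how httpx_binary_response uses its encoding argument, so `parse_args`, `normalise_encoding` and `request_headers` are stand-ins, not the project's code.

```python
import argparse
from typing import Dict, List, Optional

# Mirrors the values listed next to ENCODING_TYPE: gzip | br | None
ENCODING_CHOICES = ("gzip", "br", "none")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="hibp-downloader-sketch")
    parser.add_argument("--encoding", choices=ENCODING_CHOICES, default="gzip")
    return parser.parse_args(argv)


def normalise_encoding(value: str) -> Optional[str]:
    """Map the CLI string "none" to Python's None; pass other values through."""
    return None if value == "none" else value


def request_headers(encoding: Optional[str]) -> Dict[str, str]:
    """ASSUMPTION: the encoding is sent to the server as Accept-Encoding."""
    return {"Accept-Encoding": encoding if encoding else "identity"}
```

For example, `request_headers(normalise_encoding(parse_args(["--encoding", "br"]).encoding))` yields a header asking for Brotli.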

ndejong commented 5 months ago

If someone decides to go ahead and implement .br please do remember to implement br in the function load_datafile() so the command-line query sub-command works as expected too.

There is already a clear if/then block in load_datafile() to do this, and the brotlipy library looks like it will be fairly seamless to use too.
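A sketch of what a branch per encoding type could look like, assuming the datafile is read fully into memory and decoded as UTF-8 text. This is not the project's load_datafile(); the name `load_datafile_sketch` and its signature are hypothetical. The `br` branch relies on the third-party `brotli` module (which brotlipy installs) and its `brotli.decompress()` function.

```python
import gzip
from pathlib import Path
from typing import Optional, Union


def load_datafile_sketch(path: Union[str, Path], encoding_type: Optional[str]) -> str:
    """Read one stored datafile and return its decoded text content."""
    raw = Path(path).read_bytes()
    if encoding_type == "gzip":
        data = gzip.decompress(raw)
    elif encoding_type == "br":
        # Imported lazily so gzip-only setups do not need the dependency.
        import brotli  # third-party; provided by brotlipy

        data = brotli.decompress(raw)
    elif encoding_type is None:
        data = raw  # stored without compression
    else:
        raise ValueError(f"unsupported encoding type: {encoding_type!r}")
    return data.decode("utf-8")
```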