openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Simplify debugging of WARC record processing issues #252

Closed benoit74 closed 4 months ago

benoit74 commented 4 months ago

Fix #248

benoit74 commented 4 months ago

It's a dev feature that should be advertised as such.

Agreed.

It should be properly documented. The usage text doesn't mention that failed records are stored on the filesystem.

Agreed. I will even add a CLI argument to specify where to store these failed items, with the same default as today. Having it is useful on some occasions and is also a good way to make the feature more visible.
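To illustrate the kind of CLI argument discussed above, here is a minimal sketch using `argparse`. The flag name `--failed-items-dir`, the program name, and the default location are all hypothetical, chosen for illustration only. They are not taken from warc2zim itself.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a (hypothetical) failed-items directory option."""
    parser = argparse.ArgumentParser(prog="warc2zim-sketch")
    parser.add_argument(
        "--failed-items-dir",  # hypothetical flag name, not the real warc2zim option
        type=Path,
        # Assumed default: a folder under the system temp directory.
        default=Path(tempfile.gettempdir()) / "failed_items",
        help="[dev] Directory where records that failed processing are stored",
    )
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--failed-items-dir", "debug/failed"])
print(args.failed_items_dir)
```

Marking the help text as a dev option is one way to advertise the feature as such while still keeping it visible in the usage output.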

Why store the data as base64 on the filesystem? Base64 is a transport-over-text format. I don't see any reason to use it here. Am I missing something?

Because at first I intended to log this content, so base64 was more or less mandatory to avoid issues. I then realized it was too risky, complex, and cumbersome to push potentially big content (e.g. PDFs) to the log, and decided to store it on the filesystem instead. I didn't realize that base64 was no longer needed at that point. Anyway, I agree we should store the item as-is, in plain form.
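As a rough sketch of the "store as-is" approach agreed on above, the snippet below writes a payload to disk as raw bytes and compares its size with the base64 form. The function name `store_failed_item`, the file name, and the sample payload are hypothetical. This is not the actual warc2zim implementation.

```python
import base64
import tempfile
from pathlib import Path


def store_failed_item(directory: Path, name: str, payload: bytes) -> Path:
    """Write the payload unchanged (no base64) and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    target.write_bytes(payload)  # binary write, so the content is kept as-is
    return target


# Sample binary payload, standing in for the content of a failed record.
payload = b"%PDF-1.7\n" + bytes(range(256))

with tempfile.TemporaryDirectory() as tmp:
    path = store_failed_item(Path(tmp) / "failed_items", "record.bin", payload)
    # The stored file is byte-for-byte identical to the original payload.
    assert path.read_bytes() == payload

# Base64 emits 4 output characters per 3 input bytes, so it is always larger.
print(len(payload), len(base64.b64encode(payload)))
```

Besides the size overhead, a raw file can be opened directly with the usual tools (a PDF viewer, an image viewer, a text editor), which fits the debugging purpose.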

benoit74 commented 4 months ago

Sorry, I found a small typo in the documentation after asking for review.

benoit74 commented 4 months ago

@rgaudin up, review required please