Closed benoit74 closed 2 weeks ago
@richterdavid do you confirm you wanna implement this issue?
Happy to.
@benoit74 by start() you meant Creator.start here?
How about setting a command-line flag for how much of the illustration to log? Default it to something (e.g., 100 bytes), and support two sentinel values representing "nothing" and "everything".
by start() you meant Creator.start here?
Yes
How about setting a command-line flag for how much of the illustration to log? Default it to something (e.g., 100 bytes), and support two sentinel values representing "nothing" and "everything".
Python-scraperlib is a library, so there is no such thing as a command-line flag. But we could add an argument, e.g. to the Creator.__init__() method.
But logging only the first 100 bytes makes little sense to me; it has little value. It might only be used to check the mime type, and in that case I would rather log only the illustration mime type (python-scraperlib already has everything needed to detect it). Logging nothing is then not a big win, and I don't expect scrapers to want this alternative (one extra small log line usually does not matter). And logging everything, if optional, is better done directly in the scraper than in python-scraperlib.
So to sum up: I propose to log all raw metadata except for the illustration, where we log only its mime type. And no new argument to any function.
WDYT?
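For illustration, the proposal above could be sketched roughly as follows. This is only a sketch under assumptions, not python-scraperlib code: `sniff_mime`, `loggable_metadata` and `log_metadata` are hypothetical names, the magic-byte table is a stand-in for the mime detection the library already has, and illustrations are assumed to be stored as bytes under names starting with `Illustration_`.

```python
import logging

logger = logging.getLogger("metadata-sketch")

# Hypothetical signature table; a real implementation would reuse the
# library's own mime detection instead of this stand-in.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF8": "image/gif",
}


def sniff_mime(data: bytes) -> str:
    """Guess a mime type from leading magic bytes (illustrative only)."""
    for signature, mime in _SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return "application/octet-stream"


def loggable_metadata(metadata: dict) -> dict:
    """Return a copy where illustration bytes are replaced by their mime type."""
    summary = {}
    for name, value in metadata.items():
        # Assumption: illustrations are bytes stored under "Illustration_..." names.
        if name.startswith("Illustration_") and isinstance(value, bytes):
            summary[name] = sniff_mime(value)
        else:
            summary[name] = value
    return summary


def log_metadata(metadata: dict) -> None:
    """Log one line per metadata entry, using the summarized values."""
    for name, value in loggable_metadata(metadata).items():
        logger.debug("Metadata %s = %r", name, value)
```

With this shape no new argument is needed anywhere: a scraper that wants the full illustration in its logs can still log it itself.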
As discussed in https://github.com/openzim/warc2zim/issues/123, we would benefit from logging the metadata which are used, at least all text values.
Regarding the illustration, do we want to log the base64 value? It might be useful for debugging as well, but its size in the logs is not always negligible.
I recommend doing it right at the beginning of the start method, before the check for presence of mandatory metadata and before any validation, so that it is always logged.
@rgaudin @kelson42 WDYT?
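To make the ordering concrete, here is a minimal sketch of "log first, then check". It does not use the real Creator: `SketchCreator` and the `MANDATORY` tuple are hypothetical stand-ins, and only text values are logged since the illustration question is still open.

```python
import logging

logger = logging.getLogger("creator-sketch")

# Hypothetical subset of mandatory names, for illustration only.
MANDATORY = ("Name", "Title", "Language")


class SketchCreator:
    """Stand-in class; not the real Creator."""

    def __init__(self, metadata: dict):
        self.metadata = metadata

    def start(self):
        # 1. Log raw text values first, so they appear even if a check below fails.
        for name, value in self.metadata.items():
            if isinstance(value, str):
                logger.debug("Metadata %s = %r", name, value)
        # 2. Only then check that mandatory metadata is present.
        missing = [name for name in MANDATORY if name not in self.metadata]
        if missing:
            raise ValueError(f"Missing mandatory metadata: {', '.join(missing)}")
        return self
```

Because the log calls run before the presence check, a failing start() still leaves the values that were passed in the logs.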
@richterdavid do you confirm you wanna implement this issue? Please wait a little for the discussion to settle here before rushing into any implementation; we need to confirm everyone is on the same page.