openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Display all metadata in debug log level #155

Closed by benoit74 2 weeks ago

benoit74 commented 2 months ago

As discussed in https://github.com/openzim/warc2zim/issues/123, we would benefit from logging the metadata that is used, or at least all text values.

Regarding the illustration, do we want to log the base64 value? It might be useful for debugging as well, but its size in the logs is not always negligible.

I recommend doing it right at the beginning of the start method, before checking for the presence of mandatory metadata and before any validation, so that it is always logged.

@rgaudin @kelson42 WDYT?

@richterdavid do you confirm you wanna implement this issue? Please wait a little bit for the discussion to settle here before rushing into any implementation; we need to confirm everyone is on the same page.

richterdavid commented 2 months ago

> @richterdavid do you confirm you wanna implement this issue?

Happy to.

richterdavid commented 2 months ago

@benoit74 by start() you meant Creator.start here?

How about setting a command-line flag for how much of the illustration to log? Default it to something (e.g., 100 bytes), and support two sentinel values representing "nothing" and "everything".

benoit74 commented 2 months ago

> by start() you meant Creator.start here?

Yes

> How about setting a command-line flag for how much of the illustration to log? Default it to something (e.g., 100 bytes), and support two sentinel values representing "nothing" and "everything".

python-scraperlib is a library, so there is no such thing as a command-line flag. But we could add an argument, e.g. to the Creator init() method.

But logging only the first 100 bytes makes little sense to me; it has little value. It could only be used to check the mime type, in which case I would rather log only the illustration's mime type (python-scraperlib already has everything needed to detect it). Logging nothing is then not a big win, and I don't expect scrapers to want this option (one extra small log line usually does not matter). And logging everything, if optional, is better done directly in the scraper rather than in python-scraperlib.

So to sum up: I propose to log all raw metadata, except for the illustration, where we log only its mime type. And no new argument to any function.

WDYT?
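
For illustration only, the proposal above could be sketched roughly as follows. This is not the actual python-scraperlib code: `sniff_mimetype`, `describe_metadata` and `log_metadata` are hypothetical names, the magic-byte table is a simplified stand-in for whatever mime detection the library already provides, and treating keys that start with `Illustration_` as illustrations is an assumption about how the metadata is keyed.

```python
import logging

logger = logging.getLogger("metadata-sketch")  # hypothetical logger name

# Simplified stand-in for real mime detection: a few well-known magic bytes.
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_mimetype(content: bytes) -> str:
    """Guess a mime type from leading bytes, with a generic fallback."""
    for signature, mimetype in _SIGNATURES.items():
        if content.startswith(signature):
            return mimetype
    return "application/octet-stream"


def describe_metadata(name: str, value) -> str:
    """Build the log line for one metadata entry."""
    # Assumption: illustrations are binary values under "Illustration_*" keys.
    if name.startswith("Illustration_") and isinstance(value, (bytes, bytearray)):
        return f"Metadata: {name} mimetype = {sniff_mimetype(bytes(value))}"
    # Every other entry is logged raw.
    return f"Metadata: {name} = {value!r}"


def log_metadata(metadata: dict, log: logging.Logger = logger) -> None:
    """Log all entries at debug level; meant to run before any validation."""
    for name, value in metadata.items():
        log.debug("%s", describe_metadata(name, value))
```

A caller would invoke `log_metadata(...)` as the first step of its start routine, before checking mandatory entries, so the values are logged even when validation later fails. No new argument is needed on any existing function, which matches the summary above.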