openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Add utility function to compute ZIM Tags #164

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

All scrapers set ZIM tags from a user-provided string in which values are separated by semicolons (or at least they should).

Some scrapers also set a few tags automatically, in addition to the user-provided tags.

The combined list of tags should be de-duplicated, and user-provided tags should be stripped of any leading or trailing whitespace.

A utility function at the zimscraperlib level that shares this logic would save every scraper from reinventing the wheel. It would take two parameters, default_tags (list of str) and user_tags (str), and return a list of tags ready to be passed to the creator. Returning a set might be better if the creator supports it; this needs to be checked at the validate_tags and libzim levels.
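To make the proposal concrete, here is a minimal sketch of what such a helper could look like. The name compute_tags is hypothetical, not an existing zimscraperlib function, and the choices to drop empty entries and to preserve first-seen order are assumptions rather than anything decided in this issue.

```python
from __future__ import annotations


def compute_tags(default_tags: list[str], user_tags: str) -> list[str]:
    """Merge scraper defaults with a semicolon-separated user string.

    Hypothetical helper: the name and exact behavior are assumptions.
    """
    # A dict keeps insertion order, so it de-duplicates while
    # preserving the order in which tags were first seen.
    merged: dict[str, None] = {}
    candidates = list(default_tags) + user_tags.split(";")
    for tag in candidates:
        cleaned = tag.strip()  # drop leading / trailing whitespace
        if cleaned:  # assumption: skip empty entries such as ";;"
            merged[cleaned] = None
    return list(merged)


tags = compute_tags(["_ftindex:yes", "wikipedia"], " wikipedia ; science;; ")
print(tags)
```

Returning a list keeps the order stable; if the creator turns out to accept a set, wrapping the result in set() would be a one-line change.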

warc2zim is going to have what looks like a promising implementation (after https://github.com/openzim/warc2zim/pull/267 is merged).

benoit74 commented 3 weeks ago

Strongly related to https://github.com/openzim/python-scraperlib/issues/156; the two should be implemented together.