openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

[next major] remove **extra from Creator.config_metadata #205

Open rgaudin opened 1 month ago

rgaudin commented 1 month ago

In https://github.com/kiwix/operations/issues/286 we had two misspelled yet undetected metadata: tags and scraper.

I think accepting extra metadata in this method defeats the purpose of having them all exposed as explicit parameters. I also think its use is marginal and that additional metadata can still be added by other means.
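To illustrate the concern, here is a minimal sketch of how a `**extra` catch-all can hide a misspelled keyword. The two functions below are hypothetical stand-ins and not the actual scraperlib signatures; the `Tag` typo is made up for the example.

```python
def config_metadata_with_extra(
    *, Name: str, Title: str, Tags: str = "", Scraper: str = "", **extra: str
) -> dict[str, str]:
    """Hypothetical stand-in: explicit parameters plus a **extra catch-all."""
    metadata = {"Name": Name, "Title": Title, "Tags": Tags, "Scraper": Scraper}
    metadata.update(extra)  # any unknown keyword ends up here, typos included
    return metadata


def config_metadata_strict(
    *, Name: str, Title: str, Tags: str = "", Scraper: str = ""
) -> dict[str, str]:
    """Hypothetical stand-in: same parameters, no catch-all."""
    return {"Name": Name, "Title": Title, "Tags": Tags, "Scraper": Scraper}


# "Tag" instead of "Tags": silently stored as a separate, unexpected metadata.
loose = config_metadata_with_extra(Name="demo", Title="Demo", Tag="wikipedia")

# Without **extra, the same typo is rejected by Python itself.
try:
    config_metadata_strict(Name="demo", Title="Demo", Tag="wikipedia")
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)
```

With the catch-all, the intended `Tags` value stays empty and nobody notices; without it, the call fails immediately with a `TypeError`.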

@benoit74 Can we get rid of this?

benoit74 commented 1 month ago

I would consider splitting it in two: config_std_metadata (to be used by default) and config_extra_metadata (for scrapers like warc2zim that want to add custom metadata). This seems important to me so that both methods can still benefit from the same logic (currently we remove control characters, for instance, but we might add more logic in the future). I would even recommend having config_extra_metadata enforce the X- prefix we used in warc2zim for X-ContentDate, so that we further limit the risk of strange metadata. WDYT?

And obviously we need to keep config_indexing till the next major.
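A rough sketch of what this split could look like, assuming a simplified `MetadataConfig` class that only stores values in a dict. The class name, the `_store` helper, the parameter list, and the exact control-character rule are all assumptions made for illustration; none of this is the current scraperlib implementation.

```python
import re
from collections.abc import Mapping

# Assumed cleaning rule: strip C0 control characters and DEL.
# The rule scraperlib actually applies may differ.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


class MetadataConfig:
    """Hypothetical holder showing the std/extra split with shared logic."""

    def __init__(self) -> None:
        self.metadata: dict[str, str] = {}

    def _store(self, name: str, value: str) -> None:
        # Single code path, so both public methods get the same cleaning.
        self.metadata[name] = _CONTROL_CHARS.sub("", value)

    def config_std_metadata(
        self, *, Name: str, Title: str, Tags: str = "", Scraper: str = ""
    ) -> "MetadataConfig":
        # Explicit keyword-only parameters: a misspelled name is a TypeError.
        for name, value in (
            ("Name", Name),
            ("Title", Title),
            ("Tags", Tags),
            ("Scraper", Scraper),
        ):
            self._store(name, value)
        return self

    def config_extra_metadata(self, extra: Mapping[str, str]) -> "MetadataConfig":
        # Custom metadata must carry the X- prefix, as with X-ContentDate.
        for name, value in extra.items():
            if not name.startswith("X-"):
                raise ValueError(f"extra metadata must start with 'X-': {name!r}")
            self._store(name, value)
        return self


config = MetadataConfig()
config.config_std_metadata(Name="demo", Title="De\x00mo")
config.config_extra_metadata({"X-ContentDate": "2024-01-01"})
```

Taking a mapping rather than `**kwargs` in `config_extra_metadata` keeps names such as `X-ContentDate`, which are not valid Python identifiers, easy to pass.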

rgaudin commented 1 month ago

I would even recommend having config_extra_metadata enforce the X- prefix we used in warc2zim for X-ContentDate, so that we further limit the risk of strange metadata. WDYT?

Works for me, as long as there's still the possibility to add non-prefixed metadata (via add_metadata()).