Closed benoit74 closed 1 month ago
As discussed live and proposed in https://github.com/openzim/python-scraperlib/issues/119, we could just disable the metadata check in scraperlib.
This could be an opt-in flag in general, and the default when using playlist mode. And we could display a warning when metadata is not valid.
This would let us keep supporting this mode for those who want to create their own ZIMs, while still ensuring metadata quality for openZIM files. It would also allow us to upgrade to 3.x in an elegant way.
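To make the proposal concrete, here is a minimal sketch of the behavior, not scraperlib's actual API. The names `check_metadata`, `disable_metadata_checks` and `resolve_disable_checks` are hypothetical, and the length limits are assumed values for illustration only; the real rules live in scraperlib.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("scraper")

# Assumed limits, for illustration only; the real rules live in scraperlib.
MAX_TITLE_LENGTH = 30
MAX_DESCRIPTION_LENGTH = 80


def find_metadata_issues(metadata: dict[str, str]) -> list[str]:
    """Return a human-readable list of problems found in the metadata."""
    issues = []
    title = metadata.get("Title", "")
    description = metadata.get("Description", "")
    if not title:
        issues.append("Title is missing")
    elif len(title) > MAX_TITLE_LENGTH:
        issues.append(f"Title is longer than {MAX_TITLE_LENGTH} characters")
    if not description:
        issues.append("Description is missing")
    elif len(description) > MAX_DESCRIPTION_LENGTH:
        issues.append(
            f"Description is longer than {MAX_DESCRIPTION_LENGTH} characters"
        )
    return issues


def check_metadata(
    metadata: dict[str, str], *, disable_metadata_checks: bool = False
) -> list[str]:
    """Fail on invalid metadata, or only warn when checks are disabled."""
    issues = find_metadata_issues(metadata)
    if issues and not disable_metadata_checks:
        raise ValueError("Invalid metadata: " + "; ".join(issues))
    for issue in issues:
        logger.warning("Invalid metadata (check disabled): %s", issue)
    return issues


def resolve_disable_checks(playlist_mode: bool, flag: bool | None) -> bool:
    """Opt-in in general, but the default when playlist mode is used."""
    return playlist_mode if flag is None else flag
```

With this shape, a regular run still fails on invalid metadata, while a playlist-mode run (or an explicit opt-in) only logs a warning and carries on.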
@kelson42 WDYT?
This approach has been implemented in the TED scraper: https://github.com/openzim/ted/pull/170
In this scraper, we have a `playlist_mode` which creates one ZIM per playlist found in a given YouTube user / channel. This mode is convenient for creating many ZIMs at once, but it poses a metadata quality issue, since titles, descriptions, etc. are automatically sourced from YouTube.
With the move to scraperlib 3.x, creating a ZIM with an invalid title, description, etc. will fail. Unfortunately, this check only happens at the very end of the scraping, since we still use the "zimwriterfs" mode with `make_zim_file` at the end of the scraper, after all videos have been downloaded and re-encoded.

We should either:
This is in fact a blocker for #175 (unless we accept shipping a feature that will not work in 90% of cases).