Closed by benoit74 1 day ago
@benoit74 I'd like to work on this.
One possible solution is to convert the list into a set and back to a list again so that duplicates will be removed.
self.tags = list(set([*self.tags, "_category:ted", "ted", "_videos:yes"]))
WDYT?
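One caveat with the set round-trip: a set does not preserve order, so the resulting tag order can differ from the input order. Below is a minimal sketch of an order-preserving alternative, assuming the tags are plain strings. `dedupe_tags` is a hypothetical helper name used for illustration only, not something that exists in ted2zim or scraperlib.

```python
def dedupe_tags(tags):
    """Return tags with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(tags))


# Hypothetical existing tags that already contain some of the defaults
existing = ["ted", "_category:ted"]
merged = dedupe_tags([*existing, "_category:ted", "ted", "_videos:yes"])
print(merged)  # ['ted', '_category:ted', '_videos:yes']
```

With this, the one-liner would become something like `self.tags = dedupe_tags([*self.tags, "_category:ted", "ted", "_videos:yes"])`.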
Should probably be done in scraperlib
Agreed, let's transfer the issue.
@dan-niles yes, that's the idea, but it should be done in scraperlib so that it benefits all scrapers. Are you still interested?
@benoit74 Sure, I'm up for it. I think we can remove the duplicates inside the config_metadata method in the scraperlib code.
I noticed that some scrapers, like ted and youtube, use the make_zim_file function from scraperlib, which initializes a Creator object and calls the config_metadata method.
Others, like warc2zim and kolibri, initialize a Creator object and call the config_metadata method directly.
Since all of these scrapers eventually end up calling the config_metadata method, I think if we do the deduplication there, we only have to update one place. What do you think?
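To illustrate the idea, here is a rough sketch of deduplicating inside a config_metadata-style method. `SketchCreator` is a hypothetical stand-in written for this example, not scraperlib's actual Creator class, and the `tags` parameter name and the handling of a semicolon-separated string are assumptions, not the real signature.

```python
class SketchCreator:
    """Hypothetical stand-in for a ZIM creator; NOT scraperlib's real Creator."""

    def __init__(self):
        self.metadata = {}

    def config_metadata(self, *, tags="", **extras):
        # Assumption: tags arrive either as a list of strings or as a
        # single semicolon-separated string.
        if isinstance(tags, str):
            tags = tags.split(";")
        # Drop surrounding whitespace and empty entries
        cleaned = [tag.strip() for tag in tags if tag.strip()]
        # dict.fromkeys removes duplicates while keeping first-seen order
        self.metadata["Tags"] = ";".join(dict.fromkeys(cleaned))
        self.metadata.update(extras)
        return self


creator = SketchCreator().config_metadata(
    tags=["ted", "_category:ted", "ted", "_videos:yes"]
)
print(creator.metadata["Tags"])  # ted;_category:ted;_videos:yes
```

Because both the make_zim_file path and the direct callers go through this one method, each scraper could keep passing its tags unchanged.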
Yep, this makes sense. Good observations!
Strongly related to #164, should be implemented together
When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake.
https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93