openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Deduplicate ZIM tag values #156

Closed benoit74 closed 1 day ago

benoit74 commented 2 months ago

When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake.

https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93

dan-niles commented 2 months ago

@benoit74 I'd like to work on this.

One possible solution is to convert the list into a set and back to a list again so that duplicates will be removed.

self.tags = list(set([*self.tags, "_category:ted", "ted", "_videos:yes"]))

WDYT?
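One caveat with the set round-trip is that a set does not preserve order, so the resulting tag order can differ from the input. A minimal order-preserving sketch follows; the helper name `deduplicate_tags` and the sample tag values are hypothetical and not taken from either the ted scraper or scraperlib:

```python
def deduplicate_tags(tags):
    """Return tags with duplicates removed, keeping first-occurrence order."""
    # dicts keep insertion order (Python 3.7+), so fromkeys drops repeats
    # without reordering the remaining values.
    return list(dict.fromkeys(tags))


# Hypothetical input where the user already passed "ted" as a tag.
tags = deduplicate_tags(["ted", "talks", "_category:ted", "ted", "_videos:yes"])
```

Whether tag order matters for ZIM readers is not established in this thread; keeping it stable mainly makes the output deterministic.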

rgaudin commented 2 months ago

Should probably be done in scraperlib

benoit74 commented 2 months ago

> Should probably be done in scraperlib

Agreed, let's transfer the issue.

@dan-niles yes, that's the idea, but it should be done in scraperlib so that it benefits all scrapers. Are you still interested?

dan-niles commented 2 months ago

@benoit74 Sure, I'm up for it. I think we can remove the duplicates inside the config_metadata method in the scraperlib code.

I noticed that some scrapers, like ted and youtube, use the make_zim_file function from scraperlib, which initializes a Creator object and calls the config_metadata method, whereas warc2zim and kolibri initialize a Creator object and call the config_metadata method directly.

Since all of these scrapers eventually end up calling the config_metadata method, I think if we do the deduplication there, we only have to make the change in one place. What do you think?
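To illustrate the idea of deduplicating in a single place, here is a rough sketch. `SketchCreator` is a hypothetical stand-in, not scraperlib's real `Creator`, and its `config_metadata` signature is an assumption made only for this example. It also assumes tags arrive either as a list or as a ";"-separated string:

```python
class SketchCreator:
    """Hypothetical stand-in for scraperlib's Creator; not the real class."""

    def __init__(self):
        self.metadata = {}

    def config_metadata(self, *, Tags=(), **extras):
        # Assumption: tags are given as a list or a ";"-separated string.
        if isinstance(Tags, str):
            Tags = Tags.split(";")
        # Trim whitespace and skip empty entries before deduplicating.
        cleaned = [tag.strip() for tag in Tags if tag.strip()]
        # dict.fromkeys drops repeats while keeping first-seen order.
        self.metadata["Tags"] = ";".join(dict.fromkeys(cleaned))
        self.metadata.update(extras)
        return self


# Every caller goes through the same method, so each one gets deduplication.
creator = SketchCreator().config_metadata(
    Tags=["ted", "_category:ted", "ted", "_videos:yes"]
)
```

With this shape, both a helper like make_zim_file and scrapers that build the creator themselves would pick up the behavior without any per-scraper change.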

benoit74 commented 2 months ago

Yep, this makes sense. Good observations!

benoit74 commented 3 weeks ago

Strongly related to #164, should be implemented together