openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Add support for alias (in addition to redirection) #148

Open benoit74 opened 6 months ago

benoit74 commented 6 months ago

Just like we have add_redirect in zim/creator and add_redirects_to_zim in zim/filesystem, we should now add support for the "new" ZIM alias, with add_alias in zim/creator and add_aliases_to_zim in zim/filesystem.

rgaudin commented 6 months ago

I'd like us to implement this the way zimwriterfs does (I don't think it's there yet) so this feature remains a close-to-drop-in alternative (we should implement a drop-in zimwriterfs script at some point). If we are to implement this first, maybe coordinate with @kelson42 so the input can be the same.

benoit74 commented 6 months ago

@kelson42 three questions:

kelson42 commented 6 months ago
  • do you have any plans to add support for aliases in zimwriterfs?

No, not even a ticket! :(

  • do you intend to use a different format than the one used for redirects? (libzim API seems identical)

I guess you refer to the --redirects option, but this is actually not the only way to create redirects in a ZIM. If you have an HTML file with an HTML redirect, then it will create a redirect as well.

I believe that checking the files (symlink, hardlinks, same files) would be the first approach to create aliases.
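A file-based check of this kind could be sketched as follows. This is only an illustration under stated assumptions, not how zimwriterfs or scraperlib behaves: the function name find_alias_groups is hypothetical, the "canonical" entry is simply the alphabetically first path, and files are compared by inode first and by SHA-256 of their content second.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_alias_groups(root: Path) -> dict[str, list[str]]:
    """Map one canonical relative path to the other paths with identical content.

    Hypothetical helper: symlinks and hard links are caught by inode,
    plain copies are caught by content hash.
    """
    # Step 1: group paths by (device, inode). Path.stat() follows symlinks,
    # so a symlink and its target land in the same group, as do hard links.
    by_inode: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        by_inode[(stat.st_dev, stat.st_ino)].append(path.relative_to(root).as_posix())

    # Step 2: hash each distinct inode once and merge groups whose content
    # is identical (separate copies of the same file).
    by_digest: dict[str, list[str]] = defaultdict(list)
    for paths in by_inode.values():
        digest = hashlib.sha256((root / paths[0]).read_bytes()).hexdigest()
        by_digest[digest].extend(paths)

    # Step 3: keep only groups with more than one path; the first path
    # (alphabetically) is treated as the real entry, the rest as alias candidates.
    groups = {}
    for paths in by_digest.values():
        if len(paths) > 1:
            ordered = sorted(paths)
            groups[ordered[0]] = ordered[1:]
    return groups
```

Each key would then be added as a regular entry and each value as an alias pointing to it, assuming the creator exposes an alias call.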

If there is a clear need to support an option similar to --redirects for aliases, then I guess we will have to implement it.

  • shouldn't we add support for the isFront hint in the TSV file for both redirects and aliases? (covering all hints is pretty difficult, but this one is pretty important in most scenarios)

AFAIK (hope I'm right here) everything is Front for zimwriterfs... so I'm not sure about the need, but here again, if there is a need, we will have to.
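To make the isFront question concrete, here is a minimal sketch of a TSV reader that accepts an optional fourth column. The column layout (path, title, target path, optional is_front flag), the LinkEntry type and the parse_links_tsv name are all assumptions made for illustration; nothing here is an agreed format for zimwriterfs or scraperlib. The default of True mirrors the "everything is Front" behaviour described above.

```python
import csv
import io
from typing import NamedTuple


class LinkEntry(NamedTuple):
    """One redirect or alias row (hypothetical structure)."""

    path: str
    title: str
    target_path: str
    is_front: bool


def parse_links_tsv(text: str, default_front: bool = True) -> list[LinkEntry]:
    """Parse rows of `path<TAB>title<TAB>target_path[<TAB>is_front]`."""
    entries = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) not in (3, 4):
            raise ValueError(f"line {line_no}: expected 3 or 4 columns, got {len(row)}")
        path, title, target_path = row[:3]
        is_front = default_front
        # The fourth column is optional; an empty value keeps the default.
        if len(row) == 4 and row[3].strip():
            is_front = row[3].strip().lower() in ("1", "true", "yes")
        entries.append(LinkEntry(path, title, target_path, is_front))
    return entries
```

Keeping the flag optional means existing three-column redirect files would still parse unchanged, and the same reader could serve both redirects and aliases.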

benoit74 commented 3 months ago

OK, then let's remove the milestone for now until the needs become clearer. I don't even remember when I ran into this need. Probably around the YouTube or TED scrapers... not sure.