openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Do not rewrite href containing only a fragment #277

Closed benoit74 closed 1 month ago

benoit74 commented 1 month ago

On hra-news.org, some anchor tags in the HTML documents have an href that is just a fragment, like <a href="#somevalue">

This is rewritten as <a href="./#tab2-3">, which unfortunately breaks the JS on hra-news.org (even though the rewritten value is probably a more valid URL).

I suggest the scraper detect URL values starting with a # and keep them as-is.
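
A minimal sketch of that suggestion, not warc2zim's actual code: the names is_fragment_only and rewrite_href are hypothetical, and rewrite stands in for whatever URL rewriting the scraper already applies.

```python
from typing import Callable


def is_fragment_only(href: str) -> bool:
    """Return True when href is just a fragment, e.g. "#tab2-3"."""
    return href.strip().startswith("#")


def rewrite_href(href: str, rewrite: Callable[[str], str]) -> str:
    """Apply `rewrite` to href, except when href is only a fragment.

    `rewrite` is a placeholder for the scraper's existing rewriting logic
    (an assumption made for this illustration).
    """
    if is_fragment_only(href):
        # Keep the value untouched so that JS relying on href="#..." still works.
        return href
    return rewrite(href)


def add_dot_slash(url: str) -> str:
    """Toy rewriter used only to exercise rewrite_href in this sketch."""
    return "./" + url
```

With the toy rewriter, rewrite_href("#tab2-3", add_dot_slash) leaves the value as "#tab2-3", while rewrite_href("page.html#tab2-3", add_dot_slash) is still rewritten.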

The rewritten URL is probably a bit wrong anyway, because a path that is just ./ means nothing.

benoit74 commented 1 month ago

> The rewritten URL is probably a bit wrong anyway, because a path that is just ./ means nothing.

Actually, that was wrong: a ./ path is fine when the current URL ends with a folder, like https://www.kiwix.org/article/.
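
This can be checked with standard URL resolution. The sketch below uses Python's urllib.parse.urljoin; the second base URL (without the trailing slash) is an assumption added for contrast and does not come from the site.

```python
from urllib.parse import urljoin

folder_base = "https://www.kiwix.org/article/"
# Assumed counter-example: the same URL without the trailing slash.
file_base = "https://www.kiwix.org/article"

# When the base ends with a folder, both forms resolve to the same target.
folder_rewritten = urljoin(folder_base, "./#tab2-3")
folder_original = urljoin(folder_base, "#tab2-3")
print(folder_rewritten)  # https://www.kiwix.org/article/#tab2-3
print(folder_original)   # https://www.kiwix.org/article/#tab2-3

# When it does not, "./" resolves to the parent folder and the target changes.
file_rewritten = urljoin(file_base, "./#tab2-3")
file_original = urljoin(file_base, "#tab2-3")
print(file_rewritten)    # https://www.kiwix.org/#tab2-3
print(file_original)     # https://www.kiwix.org/article#tab2-3
```

Even where the resolved URL is identical, the attribute value itself differs, which is what matters to scripts that read href directly.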

Hence we should simply not rewrite URLs that consist of only a fragment.