openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Dynamic URL rewriting in Wombat is not working on URLs with Unicode characters #257

Closed benoit74 closed 1 month ago

benoit74 commented 1 month ago

Sample case:

HTML page located at ZIM path iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86-%DB%8C%DA%A9-%DA%A9%D9%88%D9%84%D8%A8%D8%B1-%D8%AF%DB%8C%DA%AF%D8%B1-%D8%A8%D8%B1-%D8%A7%D8%AB%D8%B1-%D8%B4%D9%84%DB%8C%DA%A9-%D9%86%DB%8C%D8%B1%D9%88%D9%87%D8%A7%DB%8C-%D9%85%D8%B1%D8%B2%D8%A8%D8%A7%D9%86%DB%8C-%D8%A7%D8%B2-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%DB%8C%DA%A9-%D9%85%D8%AA%D8%B1%DB%8C/

Kiwix-serve serving the ZIM at http://example.com/content/iranwire-com_far_all_2024-05/

In the browser (a recent Firefox on macOS, at least), the URL is displayed as http://example.com/content/iranwire-com_far_all_2024-05/iranwire.com/fa/news-1/128508-کشته-شدن-یک-کولبر-دیگر-بر-اثر-شلیک-نیروهای-مرزبانی-از-فاصله-فاصله-یک-متری/ but the URL actually fetched is correct: http://example.com/content/iranwire-com_far_all_2024-05/iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86-%DB%8C%DA%A9-%DA%A9%D9%88%D9%84%D8%A8%D8%B1-%D8%AF%DB%8C%DA%AF%D8%B1-%D8%A8%D8%B1-%D8%A7%D8%AB%D8%B1-%D8%B4%D9%84%DB%8C%DA%A9-%D9%86%DB%8C%D8%B1%D9%88%D9%87%D8%A7%DB%8C-%D9%85%D8%B1%D8%B2%D8%A8%D8%A7%D9%86%DB%8C-%D8%A7%D8%B2-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%DB%8C%DA%A9-%D9%85%D8%AA%D8%B1%DB%8C/

In this situation, URL rewriting fails and produces invalid links.

The problem is that the prefix which is passed to JS URL rewriting function should be http://example.com/content/iranwire-com_far_all_2024-05/ while it is http://example.com/content/iranwire-com_far_all_2024-05/iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86-%DB%8C%DA%A9-%DA%A9%D9%88%D9%84%D8%A8%D8%B1-%D8%AF%DB%8C%DA%AF%D8%B1-%D8%A8%D8%B1-%D8%A7%D8%AB%D8%B1-%D8%B4%D9%84%DB%8C%DA%A9-%D9%86%DB%8C%D8%B1%D9%88%D9%87%D8%A7%DB%8C-%D9%85%D8%B1%D8%B2%D8%A8%D8%A7%D9%86%DB%8C-%D8%A7%D8

This prefix is computed dynamically from current_url.substring(0, current_url.length - ("{{ path }}".length))

current_url is http://example.com/content/iranwire-com_far_all_2024-05/iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86-%DB%8C%DA%A9-%DA%A9%D9%88%D9%84%D8%A8%D8%B1-%D8%AF%DB%8C%DA%AF%D8%B1-%D8%A8%D8%B1-%D8%A7%D8%AB%D8%B1-%D8%B4%D9%84%DB%8C%DA%A9-%D9%86%DB%8C%D8%B1%D9%88%D9%87%D8%A7%DB%8C-%D9%85%D8%B1%D8%B2%D8%A8%D8%A7%D9%86%DB%8C-%D8%A7%D8%B2-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%D9%81%D8%A7%D8%B5%D9%84%D9%87-%DB%8C%DA%A9-%D9%85%D8%AA%D8%B1%DB%8C/ (which is correct)

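But path is iranwire.com/fa/news-1/128508-کشته-شدن-یک-کولبر-دیگر-بر-اثر-شلیک-نیروهای-مرزبانی-از-فاصله-فاصله-یک-متری/, which is fully decoded and therefore much shorter than the percent-encoded path at the end of current_url. Too few characters are stripped, so the computed prefix is wrong.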
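The length mismatch can be reproduced outside the browser. The following is a minimal Python sketch, not the actual template code: compute_prefix is a hypothetical stand-in for the JS substring expression, and the path is shortened to the first two Persian words of the sample URL.

```python
from urllib.parse import unquote

# Base URL from the sample case; the path below is a shortened version of it.
base = "http://example.com/content/iranwire-com_far_all_2024-05/"
encoded_path = "iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86/"

# The ZIM path is the fully decoded form of the same path.
decoded_path = unquote(encoded_path)

# What the browser reports: always percent-encoded.
current_url = base + encoded_path


def compute_prefix(current_url: str, path: str) -> str:
    """Hypothetical Python equivalent of
    current_url.substring(0, current_url.length - path.length)."""
    return current_url[: len(current_url) - len(path)]


# Decoded path: too few characters are stripped, part of the path remains.
wrong_prefix = compute_prefix(current_url, decoded_path)
# Encoded path: lengths match, the expected prefix comes out.
right_prefix = compute_prefix(current_url, encoded_path)

print(len(encoded_path), len(decoded_path))
print(wrong_prefix)
print(right_prefix)
```

Each Persian character takes two bytes in UTF-8 and therefore six characters once percent-encoded, so the gap between the two lengths grows with the number of non-ASCII characters in the path.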

We need to quote (URL-encode) this path before passing it to the HTML. The path is always fully decoded (it is the ZIM path), while current_url is always encoded (it is the browser URL: window.location.href minus the fragment/hash).
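A minimal sketch of that idea in Python, using urllib.parse.quote from the standard library. The helper names encode_zim_path and compute_prefix are hypothetical and are not taken from the warc2zim code base; the actual fix may be implemented differently. The sample path is shortened.

```python
from urllib.parse import quote, unquote


def encode_zim_path(zim_path: str) -> str:
    """Hypothetical helper: percent-encode a decoded ZIM path,
    keeping '/' so that path separators are preserved."""
    return quote(zim_path, safe="/")


def compute_prefix(current_url: str, path: str) -> str:
    """Hypothetical Python equivalent of the JS substring computation."""
    return current_url[: len(current_url) - len(path)]


base = "http://example.com/content/iranwire-com_far_all_2024-05/"
encoded_path = "iranwire.com/fa/news-1/128508-%DA%A9%D8%B4%D8%AA%D9%87-%D8%B4%D8%AF%D9%86/"
zim_path = unquote(encoded_path)   # decoded, as stored in the ZIM
current_url = base + encoded_path  # encoded, as reported by the browser

# Encode the path before it is handed to the prefix computation.
path_for_template = encode_zim_path(zim_path)
prefix = compute_prefix(current_url, path_for_template)
print(prefix)
```

This assumes the browser percent-encodes the path the same way quote does. That holds for the non-ASCII characters in this sample, but it may not hold for every reserved ASCII character, so a real fix would need to be checked against such paths.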

benoit74 commented 1 month ago

Fixed by https://github.com/openzim/warc2zim/pull/258