openzim / zim-tools

Various ZIM command line tools
https://download.openzim.org/release/zim-tools/
GNU General Public License v3.0

Zimcheck internal URL checking seems to ignore URLencoding AND HTML entities #378

Closed kelson42 closed 10 months ago

kelson42 commented 11 months ago

One of the most important features of zimcheck seems to be really buggy and weak. The internal URL check, i.e. verifying that URLs in the HTML point to real entries in the ZIM, seems to just take the href value from the HTML and search for it, as is, in the archive.

This means that an error will be wrongly returned if:

The last scenario is the one that happens with this ZIM: wikipedia_en_canada_2023-10.zim.zip

I got the error:

$ zimcheck wikipedia_en_canada_2023-10.zim 
[INFO] Checking zim file wikipedia_en_canada_2023-10.zim
[INFO] Zimcheck version is 3.2.0
[INFO] Verifying ZIM-archive structure integrity...
[INFO] Avoiding redundant checksum test (already performed by the integrity check).
[INFO] Checking metadata...
[INFO] Searching for Favicon...
[INFO] Searching for main page...
[INFO] Verifying Articles' content...
[INFO] Searching for redundant articles...
  Verifying Similar Articles for redundancies...
[INFO] Checking for redirect loops...
[WARNING] Redundant data found:
  -/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-pt-br.vtt and -/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-pt.vtt
[ERROR] Invalid internal links found:
  The following links:
- ../-/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt
(-/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt) were not found in article A/Canada
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: <3 seconds.
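
To illustrate the kind of normalization the report is asking for, here is a minimal Python sketch (not zimcheck's actual code, which is C++). It assumes the href appears in the HTML source with the quotes written either as HTML entities or as percent-escapes. The issue does not show the raw HTML, so both raw values below are assumptions, as are the helper names `naive_resolve`, `normalize_href` and `resolve_link`. Query strings and fragments are not handled.

```python
import html
import posixpath
from urllib.parse import unquote


def naive_resolve(article_path: str, raw_href: str) -> str:
    """Resolve the href against the article path without decoding anything."""
    base_dir = posixpath.dirname(article_path)
    return posixpath.normpath(posixpath.join(base_dir, raw_href))


def normalize_href(raw_href: str) -> str:
    """Decode an href copied verbatim from the HTML source."""
    # HTML entities are decoded first (&quot; -> "), then percent-escapes (%22 -> ").
    return unquote(html.unescape(raw_href))


def resolve_link(article_path: str, raw_href: str) -> str:
    """Decode the href, then resolve it against the article path."""
    return naive_resolve(article_path, normalize_href(raw_href))


# Hypothetical archive contents, modelled on the entry named in the log.
entries = {
    "A/Canada",
    '-/File:"O_Canada",_performed_by_the_United_States_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt',
}

# Assumed raw attribute values: the same link written with HTML entities
# and with percent-encoding.
with_entities = (
    "../-/File:&quot;O_Canada&quot;,_performed_by_the_United_States"
    "_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt"
)
percent_encoded = (
    "../-/File:%22O_Canada%22,_performed_by_the_United_States"
    "_Third_Marine_Aircraft_Wing_Band.oga-bg.vtt"
)

for raw in (with_entities, percent_encoded):
    print(naive_resolve("A/Canada", raw) in entries,
          resolve_link("A/Canada", raw) in entries)
```

In this sketch the undecoded lookup misses the entry for both spellings, while the decoded lookup finds it. A real fix would also have to decide how to treat entry paths that legitimately contain a literal `%` or `&`.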
kelson42 commented 11 months ago

@veloman-yunkan @mgautierfr I'm very surprised to discover such a hairy bug so late. Please confirm and possibly fix (should be complicated) ASAP. This bug was actually discovered by hardening the testing around MWoffliner.

For the rest, it seems to work, and I am glad to merge and release in 3.3.0.