openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Bad redirect location should be ignored instead of stopping crawler #356

Closed by benoit74 1 month ago

benoit74 commented 1 month ago

Task: https://farm.zimit.kiwix.org/pipeline/6715f8a2-fd07-49db-ad5c-c7ff49431448/debug

Command:

zimit --url=https://dujardindansmavie.com --name=dujardindansmavie.com_0b0a7ff0 --zim-file=dujardindansmavie.com_0b0a7ff0.zim --userAgentSuffix=zimit.kiwix.org+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json --adminEmail=contact+zimfarm@kiwix.org --keep --publisher=openZIM

Error:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 507, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 146, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 265, in run
    self.gather_information_from_warc()
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 401, in gather_information_from_warc
    HttpUrl(urljoin(url, redirect_location))
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/url_rewriting.py", line 74, in __init__
    HttpUrl.check_validity(value)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/url_rewriting.py", line 103, in check_validity
    raise ValueError(f"Unsupported upper-case chars in hostname : {value}")
ValueError: Unsupported upper-case chars in hostname : http://Echo%2520-%2520Banniere%2520%C3%A9t%C3%A9%25202024

It is a bit sad to stop the whole conversion for just one bad URL ... but how did it manage to end up as a WARC record? This is clearly not supposed to happen.

benoit74 commented 1 month ago

The problem is that we have a redirect which targets this bad URL, http://Echo%2520-%2520Banniere%2520%C3%A9t%C3%A9%25202024. The bad URL itself is not something we find as a record inside the WARC.

We should probably just ignore these bad redirections.