openzim / ifixit

iFixit to ZIM scraper
GNU General Public License v3.0

Redirection error in latest (2022-09) ifixit zim (English and Russian version) #86

Closed laggykiller closed 1 year ago

laggykiller commented 1 year ago

The latest (2022-09) ifixit ZIMs (English and Russian versions) have a redirection error ('redirected you too many times').

The ZIMs in question are here: https://download.kiwix.org/zim/ifixit/ifixit_en_all_2022-09.zim https://download.kiwix.org/zim/ifixit/ifixit_ru_all_2022-09.zim

You can test it out now from here: https://library.kiwix.org/ifixit_en_all_2022-09 https://library.kiwix.org/ifixit_ru_all_2022-09

The previous versions (2022-06) are normal: https://download.kiwix.org/zim/ifixit/ifixit_en_all_2022-06.zim https://download.kiwix.org/zim/ifixit/ifixit_ru_all_2022-06.zim

Other languages are not affected

When viewed from kiwix-serve, it produces a redirection error: [image]

When viewed from the desktop application (e.g. on Windows), the homepage is broken: [image]

kelson42 commented 1 year ago

@rgaudin @benoit74 Indeed, this looks super serious, the homepage seems to redirect to itself!!!

benoit74 commented 1 year ago

This is weird, I did not release any change in the scraper since the previous version. Did you release any big changes in any underlying library?

rgaudin commented 1 year ago

Both were created using openzim/ifixit:0.2.1 so can't be related to a code change anywhere.

kelson42 commented 1 year ago

@rgaudin @benoit74 I strongly suspect this is linked to something which has changed upstream. Anyway the problem is acute! We should disable the recipes and remove the problematic ZIM files from the repo!

rgaudin commented 1 year ago

@kelson42 this might be a kiwix-serve issue. The ZIM entries look correct:

# this is OK. that's what it always looks like
zim.main_entry.is_redirect
> True
zim.main_entry.get_redirect_entry()
> Entry(url=Main-Page, title=Main-Page)

# here we see that we redirect from Main-Page to home/home. Again, that's normal. We do it everywhere
zim.main_entry.get_redirect_entry().is_redirect
> True
zim.main_entry.get_redirect_entry().get_redirect_entry()
> Entry(url=home/home, title=iFixit: The Free Repair Manual)
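The manual walk above (main entry, then Main-Page, then home/home) can be automated. Below is a minimal sketch that relies only on the two members used in the session above, `is_redirect` and `get_redirect_entry()`. `FakeEntry` is a hypothetical stand-in so the snippet runs without a ZIM file, and `resolve_redirects` and `max_hops` are names invented for this illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeEntry:
    """Hypothetical stand-in exposing only the members used in the session above."""

    path: str
    target: Optional["FakeEntry"] = None

    @property
    def is_redirect(self) -> bool:
        return self.target is not None

    def get_redirect_entry(self) -> "FakeEntry":
        return self.target


def resolve_redirects(entry, max_hops: int = 10):
    """Follow redirects and return (final_entry, hops).

    Raises RuntimeError when the chain does not end within max_hops,
    which is what an entry redirecting to itself would trigger.
    """
    hops = 0
    while entry.is_redirect:
        if hops >= max_hops:
            raise RuntimeError(f"redirect chain longer than {max_hops} hops")
        entry = entry.get_redirect_entry()
        hops += 1
    return entry, hops


# Same shape as the chain shown above: main entry -> Main-Page -> home/home
home = FakeEntry("home/home")
main_page = FakeEntry("Main-Page", target=home)
main_entry = FakeEntry("mainPage", target=main_page)

final, hops = resolve_redirects(main_entry)
print(final.path, hops)
```

With a real archive, the same function should accept `zim.main_entry` directly, assuming the entry objects behave as in the session above.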

Problem revolves around the handling of /:

curl -I http://192.168.5.80:9999/ifixit_en_all_2022-09/
HTTP/1.1 302 Found
Connection: close
Content-Length: 0
Location: /ifixit_en_all_2022-09/
Access-Control-Allow-Origin: *
Cache-Control: no-cache, no-store, must-revalidate
Date: Mon, 03 Oct 2022 09:55:23 GMT
======================
Requesting :
full_url  : /ifixit_en_all_2022-09/
method    : OTHER (1)
version   : HTTP/1.1
request#  : 0
headers   :
 - accept : '*/*'
 - host : '192.168.5.80:9999'
 - user-agent : 'curl/7.79.1'
arguments :
Parsed :
full_url: /ifixit_en_all_2022-09/
url   : /ifixit_en_all_2022-09/
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 302
headers :
 - Location: '/ifixit_en_all_2022-09/'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
Request time : 0.005543s
----------------------
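The response above is a 302 whose Location header is the very URL that was requested, which is why browsers report too many redirects. A small sketch of how such a self-redirect could be detected from a status code and a Location header follows; `is_self_redirect` is a hypothetical helper written for this illustration and is not part of kiwix-serve or libkiwix.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_self_redirect(request_url: str, status: int, location: Optional[str]) -> bool:
    """Return True when a redirect response points back at the requested URL."""
    if status not in REDIRECT_CODES or not location:
        return False
    # Location may be relative (as in the trace above), so resolve it first.
    target = urlsplit(urljoin(request_url, location))
    source = urlsplit(request_url)
    return (source.scheme, source.netloc, source.path, source.query) == (
        target.scheme,
        target.netloc,
        target.path,
        target.query,
    )


# Values copied from the curl trace above
url = "http://192.168.5.80:9999/ifixit_en_all_2022-09/"
print(is_self_redirect(url, 302, "/ifixit_en_all_2022-09/"))
```

Feeding it the status and Location of a HEAD request against each book's root URL would be one way to spot affected ZIMs on a server.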

Accessing the home/home entry directly works as expected

curl -I http://192.168.5.80:9999/ifixit_en_all_2022-09/home/home
HTTP/1.1 200 OK
Connection: close
Content-Length: 21948
Content-Type: text/html
Access-Control-Allow-Origin: *
ETag: "1664790915539257/c"
Cache-Control: max-age=2723040, public
Date: Mon, 03 Oct 2022 09:57:59 GMT
======================
Requesting :
full_url  : /ifixit_en_all_2022-09/home/home
method    : OTHER (1)
version   : HTTP/1.1
request#  : 1
headers   :
 - accept : '*/*'
 - host : '192.168.5.80:9999'
 - user-agent : 'curl/7.79.1'
arguments :
Parsed :
full_url: /ifixit_en_all_2022-09/home/home
url   : /ifixit_en_all_2022-09/home/home
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Found home/home
mimeType: text/html
Response :
httpResponseCode : 200
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1664791208912869/c"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.003537s
----------------------

Kiwix-JS is not affected by this bug, nor is kiwix-desktop macOS (97)

rgaudin commented 1 year ago

As stated in the linked ticket on libkiwix, this is due to an empty-path entry:

[MainThread::2022-09-16 10:39:33,957] DEBUG:Normalizing href https://www.ifixit.com/Info/Sales Policies
[MainThread::2022-09-16 10:39:34,874] DEBUG:Result is 
[MainThread::2022-09-16 10:39:34,877] DEBUG:Adding item in ZIM at path ''

rgaudin commented 1 year ago

Problem is in normalize() which doesn't expect the redirect to be on a different domain…

https://github.com/openzim/ifixit/blob/f387d7d7fd561e3593f8548d4f16bb020e7f45b8/ifixit2zim/shared.py#L335-L357
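To illustrate the kind of guard that would avoid an empty-path entry, here is a minimal sketch. It is not the project's `normalize()`: `to_zim_path` and `main_host` are hypothetical names, the thread does not say which host the redirect landed on (so `other.example.com` is a placeholder), and the real function handles more cases than this. The idea is simply to treat a final URL on another host as external and to never return `''` as a path.

```python
from typing import Optional
from urllib.parse import unquote, urlsplit


def to_zim_path(final_url: str, main_host: str = "www.ifixit.com") -> Optional[str]:
    """Map a post-redirect URL to a ZIM path, or None if it should not be stored.

    Hypothetical helper: None means "treat as external / skip", so the caller
    never adds an item at the empty path.
    """
    parts = urlsplit(final_url)
    # A redirect that ends on another host cannot be mapped to a local path.
    if parts.netloc and parts.netloc.lower() != main_host:
        return None
    path = unquote(parts.path).strip("/")
    # An empty result would collide with the handling of "/", so reject it.
    return path or None


print(to_zim_path("https://www.ifixit.com/Info/Sales%20Policies"))
print(to_zim_path("https://other.example.com/"))
```

Returning `None` rather than an empty string forces the caller to decide explicitly what to do with such links, for example keeping them as external URLs.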

rgaudin commented 1 year ago

Moved those two 2022-09 ZIMs to dev. Releasing and relaunching both