openzim / ifixit

iFixit to ZIM scraper
GNU General Public License v3.0

Fix issue with unquoted normalized URLs before regex matching #82

Closed benoit74 closed 2 years ago

benoit74 commented 2 years ago
[MainThread::2022-05-26 05:52:56,109] DEBUG:Processing category iphone_se_2022
[MainThread::2022-05-26 05:52:56,111] DEBUG:Normalizing href https://es.ifixit.com/Topic/iPad_Air_%285th_Generation%29
[IMG-T-39::2022-05-26 05:52:56,326] DEBUG:'images/https/guide-images.cdn.ifixit.com/igi/dQwnOBPEYGpiRuCL.standard' found in S3
[MainThread::2022-05-26 05:52:56,638] DEBUG:Result is /Device/iPad_Air_(5th_Generation)
[MainThread::2022-05-26 05:52:56,639] DEBUG:Normalizing href https://es.ifixit.com/Device/iPad_Air_
[MainThread::2022-05-26 05:52:57,200] DEBUG:Result is /Device/iPad_Air
[MainThread::2022-05-26 05:52:57,201] WARNING:Adding unexpected category ipad_air_ to scraping queue
[MainThread::2022-05-26 05:52:57,201] DEBUG:Normalizing href https://es.ifixit.com/Topic/iPhone_SE_2020
[MainThread::2022-05-26 05:52:57,707] DEBUG:Result is /Device/iPhone_SE_2020
[MainThread::2022-05-26 05:52:57,709] DEBUG:Normalizing href https://es.ifixit.com/Device/iPhone_SE_2020
[MainThread::2022-05-26 05:52:57,845] DEBUG:Result is /Device/iPhone_SE_2020
[MainThread::2022-05-26 05:52:58,310] DEBUG:Adding item in ZIM at path 'Device/iPhone_SE_2022'

The unexpected category "ipad_air_" is a bit weird, since no such category exists.

benoit74 commented 2 years ago

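Looks like the scraper is normalizing the URL twice in such a situation: first the quoted href (https://es.ifixit.com/Topic/iPad_Air_%285th_Generation%29), then a truncated form of its unquoted result (https://es.ifixit.com/Device/iPad_Air_), as the log above shows.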

The issue is that the result of the first regex matching is not quoted again before the second match. Since the regex does not permit all characters (here, the parentheses), we end up with only a partial match: "iPad_Air_" instead of "iPad_Air_(5th_Generation)".
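
To illustrate the failure mode, here is a minimal sketch in Python. The pattern `CATEGORY_RE` and both helper functions are hypothetical and are not the scraper's actual code; they only assume a regex that accepts percent-encoded path characters, as the description above suggests. Only `re`, `urllib.parse.quote`, and `urllib.parse.unquote` from the standard library are used.

```python
import re
from typing import Optional
from urllib.parse import quote, unquote

# Hypothetical pattern, NOT the scraper's real regex: it only accepts
# characters that can appear in a percent-encoded path segment.
CATEGORY_RE = re.compile(r"/(?:Device|Topic)/(?P<title>[\w%.\-]+)")


def extract_title(url: str) -> Optional[str]:
    """Match the URL as-is; stops at the first character the regex rejects."""
    match = CATEGORY_RE.search(url)
    return match.group("title") if match else None


def extract_title_requoted(url: str) -> Optional[str]:
    """Quote the URL again before matching, then unquote the captured title."""
    # Unquote first so already-encoded input is not encoded twice.
    requoted = quote(unquote(url), safe=":/")
    title = extract_title(requoted)
    return unquote(title) if title else None


# The unquoted result of a first normalization pass, as seen in the log.
unquoted = "https://es.ifixit.com/Device/iPad_Air_(5th_Generation)"

print(extract_title(unquoted))           # partial match, cut at "("
print(extract_title_requoted(unquoted))  # full title
```

In this sketch, matching the unquoted URL directly stops at the opening parenthesis, which reproduces the "ipad_air_" category seen in the log, while quoting the URL again before matching yields the complete title.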