itkach / mw2slob

A tool to convert MediaWiki content to dictionaries in slob format
GNU General Public License v3.0

In some dictionaries images from the network are not loaded #16

Closed: sklart closed this issue 1 year ago

sklart commented 2 years ago

In some dictionaries images from the network are not loaded. Dictionary examples are available here:

neolurk-org-20220821.lzma2.slob 364.8MiB
ru.gta.fandom.com-20220822.lzma2.slob 17.7MiB

Commands used to create dictionaries:

mwscrape -c http://admin:admin@localhost:5984 gta.fandom.com/ru --site-path=/ --speed 3
mw2slob scrape http://admin:admin@localhost:5984/gta-fandom-com/ru --no-math --ensure-ext-image-urls -a sklart

mwscrape -c http://admin:admin@localhost:5984 http://neolurk.org/ --speed 5
mw2slob scrape http://admin:admin@localhost:5984/neolurk-org -a sklart -f common wiki

Please help me with this problem. Thanks

aisuneko commented 1 year ago

same here: https://github.com/itkach/mw2slob/issues/15

itkach commented 1 year ago

I think the fix for #15 should also fix neolurk. I can't verify this, since the site currently returns 403 to mwscrape (even though API requests seem to work from the browser and curl).

As for https://gta.fandom.com/, all the images are already external (so there is no need for --ensure-ext-image-urls), mostly served from static.wikia.nocookie.net. The issue is that this host checks the "Referer" header and returns 404 if the header is present and does not name a whitelisted host such as gta.fandom.com. You can verify this by making requests with curl and comparing the responses:

curl -v "https://static.wikia.nocookie.net/gta/images/9/9c/Post.png/revision/latest?cb=20160106165540&path-prefix=ru"

gives HTTP status 200 and returns the correct image as PNG (or WebP for a browser), while

curl -v -H"Referer:http://localhost:8013" "https://static.wikia.nocookie.net/gta/images/9/9c/Post.png/revision/latest?cb=20160106165540&path-prefix=ru"

gives 404 and returns a "missing image" JPEG.
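
The same comparison can be scripted. Below is a minimal Python sketch that uses only the standard library. The helper names `build_request` and `fetch_status` are hypothetical and are not part of mw2slob or mwscrape. It also assumes the server treats Python's default User-Agent the same way it treats curl's, which may not hold.

```python
import urllib.error
import urllib.request


def build_request(url, referer=None):
    """Build a GET request, optionally carrying a Referer header."""
    request = urllib.request.Request(url)
    if referer is not None:
        request.add_header("Referer", referer)
    return request


def fetch_status(url, referer=None, opener=urllib.request.urlopen):
    """Return (HTTP status, Content-Type) for url.

    HTTP error responses such as 404 are reported as a status
    instead of being raised. `opener` is injectable so the logic
    can be exercised without network access.
    """
    request = build_request(url, referer)
    try:
        with opener(request) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type")


if __name__ == "__main__":
    # Image URL and Referer value taken from the curl commands above.
    url = (
        "https://static.wikia.nocookie.net/gta/images/9/9c/Post.png"
        "/revision/latest?cb=20160106165540&path-prefix=ru"
    )
    print("no Referer:  ", fetch_status(url))
    print("with Referer:", fetch_status(url, referer="http://localhost:8013"))
```

If the behaviour described above still holds, the two printed lines should differ in both status code and content type.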