itkach / mw2slob

A tool to convert MediaWiki content to dictionaries in slob format
GNU General Public License v3.0

Broken external image URLs while parsing other MediaWikis #15

Closed aisuneko closed 1 year ago

aisuneko commented 1 year ago

When dumping a database of any MediaWiki site other than Wikipedia created with mwscrape, the generated external image URLs are broken, even with the --ensure-ext-image-urls flag enabled. As a result, "Load remote content" on Aard2 clients doesn't work for the images in the resulting slob files. An example:

source: https://wiki.archlinux.org/title/File:Merge-arrows-2.png
target: https://wiki.archlinux.org/images/c/c9/Merge-arrows-2.png
broken result in generated slob: https://wiki.archlinux.org/title//c/c9/Merge-arrows-2.png

(In some other cases the generated links work fine, but "Load remote content" still doesn't work properly.)

I suspect that it's related to https://github.com/itkach/mw2slob/blob/078ac8303e4ae625d572b8029f6a7c96cf17d980/mw2slob/convert.py#L212-L213 but unfortunately I couldn't figure it out, because I'm unclear on how a "File:" relative link is converted to the actual path of the file/image in its respective commons. Or perhaps it depends on the structure of the individual MediaWiki sites?

I've created a .slob with this tool which reproduces this problem: archwiki_image_link_test_not_working.slob (not archwiki-20221126.slob, which I linked here originally)

@itkach

itkach commented 1 year ago

I've created a .slob with this tool which reproduces this problem: archwiki-20221126.slob

Which article? Poking at random articles for a few minutes, I can't find any that include images. Also, pages in the File: namespace don't seem to be included? By default mwscrape only downloads articles (namespace with id 0 and no prefix), and links to non-article pages are converted to external links. In any case, I'm away for the next few weeks; I can investigate afterwards.

aisuneko commented 1 year ago

@itkach Example page: Laptop/Lenovo. I'm aware that mwscrape doesn't download images; my problem here is that links to non-article pages are not properly converted to external links, as shown in the given example.

My bad, I had already applied the image filter to that slob. Try this one instead: archwiki_image_link_test_not_working.slob

aisuneko commented 1 year ago

A little investigation: when I place a breakpoint at that line of convert.py and run the script on the example above, it prints (server, articlepath, site_articlepath, url):

https://wiki.archlinux.org, /title/, /title/, /title/File:Tango-edit-clear.png

It seems that it fails to extract the actual image URL if the images are stored somewhere else (e.g. under another subdirectory) in the wiki. (In this case, the images are under https://wiki.archlinux.org/images/, not /title/.) Wikipedia sites, however, work just fine. Maybe that's because we don't need to scrape them ourselves thanks to the Wikimedia Enterprise (WE) HTML dumps? It's just kinda weird.
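
To make the distinction concrete, here is a minimal sketch using the values printed above. It is not mw2slob's actual code: `is_under_articlepath` is a hypothetical helper that only checks whether a URL path sits under the wiki's article path. The `File:` page link does; the image file itself does not.

```python
def is_under_articlepath(path: str, articlepath: str) -> bool:
    """Return True if `path` lies under the wiki's article path.

    Hypothetical helper for illustration only; not part of mw2slob.
    """
    # Normalize so that "/title" and "/title/" behave the same way.
    prefix = articlepath if articlepath.endswith("/") else articlepath + "/"
    return path.startswith(prefix)


articlepath = "/title/"  # value observed at the breakpoint

# The link target of the image (a page in the File: namespace).
print(is_under_articlepath("/title/File:Tango-edit-clear.png", articlepath))  # True

# The image file itself, which this wiki serves from /images/.
print(is_under_articlepath("/images/c/c9/Merge-arrows-2.png", articlepath))  # False
```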

itkach commented 1 year ago

URLs like https://wiki.archlinux.org/title/File:Tango-edit-clear.png are what the image links to (the href of the image's parent <a>). These are correctly converted to external links: if you click the broken image placeholder, that's the external page that opens, as expected. Image sources, however, like /images/c/c9/Merge-arrows-2.png, are not converted correctly because, indeed, it is assumed that images live under the same path as articles (/title in this wiki), which is not the case here. I no longer recall why this assumption was made 🤷. For Wikipedia sites this never comes into play because all images are hosted at upload.wikimedia.org and are referenced by their absolute URLs. I have some thoughts on how to fix it; I'll give it a try when I'm back in front of a computer.
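
One possible approach, shown only as a sketch and not as the fix that was applied in mw2slob: resolve each image `src` against the server root with the standard library's `urllib.parse.urljoin`, independently of the article path. `resolve_image_src` is a hypothetical name, and the `upload.wikimedia.org` path below is a made-up placeholder; the Arch Wiki URLs are the ones from this thread.

```python
from urllib.parse import urljoin


def resolve_image_src(server: str, src: str) -> str:
    """Turn an image `src` into an absolute URL.

    Hypothetical helper for illustration only; not part of mw2slob.
    Root-relative paths are resolved against `server`, protocol-relative
    URLs inherit the server's scheme, and absolute URLs pass through.
    """
    return urljoin(server, src)


server = "https://wiki.archlinux.org"

# Root-relative source: resolved against the server, not the article path.
print(resolve_image_src(server, "/images/c/c9/Merge-arrows-2.png"))
# https://wiki.archlinux.org/images/c/c9/Merge-arrows-2.png

# Absolute source (placeholder path): returned unchanged.
print(resolve_image_src(server, "https://upload.wikimedia.org/example.png"))
# https://upload.wikimedia.org/example.png
```

Because the result does not depend on `articlepath`, this sketch handles wikis that serve images from a different directory than articles.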