Closed aisuneko closed 1 year ago
Which article? Poking at random articles for a few minutes, I can't find any that include images. Also, pages in the File: namespace don't seem to be included? By default, mwscrape only downloads articles (the namespace with id 0 and no prefix), and links to non-article pages are converted to external links. In any case, I'm away for the next few weeks; I can investigate afterwards.
@itkach Example page: Laptop/Lenovo. I'm aware that mwscrape doesn't download images; my problem here is that links to non-article pages are not being properly converted to external links, as shown in the given example.
My bad, I had already applied the image filter to that slob. Try this one instead: archwiki_image_link_test_not_working.slob
A little investigation:
When I place a breakpoint at the said line of convert.py and run the script with the said example, it prints:
https://wiki.archlinux.org, /title/, /title/, /title/File:Tango-edit-clear.png
(server, articlepath, site_articlepath, url)
It seems that it fails to extract the actual image URL if the images are placed somewhere else in the wiki (e.g. under another subdirectory). In this case, the images are under https://wiki.archlinux.org/images/, not /title/.
Wikipedia sites, however, work just fine. Maybe that's because we don't need to scrape them ourselves, given the WE HTML dumps? It's just kinda weird.
URLs like https://wiki.archlinux.org/title/File:Tango-edit-clear.png are what the image links to (the href of the image's parent <a>). That link is correctly converted to an external link: if you click the broken image placeholder, that's the external page that opens, as expected. Image sources, however, like /images/c/c9/Merge-arrows-2.png, are not converted correctly because, indeed, it is assumed that images live under the same path as articles (/title in this wiki), which is not the case here. I no longer recall why this assumption was made 🤷. For Wikipedia sites this never comes into play because all images are hosted at upload.wikimedia.org and are referenced by their absolute URLs. I have some thoughts on how to fix it; I'll give it a try when I'm back in front of a computer.
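To illustrate the distinction above, here is a minimal sketch (not mw2slob's actual code) of resolving both kinds of reference against the server URL alone, without assuming images live under the article path. The helper name `to_external_url` is hypothetical, and the `upload.wikimedia.org` path in the last call is a made-up placeholder; the other values are taken from the example in this thread.

```python
from urllib.parse import urljoin


def to_external_url(server: str, ref: str) -> str:
    """Resolve a link href or image src against the wiki's server URL.

    Hypothetical helper for illustration only. Root-relative references
    such as "/images/..." or "/title/..." are joined to the server root;
    absolute URLs are returned unchanged.
    """
    # Normalize the base so it always ends with a single "/".
    base = server.rstrip("/") + "/"
    return urljoin(base, ref)


server = "https://wiki.archlinux.org"

# href of the image's parent <a>: lives under the article path (/title/)
print(to_external_url(server, "/title/File:Merge-arrows-2.png"))

# src of the <img>: lives under /images/, not under the article path
print(to_external_url(server, "/images/c/c9/Merge-arrows-2.png"))

# Absolute image URL (placeholder path): passes through untouched
print(to_external_url(server, "https://upload.wikimedia.org/example.png"))
```

Because nothing here depends on the article path, the image source resolves to https://wiki.archlinux.org/images/c/c9/Merge-arrows-2.png rather than a URL under /title/. Whether this fits how convert.py rewrites URLs is an open question; it only shows the intended outcome.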
When dumping a database created with mwscrape from an arbitrary MediaWiki site (other than Wikipedia), the generated external image URLs are broken, even with the --ensure-ext-image-urls flag enabled. As a result, "Load remote content" on Aard2 clients won't work for the images included in the resulting slob files. An example:
source: https://wiki.archlinux.org/title/File:Merge-arrows-2.png
target: https://wiki.archlinux.org/images/c/c9/Merge-arrows-2.png
broken result in generated slob: https://wiki.archlinux.org/title//c/c9/Merge-arrows-2.png
(In some other cases, the generated links work fine, but "Load remote content" still doesn't work properly.)
I suspect that it's related to https://github.com/itkach/mw2slob/blob/078ac8303e4ae625d572b8029f6a7c96cf17d980/mw2slob/convert.py#L212-L213, but unfortunately I was unable to figure it out, as I'm unclear on how a "File:" relative link is converted to the actual path of the file/image in its respective commons. Or perhaps it depends on the structure of the respective MediaWiki sites?
I've created a .slob with this tool which reproduces this problem:
archwiki-20221126.slob (not this one, but this: archwiki_image_link_test_not_working.slob)
@itkach