openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New ZIM request: scp-wiki.wikidot.com #402

Open IMayBeABitShy opened 2 years ago

IMayBeABitShy commented 2 years ago

As the majority of this website (the exceptions being the 'random article' buttons, login functionality and search) does not seem to need any backend whatsoever, downloading it via zimit seems like a viable option. Nearly everything on the website is CC-SA, the only exception (?) being the image of SCP-173, but excluding that one should be easy when using zimit. I am not sure it even needs to be excluded.

lbrunkho commented 2 years ago

+1 for this. I actually made a clone of this website using httrack about a year ago and it was an ORDEAL! I would much rather have this in a ZIM file for my kiwix server. On a side note, the image of 173 is going to get a redesign in the near future to avoid this issue.

Popolechien commented 2 years ago

> downloading it via zimit seems like a viable option. Nearly everything on the website is CC-SA, the only exception (?) being the image of SCP-173, but excluding that one should be easy when using zimit

@IMayBeABitShy Have you already tried using the limited version of zimit? Did it work?

I had a cursory look but cannot see whether this is mediawiki-based or not.

IMayBeABitShy commented 2 years ago

Following @Popolechien's suggestion, I've used youzim.it to create a limited ZIM of the site. The website seems to work (obviously some things like the search don't, but ZIM files have their own search functionality anyway). I did, however, notice that a lot of junk JavaScript has been included (e.g. cookie confirmation, ...).

I suggest also excluding the following sites:

This list is probably incomplete, but these should be the most important ones on the main page.

> I had a cursory look but cannot see whether this is mediawiki-based or not.

I don't think it is. There is a wikidot -> mediawiki conversion tool, which also suggests that the site is not MediaWiki-based. Still, I only have superficial knowledge of wiki software, so I may be wrong.

Popolechien commented 2 years ago

@IMayBeABitShy Awesome, I've started a recipe. Let us see what happens.

IMayBeABitShy commented 2 years ago

I think this one failed. I checked the log a couple of times and zimit seemed to spend a lot of time parsing some background pages (workbench, I think they were called). The last time I checked, the job had finally been interrupted.

lbrunkho commented 1 year ago

Looks like the favicon URL has changed. New URL: https://scp-wiki.wikidot.com/local--favicon/favicon.gif

Also, the recipe log is flooded with these errors. Unfortunately, I am not familiar enough with zimit to know what they mean.

[2023-07-02 17:23:23,192: WARNING] failed to load progress details: Expecting value: line 1 column 1 (char 0)
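For what it's worth, the "Expecting value: line 1 column 1 (char 0)" part of that warning is the exact wording Python's standard `json` module uses when it is asked to parse an empty or non-JSON string. A minimal sketch, assuming (not confirmed from the zimit source) that the warning comes from reading a progress file that was empty at that moment; `describe_json_failure` is a hypothetical helper name:

```python
import json


def describe_json_failure(payload: str) -> str:
    """Return the decoder's error message for a payload, or "ok" if it parses."""
    try:
        json.loads(payload)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# An empty string (e.g. a file read before anything was written to it)
# fails at the very first character.
print(describe_json_failure(""))
```

If that assumption holds, the warning only says that progress details could not be read at that point; it says nothing about the crawl itself.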

We can also drop the copyright concern about the SCP-173 image, as it has been removed from the site to comply with the CC BY-SA license.

lbrunkho commented 7 months ago

Another update to this request: the attempt on December 29, 2023 was successful! The resulting ZIM was usable. However, it looks like the depth needs to be increased by at least one.

https://farm.openzim.org/pipeline/6cc5755f-e0de-4a4c-a22f-fa9e43a0603f

Articles listed on the homepage are indexed, but the majority of articles are linked from the series page, which puts them one level too deep for the current crawl.

https://scp-wiki.wikidot.com/scp-series
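To illustrate the depth problem, here is a small sketch that computes how many link hops each page is from the start page. The link graph is an invented toy example (only the `/scp-series` and `/scp-2998` paths come from this thread), and the actual depth configured in the recipe is not stated here, so the limit of 1 is purely illustrative:

```python
from collections import deque


def link_depths(graph, start):
    """Breadth-first search: minimum number of link hops from start to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_within(depths, max_depth):
    """Pages a crawl limited to max_depth hops would reach."""
    return {page for page, depth in depths.items() if depth <= max_depth}


# Hypothetical toy graph: homepage -> series page -> article.
TOY_GRAPH = {
    "/": ["/scp-series", "/some-featured-article"],
    "/scp-series": ["/scp-2998"],
}

depths = link_depths(TOY_GRAPH, "/")
print(sorted(pages_within(depths, 1)))  # articles behind the series page are missing
print(sorted(pages_within(depths, 2)))  # one more hop picks them up
```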

MCSeekeri commented 2 weeks ago

I noticed something very strange: none of the offset pages are being crawled correctly. Also, since the site uses Crom search, I think *.crom.avn.sh should be added to the exclusion list as well.
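As a sketch of what that wildcard is meant to cover, here is a host check in Python. It assumes the exclusion rule would end up as a regular expression, which I have not verified for this recipe; `is_excluded` and the example Crom URL are hypothetical:

```python
import re
from urllib.parse import urlsplit

# Matches crom.avn.sh itself and any subdomain of it, but not look-alike
# hosts such as "notcrom.avn.sh".
CROM_HOST = re.compile(r"(^|\.)crom\.avn\.sh$")


def is_excluded(url: str) -> bool:
    """True if the URL's host falls under *.crom.avn.sh."""
    host = urlsplit(url).hostname or ""
    return CROM_HOST.search(host) is not None


print(is_excluded("https://api.crom.avn.sh/graphql"))          # hypothetical URL
print(is_excluded("https://scp-wiki.wikidot.com/scp-series"))
```

Anchoring the pattern on a dot or the start of the host avoids excluding unrelated domains that merely end in the same characters.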

lbrunkho commented 2 weeks ago

@Popolechien can you reopen this issue or update the recipe for this?

Popolechien commented 2 weeks ago

Just so everyone is on the same page, the latest version available is at https://dev.library.kiwix.org/viewer#scp-wiki_en_all

As far as poking at the zimit recipe goes, I'll defer to @benoit74.

benoit74 commented 2 weeks ago

@lbrunkho @MCSeekeri I'm sorry but I don't get what your issues are.

Can you please provide a link to a page with a non-working link (and details about this non-working link, e.g. position on the screen, text, screenshot, ...) so that I can understand what you are talking about?

MCSeekeri commented 2 weeks ago

> @lbrunkho @MCSeekeri I'm sorry but I don't get what your issues are.
>
> Can you please provide a link to a page with a non-working link (and details about this non-working link, e.g. position on the screen, text, screenshot, ...) so that I can understand what you are talking about?

SCP-2998: the "Next iteration" link at the bottom jumps to /offset/1. Zimit is not crawling that page correctly, seemingly because the page returns 503.

{"timestamp":"2024-09-25T13:49:45.047Z","logLevel":"error","context":"general","message":"Page Crashed on Load","details":{"status":503,"page":"https://scp-wiki.wikidot.com/scp-2998/offset/1","workerid":0}}
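To find every affected page rather than just this one, the crawler log can be filtered for error entries. A minimal sketch, assuming every log line is a JSON object shaped like the one above; `failed_pages` is a hypothetical helper, and lines that are not JSON are skipped:

```python
import json

# The log line quoted above, used as sample input.
LOG_LINE = (
    '{"timestamp":"2024-09-25T13:49:45.047Z","logLevel":"error",'
    '"context":"general","message":"Page Crashed on Load",'
    '"details":{"status":503,'
    '"page":"https://scp-wiki.wikidot.com/scp-2998/offset/1","workerid":0}}'
)


def failed_pages(lines):
    """Yield (page, status, message) for each error-level JSON log entry."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict) or entry.get("logLevel") != "error":
            continue
        details = entry.get("details") or {}
        yield details.get("page"), details.get("status"), entry.get("message")


for page, status, message in failed_pages([LOG_LINE]):
    print(status, message, page)
```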

There are also some issues that don't exist in the current ZIM file; I found them while crawling SCP-CN.

SVG and MathJax: the crawled version doesn't render SVGs correctly and doesn't display math formulas correctly. This is probably due to Wikidot's weird front-end implementation, so both of these issues can be left alone for the time being.

benoit74 commented 2 weeks ago

If the page returns a 503, unfortunately there is nothing we can do ... But here the message says "Page Crashed on Load", so I suspect there is another issue. I will have a look when time is available to work on this ZIM request.

MCSeekeri commented 2 weeks ago

> If the page returns a 503, unfortunately there is nothing we can do ... But here the message says "Page Crashed on Load", so I suspect there is another issue. I will have a look when time is available to work on this ZIM request.

The strange thing is that the page doesn't actually return 503; the content is normal. I'm not sure why this output appears.
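One way to check the status code independently of how the page looks in a browser is to request it with a plain HTTP client. A minimal sketch using only the Python standard library; `fetch_status` and the User-Agent string are hypothetical, and the server may well answer a script differently than it answers a browser or the crawler:

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, including 4xx/5xx codes."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "status-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still what we want to see.
        return err.code


# Example call (performs a real network request):
# fetch_status("https://scp-wiki.wikidot.com/scp-2998/offset/1")
```

A page can render normal-looking content and still carry a 503 status, so comparing this number with what the browser's network tab reports would narrow down where the discrepancy comes from.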