Closed vgambier closed 1 year ago
Hi @vgambier—thanks for bringing this to my attention! I recognize your name from a while ago, so welcome back!
Let me know if I should instead open an issue on the upstream wikiteam project.
Here are my thoughts on this:
I personally don't make much use of wikiteam3
, and I ended up maintaining this fork largely by accident because other people kept replying to my draft pull request on the upstream project. (You probably know all this already.)
I've talked with @nemobis personally, and while we came to the conclusion that my fork is probably not going to merge back with his upstream project anytime soon, it would probably be worth opening an equivalent issue there, as well, even if just to bring it to his attention. One reason this fork is unlikely to merge upstream anytime soon is that Federico's priorities as a member of Archive Team are somewhat different than the priorities of the various people who've been using my fork, which honestly is fine.
Obviously if you report the issue upstream, you should first confirm that the issue does in fact also occur in the upstream version. The upstream version is still only Python 2, so you can follow the instructions for setting up a Python 2 virtual environment in my pull request here. (I honestly don't know if these instructions are applicable for Windows, though.)
Regarding doing a pull request here: because I don't actually fully understand the inner workings of wikiteam3
(I just haven't taken the time), and because I don't really work in Python most of the time, I consider my role here as largely coordinating other contributors and, to an extent, doing my best to troubleshoot issues when the people who bring them to my attention can't fix them themselves. On this note, I've gone ahead and added you as a collaborator on this repository, which means you can create branches, and you can approve other people's pull requests to the main branch, python3
. I see that I've already added your name to the main README, so thank you again for your prior contributions!
If part of what you're asking about with pointing you in the right direction is the more general process of contributing, and you're not terribly familiar with Git and the like, I would recommend GitHub Desktop and Visual Studio Code. They're both maintained by Microsoft and its GitHub subsidiary, so they do help streamline the process. Microsoft also has documentation for the GitHub extension for VS Code, which I haven't really used but which may nonetheless be helpful. GitHub also has an extremely high-level overview of the process which could be of some help. (I should probably put all of this in the main README.)
Note: while looking up the Conda instructions to link them above, I came across this pull request upstream that may be related and seems like something we should look into regardless!
Doing a quick search for the string is wrong
shows that the error message In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
is coming from page_xml.py
, so that's probably a good place to start?
It seems that the printout is being triggered by the expression re.search(r"</mediawiki>", xml)
returning None (i.e. no match)
, so adding print(xml)
(or sys.exit(xml)
) immediately after the above printout, i.e. on line 35, to print out the XML string missing the substring </mediawiki>
(and optionally abort, if you use sys.exit(xml)
instead) could help diagnose the problem.
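As a rough illustration of that diagnostic, here is a minimal sketch, assuming the check is a plain regex search on the response text. The helper name report_incomplete_xml is hypothetical and is not taken from page_xml.py; it only shows the idea of dumping the response when the closing tag is missing.

```python
import re
import sys


def report_incomplete_xml(xml: str, abort: bool = False) -> bool:
    """Return True if xml contains the closing </mediawiki> tag.

    Hypothetical debugging helper; not part of page_xml.py.
    """
    # re.search returns a match object on success and None otherwise.
    if re.search(r"</mediawiki>", xml) is not None:
        return True
    # Print whatever the server sent back instead of a complete export.
    print(xml)
    if abort:
        sys.exit(1)
    return False
```

Calling it with abort=True mirrors the sys.exit variant: the response is printed and the run stops at the first bad page.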
I also see that your printout starts with the following:
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
...
And I know that I actually updated this printout (in part because I thought the malformed #
box looked tacky lol), so I consequently know that you are running an outdated version of wikiteam3
. Because of this, could you pull the latest version, install it, and then try the same command again? Assuming you're not in a fork, you can do the following in the directory for the cloned repository:
git pull && git checkout python3
(You may see an error if you're already in python3
.)
If you're in a fork, I would recommend deleting the cloned repository, accepting the collaboration invite I sent you, then cloning
elsiehupp/wikiteam3
directly instead. It's possible to add or change Git remotes, but re-cloning makes things simpler, especially when communicating here.
Then:
pip install --force-reinstall dist/*.whl
If you're going to start making changes for a pull request, after doing the above, do the following:
git branch your-branch-name-here python3 && git checkout your-branch-name-here
(Wherein you replace your-branch-name-here
with your desired name for the branch, ideally something descriptive starting with fix-
; if you're feeling unimaginative you can just do fix-issue-22
.)
After you've made any changes to the Python code, do the following:
black .
(This just auto-formats the code; the pre-commit will do so later, but doing it preemptively just keeps the pre-commit from squawking at you.)
Then:
poetry build
pip install --force-reinstall dist/*.whl
This will mean that dumpgenerator
as running from your command line will be the version you're working on.
Alternately, if you don't want to install your WIP version, you can (instead of building and installing it) run it directly by prepending it with poetry run ...
, i.e.:
poetry run dumpgenerator --images --xml --curonly --delay 2 https://www.mariowiki.com/
I am currently running the command on my own computer (i.e. macOS) to check if I can reproduce the error on my end, and I will follow up if my results significantly differ from your own.
Thank you very much for all the information (and the kind words!), I appreciate it. I definitely have a clearer idea of what to do now. I'll start looking into it soon :) I'm no longer on Windows, so I don't expect to have too much trouble following the instructions for setting up the virtual environment.
👍 and you're welcome!
By the way, I ran dumpgenerator
with sys.exit(xml)
, and this is the output I got (though I auto-formatted the markup for legibility):
You do not have permission to edit this page, for the following reason:
You can view and copy the source of your edits to this page.
Return to Main Page.
I'm trying to run dumpgenerator
with sys.exit(xml)
and https://wiki.archiveteam.org/
for comparison (since I think that should work), but it's being slow on my end, so I'm still waiting for the output from that to complete.
Off the top of my head, I'm noticing the fact that the output is HTML, not XML, which would probably account for the lack of the </mediawiki>
XML tag! So my guess here is that the API URL for the Mario Wiki is somehow incorrect.
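A quick way to tell the two cases apart is to look at how the response body starts. This is only a heuristic sketch; the function name looks_like_html is made up for illustration and nothing like it is known to exist in wikiteam3.

```python
def looks_like_html(body: str) -> bool:
    """Heuristic: True if the body starts like an HTML page.

    Hypothetical helper; a MediaWiki XML export is expected to start
    with a <mediawiki ...> element rather than an HTML doctype.
    """
    # Ignore a leading byte-order mark and whitespace before comparing.
    head = body.lstrip("\ufeff \t\r\n")[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```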
If you want to print out the output even if it's correct (i.e. for comparison), you can add the following to line 81 (i.e. after xml = fixBOM(r)
):
sys.exit("index URL " + config["index"] + " returns: \n\n" + xml)
This should show you both the XML and the URL it has been loaded from.
EDIT: fixed the Python code. Oops!
I identified the issue: mwGetAPIAndIndex
in api.py
parses the HTML of the main page to get the index url, using the view source button as reference. For ssbwiki and mariowiki, the view source button sends you to https://www.ssbwiki.com/Main_Page?action=edit and https://www.mariowiki.com/Main_Page?action=edit. This is odd because the Main page button sends you to https://www.ssbwiki.com/ and https://www.mariowiki.com/. The /Main_Page
urls automatically redirect you to this raw/shortened form.
For comparison, the way the Archiveteam wiki works is that the Main page button sends you to https://wiki.archiveteam.org/index.php/Main_Page while the view source button sends you to https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit, so the index
variable is set to https://wiki.archiveteam.org/index.php. There are no redirects, although the raw https://wiki.archiveteam.org/ url and the https://wiki.archiveteam.org/index.php url work just as well as https://wiki.archiveteam.org/index.php/Main_Page.
In order for the rest of the program to work, the index
variable should have actually been set to https://www.ssbwiki.com/index.php
Indeed, we then use the index
variable to construct https://www.ssbwiki.com/Main_Page?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 , which does not return XML, which causes the program to crash (https://www.ssbwiki.com?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 does not work either). The correct link is https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1.
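To make the failure concrete, here is a small sketch of the behavior described above. It assumes the index is obtained by dropping the query string from the view source link, which matches the URLs quoted in this thread but is not copied from mwGetAPIAndIndex; both function names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def index_from_edit_link(edit_url: str) -> str:
    """Drop the query string and fragment from a view-source/edit link."""
    parts = urlsplit(edit_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def export_url(index: str, title: str) -> str:
    """Build a Special:Export request against the given index URL."""
    query = urlencode(
        {
            "title": "Special:Export",
            "pages": title,
            "action": "submit",
            "curonly": 1,
            "limit": 1,
        }
    )
    return index + "?" + query


# Archive Team wiki: the edit link already points at index.php.
good = index_from_edit_link(
    "https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit"
)
# ssbwiki: the edit link points at the article path, so the "index" is wrong.
bad = index_from_edit_link("https://www.ssbwiki.com/Main_Page?action=edit")
```

Here good is https://wiki.archiveteam.org/index.php, while bad is https://www.ssbwiki.com/Main_Page, and export_url(bad, "Main_Page") reproduces the broken URL quoted above.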
This working url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location - the closest I can think of would be the log in button, but that might break other wikis if we changed this behavior.
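One possible fallback, sketched here purely as an idea, is to scan every link on the page for one whose path ends in /index.php. The function find_index_php is hypothetical, the regex-based link extraction is a simplification, and, as noted above, it is unclear whether this would behave well on other wikis.

```python
import re
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit


def find_index_php(html: str, base_url: str) -> Optional[str]:
    """Return the first linked URL whose path ends in /index.php, if any.

    Hypothetical fallback; not how wikiteam3 currently finds the index.
    """
    for href in re.findall(r'href="([^"]+)"', html):
        # Resolve relative links and undo the common &amp; escaping.
        absolute = urljoin(base_url, href.replace("&amp;", "&"))
        parts = urlsplit(absolute)
        if parts.path.endswith("/index.php"):
            # Keep only scheme, host, and path; drop query and fragment.
            return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return None
```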
So, the reason these wikis break is that their main page is set up in a way that the program does not expect. I am not sure how to programmatically fix the issue, because I don't know what the best way to find the actual index url is. I have edited the original issue because my initial assumption was wrong, although I suppose the redirection could be indirectly related in some way.
To bypass the issue, we can pass --index https://www.ssbwiki.com/index.php
as an argument.
The README does state "If the script can't find itself the API and/or index.php paths, then you can provide them", so this may not be an uncommon bug. Perhaps we could at the very least handle the error more gracefully and suggest the user manually adds the --index
argument (but then again, I'm not sure how to instruct them to find the proper index url)
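A more graceful failure could look something like the sketch below. ExportError and check_export are hypothetical names used only for illustration; the sole real detail assumed is the --index flag mentioned above.

```python
class ExportError(Exception):
    """Hypothetical error raised when an export response is not usable."""


def check_export(index_url: str, body: str) -> str:
    """Return body if it looks like a complete export, else raise a hint."""
    if "</mediawiki>" in body:
        return body
    raise ExportError(
        f"{index_url} did not return a complete XML export. "
        "The index.php path may have been detected incorrectly; "
        "try passing it explicitly with --index."
    )
```

This fails on the first bad response with an actionable message, rather than retrying and repeating the "XML is wrong" warning.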
I have not yet tried to reproduce the issue on the upstream wikiteam version. I'll try to keep working on this issue, hopefully I'll find a way to fix it. I think this might be something worth asking the upstream maintainers, who are probably more familiar with the way wikis work in general.
$ dumpgenerator --xml --curonly --delay 2 --api https://www.ssbwiki.com/api.php
gets further than
$ dumpgenerator --xml --curonly --delay 2 https://www.ssbwiki.com/
(Let me know if I should instead open an issue on the upstream wikiteam project)
wikiteam3 cannot be used on certain wikis. Affected wikis include https://www.ssbwiki.com/ and https://www.mariowiki.com/.
The issue seems to stem from the fact that wikiteam3 first attempts to check the /wiki/Main_Page article, but this page does not exist on some wikis. For some reason, some wikis are set up in such a way that the main page is simply https://www.mariowiki.com/. Note that https://www.mariowiki.com/Main_Page does redirect to https://www.mariowiki.com/. Also note that, contrary to most wikis, the affected wikis do not use the /wiki/ prefix for articles (e.g. https://www.ssbwiki.com/Mega_Man_2 and https://www.mariowiki.com/Donkey_Kong_(game)); perhaps this is actually the root cause of the issue. I am willing to attempt a PR if you could perhaps point me in the right direction!
Logs: