mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Doesn't authenticate on private wiki #118

Closed robkam closed 1 year ago

robkam commented 1 year ago

Windows 10, Git Bash, Python 3.11.1

$ dumpgenerator  --xml --images --user USER --pass PASSWORD --api https://scruffy.miraheze.org/w/api.php  --index https://scruffy.miraheze.org/wiki/index.php 
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK: https://scruffy.miraheze.org/w/api.php
Checking index.php... https://scruffy.miraheze.org/wiki/Index.php
ERROR: This wiki requires login and we are not authenticated
Error in index.php.
Please, provide a correct path to index.php or use --xmlrevisions. Terminating.
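
For reference, the later runs in this thread check index.php at /w/index.php, in the same directory as api.php, rather than at the /wiki/index.php path passed via --index here. Below is a minimal Python sketch of deriving that sibling URL from the API URL. It assumes index.php sits next to api.php, which is common but not guaranteed, and `guess_index_url` is a hypothetical helper, not part of dumpgenerator:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def guess_index_url(api_url: str) -> str:
    """Return the index.php URL in the same directory as the given api.php URL.

    Assumption: the wiki serves index.php alongside api.php.
    """
    parts = urlsplit(api_url)
    # "/w/api.php" -> "/w"
    directory = posixpath.dirname(parts.path)
    index_path = posixpath.join(directory, "index.php")
    # Drop any query string or fragment from the API URL.
    return urlunsplit((parts.scheme, parts.netloc, index_path, "", ""))


print(guess_index_url("https://scruffy.miraheze.org/w/api.php"))
```

A URL guessed this way still needs to be verified against the live wiki before it is used.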
robkam commented 1 year ago
$ dumpgenerator  --xml --images --user USER --pass PASSWORD --api https://scruffy.miraheze.org/w/api.php  --index https://scruffy.miraheze.org/wiki/index.php
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK:  https://scruffy.miraheze.org/w/api.php
Trying to log in to the wiki using clientLogin... (MW 1.27+)
client login: Success! Welcome, Xyzzy!
-- Login OK --
Checking index.php... https://scruffy.miraheze.org/w/index.php
ERROR: This wiki requires login and we are not authenticated
Error in index.php.
Please, provide a correct path to index.php or use --xmlrevisions. Terminating.

or

$ dumpgenerator  --xml --xmlrevisions --images --user USER --pass PASSWORD --api https://scruffy.miraheze.org/w/api.php  --index https://scruffy.miraheze.org/wiki/index.php 
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK:  https://scruffy.miraheze.org/w/api.php
Trying to log in to the wiki using clientLogin... (MW 1.27+)
client login: Success! Welcome, Xyzzy!
-- Login OK --
Checking index.php... https://scruffy.miraheze.org/wiki/index.php
ERROR: This wiki requires login and we are not authenticated
Error in index.php.
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./scruffymirahezeorg_w-20230215-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://scruffy.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
https://scruffy.miraheze.org/w/api.php
Getting the XML header from the API
Export test via the API failed. Wiki too old? Trying without xmlrevisions.
https://scruffy.miraheze.org/w/api.php
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 115, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 128, in createNewDump
    generateXMLDump(config=config, session=other["session"])
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 96, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 124, in getXMLHeader
    header, config = getXMLHeader(config=config, session=session)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 70, in getXMLHeader
    [
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 70, in <listcomp>
    [
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlexport\page_xml_export.py", line 117, in getXMLPageWithExport
    xml = getXMLPageCore(params=params, config=config, session=session)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlexport\page_xml_export.py", line 76, in getXMLPageCore
    r = session.post(
        ^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 635, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\utils\user_agent.py", line 324, in newrequest
    return session._orirequest(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 573, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 484, in prepare_request
    p.prepare(
  File "C:\Python\Lib\site-packages\requests\models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "C:\Python\Lib\site-packages\requests\models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'None': No scheme supplied. Perhaps you meant https://None?
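
The final exception shows requests being handed the value None as a URL, which it renders as the string 'None' in the message. The traceback does not show which setting was left unset. Below is a minimal sketch of a guard that would fail earlier with a clearer message. `require_http_url` is a hypothetical helper for illustration, not part of wikiteam3 or requests:

```python
from urllib.parse import urlsplit


def require_http_url(url, name: str) -> str:
    """Return url unchanged if it is an absolute http(s) URL, else raise ValueError.

    `name` is only used to make the error message point at the unset setting.
    """
    if not isinstance(url, str) or not url:
        raise ValueError(f"{name} is not set (got {url!r})")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"{name} is not an absolute http(s) URL: {url!r}")
    return url


# Usage: validate before posting, so a missing value is reported by name.
try:
    require_http_url(None, "index URL")
except ValueError as exc:
    print(exc)
```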
robkam commented 1 year ago

Still fails.

robkam commented 1 year ago

This might be something else. The wiki was dumped okay on the 27th of January after I turned off the privacy setting. Doing the same now, I get:

$ dumpgenerator  --xml --xmlrevisions --images --api https://scruffy.miraheze.org/w/api.php

<snipped>

Trying to export all revisions from namespace -1
Trying to get wikitext from the allrevisions API and to build the XML
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 115, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 128, in createNewDump
    generateXMLDump(config=config, session=other["session"])
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 137, in generateXMLDump
    doXMLRevisionDump(config, session, xmlfile, lastPage, useAllrevisions=True)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 25, in doXMLRevisionDump
    for xml in getXMLRevisions(config=config, session=session, lastPage=lastPage, useAllrevision=useAllrevisions):
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlrev\xml_revisions.py", line 67, in getXMLRevisionsByAllRevisions
    arvrequest = site.api(
                 ^^^^^^^^^
  File "C:\Python\Lib\site-packages\mwclient\client.py", line 288, in api
    if self.handle_api_result(info, sleeper=sleeper):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\mwclient\client.py", line 331, in handle_api_result
    raise errors.APIError(info['error']['code'],
mwclient.errors.APIError: ('readapidenied', 'You need read permission to use this module.', 'See https://scruffy.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.')
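
The tuple in this exception carries the API error code, its info text, and a documentation hint. Below is a minimal sketch of telling a permission problem apart from other API failures. It assumes the standard MediaWiki JSON error shape, {"error": {"code": ..., "info": ...}}. `describe_api_error` and `PERMISSION_CODES` are hypothetical names, not part of mwclient or dumpgenerator:

```python
from typing import Optional

# Error codes that point at missing permissions rather than a wrong URL.
PERMISSION_CODES = {"readapidenied"}


def describe_api_error(payload: dict) -> Optional[str]:
    """Summarise the error in a decoded API response, or return None if there is none."""
    error = payload.get("error")
    if not error:
        return None
    code = error.get("code", "unknown")
    info = error.get("info", "")
    if code in PERMISSION_CODES:
        return f"{code}: {info} (the session has no read access to this wiki)"
    return f"{code}: {info}"


sample = {
    "error": {
        "code": "readapidenied",
        "info": "You need read permission to use this module.",
    }
}
print(describe_api_error(sample))
```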
yzqzss commented 1 year ago

Can you provide the error message from the login process?

miraheze.org blocked my IP. :-(

robkam commented 1 year ago

The error messages are above. By the way, on this wiki the main page is a redirect to the sandbox.

robkam commented 1 year ago

The wiki is MediaWiki 1.39.1

$ dumpgenerator --xml --xmlrevisions https://scruffy.miraheze.org --user USER --pass PASSWORD
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK:  https://scruffy.miraheze.org/w/api.php
Trying to log in to the wiki using clientLogin... (MW 1.27+)
client login: Success! Welcome, Xyzzy!
-- Login OK --
Checking index.php... https://scruffy.miraheze.org/w/index.php
ERROR: This wiki requires login and we are not authenticated
Error in index.php.
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./scruffymirahezeorg_w-20230216-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

Analysing https://scruffy.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
https://scruffy.miraheze.org/w/api.php
Getting the XML header from the API
Export test via the API failed. Wiki too old? Trying without xmlrevisions.
https://scruffy.miraheze.org/w/api.php
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 115, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\generator.py", line 128, in createNewDump
    generateXMLDump(config=config, session=other["session"])
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_dump.py", line 96, in generateXMLDump
    header, config = getXMLHeader(config=config, session=session)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 124, in getXMLHeader
    header, config = getXMLHeader(config=config, session=session)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 70, in getXMLHeader
    [
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\xmldump\xml_header.py", line 70, in <listcomp>
    [
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlexport\page_xml_export.py", line 117, in getXMLPageWithExport
    xml = getXMLPageCore(params=params, config=config, session=session)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\dump\page\xmlexport\page_xml_export.py", line 76, in getXMLPageCore
    r = session.post(
        ^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 635, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\utils\user_agent.py", line 324, in newrequest
    return session._orirequest(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 573, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\requests\sessions.py", line 484, in prepare_request
    p.prepare(
  File "C:\Python\Lib\site-packages\requests\models.py", line 368, in prepare
    self.prepare_url(url, params)
  File "C:\Python\Lib\site-packages\requests\models.py", line 439, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'None': No scheme supplied. Perhaps you meant https://None?
robkam commented 1 year ago

I've put in a request at Miraheze T10511 for your IP to be unblocked.

The answer I get is "... this task would be for Stewards and not for SRE/Phabricator. And otherwise, without knowing the specific IP (it can be sent via email) we are unable to assist." The email address is stewards(at)miraheze.org

robkam commented 1 year ago

After I had logged in to the same private wiki as above, I used the same username and password with MediaWiki Scraper. It authenticated and dumped the wiki okay. After I logged out of the wiki and tried again with MediaWiki Scraper, it still authenticated and dumped the wiki.

Either the problem has been fixed, or instructions to log in first need to be added to the usage.

robkam commented 1 year ago
$ dumpgenerator --xml --xmlrevisions --api https://scruffy.miraheze.org/w/api.php --user USER --pass PASSWORD
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK:  https://scruffy.miraheze.org/w/api.php
Trying to log in to the wiki using clientLogin... (MW 1.27+)
client login: Success! Welcome, USER!
-- Login OK --
Checking index.php... https://scruffy.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./scruffymirahezeorg_w-20230826-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################
[snipped the rest]
robkam commented 1 year ago

The script authenticates when the account used to log in has at least read permission on the wiki.
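
One way to confirm this before starting a dump is to ask the API which rights the logged-in account has. Below is a minimal sketch, assuming the standard MediaWiki query meta=userinfo with uiprop=rights. `userinfo_params` and `can_read` are hypothetical helpers, not part of dumpgenerator, and the sample response is illustrative rather than taken from this wiki:

```python
def userinfo_params() -> dict:
    """Query parameters asking the API for the current user's rights."""
    return {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "rights",
        "format": "json",
    }


def can_read(response: dict) -> bool:
    """Return True if the decoded userinfo response lists the 'read' right."""
    userinfo = response.get("query", {}).get("userinfo", {})
    return "read" in userinfo.get("rights", [])


# Illustrative response shape for an account with read access.
sample = {"query": {"userinfo": {"id": 1, "name": "USER", "rights": ["read", "edit"]}}}
print(can_read(sample))
```

The parameters would have to be sent through the same logged-in session that the dump uses. A response with no rights list, such as an error payload, is treated here as no read access.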