mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Branch python3 dumpgenerator.py fails on private wiki. #40

Closed robkam closed 1 year ago

robkam commented 1 year ago

Using Git Bash on Windows 10 with Python 3.11.1 from the Microsoft Store, and the python3 branch. USER and PASSWORD below are placeholders for the real credentials:

$ dumpgenerator --delay 0.0 --failfast --xml --images --xmlrevisions --user USER --pass PASSWORD --api=https://scruffy.miraheze.org/w/api.php
Checking API... https://scruffy.miraheze.org/w/api.php
MediaWiki API seems to work but returned no index URL
API is OK: https://scruffy.miraheze.org/w/api.php
Checking index.php... https://scruffy.miraheze.org/w/index.php
ERROR: This wiki requires login and we are not authenticated
Error in index.php.
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./scruffymirahezeorg_w-20221216-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://scruffy.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Error: could not get namespaces from the API request.
HTTP 200
{"error":{"code":"readapidenied","info":"You need read permission to use this module.","*":"See https://scruffy.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."},"servedby":"mw141"}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Users\Rob\AppData\Local\[...]\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Users\Rob\AppData\Local\[...]\wikiteam3\dumpgenerator\generator.py", line 87, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Users\Rob\AppData\Local\[...]\wikiteam3\dumpgenerator\generator.py", line 98, in createNewDump
    getPageTitles(config=config, session=other["session"])
  File "C:\Users\Rob\AppData\Local\[...]\wikiteam3\dumpgenerator\page_titles.py", line 193, in getPageTitles
    for title in titles:
  File "C:\Users\Rob\AppData\Local\[...]\wikiteam3\dumpgenerator\page_titles.py", line 16, in getPageTitlesAPI
    namespaces, namespacenames = getNamespacesAPI(config=config, session=session)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable NoneType object
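
The TypeError above hides the real cause: the API replied with a `readapidenied` error, the namespace lookup apparently returned `None`, and the caller then tried to unpack it. A minimal sketch of surfacing the API error instead is shown below. `WikiApiError` and `parse_namespaces` are hypothetical names for illustration and are not part of wikiteam3; the response shape assumed is the standard `meta=siteinfo&siprop=namespaces` JSON reply.

```python
import json


class WikiApiError(RuntimeError):
    """Hypothetical exception type carrying the MediaWiki API error code."""

    def __init__(self, code, info):
        super().__init__(f"{code}: {info}")
        self.code = code
        self.info = info


def parse_namespaces(response_text):
    """Return (ids, names) from a siteinfo namespaces reply.

    Raises WikiApiError when the API reports an error such as
    readapidenied, so callers never try to unpack a None.
    """
    data = json.loads(response_text)
    if "error" in data:
        err = data["error"]
        raise WikiApiError(err.get("code", "unknown"), err.get("info", ""))
    namespaces = data["query"]["namespaces"]
    ids = [int(key) for key in namespaces]
    names = {int(key): value.get("*", "") for key, value in namespaces.items()}
    return ids, names


# Shortened copy of the error payload from the log above.
denied = (
    '{"error":{"code":"readapidenied",'
    '"info":"You need read permission to use this module."},'
    '"servedby":"mw141"}'
)

try:
    parse_namespaces(denied)
except WikiApiError as exc:
    print(f"API refused the request: {exc.code}")
```

With a check like this, the run would stop with the API's own error code rather than an unrelated unpacking failure.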

robkam commented 1 year ago

I logged in as USERNAME, exported miraheze.org_cookies.txt, and added --cookies to the command line. The error message changed. Also, scruffymirahezeorg_w-20221220-titles.txt is an empty file.

$ dumpgenerator --cookies miraheze.org_cookies.txt --user USERNAME --pass PASSWORD --delay 0.0 --failfast --xml --images --xmlrevisions --images --api=https://scruffy.miraheze.org/w/api.php
Using cookies from miraheze.org_cookies.txt
Checking API... https://scruffy.miraheze.org/w/api.php
API is OK: https://scruffy.miraheze.org/w/api.php
Checking index.php... https://scruffy.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./scruffymirahezeorg_w-20221220-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

Analysing https://scruffy.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
28 namespaces found
Traceback (most recent call last):
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\listing.py", line 55, in __next__
    item = six.next(self._iter)
           ^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Rob\AppData[...]Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Users\Rob\AppData[...]site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Users\Rob\AppData[...]site-packages\wikiteam3\dumpgenerator\generator.py", line 87, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Users\Rob\AppData[...]site-packages\wikiteam3\dumpgenerator\generator.py", line 98, in createNewDump
    getPageTitles(config=config, session=other["session"])
  File "C:\Users\Rob\AppData[...]site-packages\wikiteam3\dumpgenerator\page_titles.py", line 193, in getPageTitles
    for title in titles:
  File "C:\Users\Rob\AppData[...]site-packages\wikiteam3\dumpgenerator\page_titles.py", line 28, in getPageTitlesAPI
    for page in site.allpages(namespace=namespace):
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\listing.py", line 180, in __next__
    info = super(GeneratorList, self).__next__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\listing.py", line 61, in __next__
    self.load_chunk()
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\listing.py", line 191, in load_chunk
    return super(GeneratorList, self).load_chunk()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\listing.py", line 95, in load_chunk
    data = self.site.get(
           ^^^^^^^^^^^^^^
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\client.py", line 234, in get
    return self.api(action, 'GET', *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\client.py", line 288, in api
    if self.handle_api_result(info, sleeper=sleeper):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Rob\AppData[...]site-packages\mwclient\client.py", line 331, in handle_api_result
    raise errors.APIError(info['error']['code'],
mwclient.errors.APIError: ('readapidenied', 'You need read permission to use this module.', 'See https://scruffy.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.')
    Retrieving titles in the namespace 0
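
Since the run with `--cookies` still ends in `readapidenied`, one thing worth ruling out is whether the exported cookie file contains any cookies for the wiki's domain at all. The sketch below uses only the standard library's `http.cookiejar.MozillaCookieJar` to list cookie names from a Netscape-format file. The helper name `cookie_names_for` and the sample cookie name and value are made up for illustration; the real file will contain different names. Finding cookies here does not prove the server accepts them.

```python
import http.cookiejar
import os
import tempfile


def cookie_names_for(path, domain_suffix):
    """List cookie names in a Netscape-format cookies.txt for a domain suffix."""
    jar = http.cookiejar.MozillaCookieJar(path)
    # Keep session cookies and expired cookies so nothing is silently hidden.
    jar.load(ignore_discard=True, ignore_expires=True)
    return sorted(
        cookie.name
        for cookie in jar
        if cookie.domain.lstrip(".").endswith(domain_suffix)
    )


# Hypothetical sample file; the cookie name and value are placeholders.
sample = (
    "# Netscape HTTP Cookie File\n"
    ".miraheze.org\tTRUE\t/\tTRUE\t4102444800\texample_session\tabc123\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "cookies.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    print(cookie_names_for(sample_path, "miraheze.org"))
```

An empty list for the wiki's domain would suggest the export itself is the problem rather than the dump tool.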

robkam commented 1 year ago

On the Miraheze wiki, via Manage this wiki's additional settings > Permissions, I added:

$wgWhitelistRead = [
    'Special:Export'
];

This gives the same error as above.
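
As far as I understand, `$wgWhitelistRead` only exempts the listed page titles from the read restriction and does not grant the `read` right that API modules check, which would be consistent with the error not changing. A quick way to see what the session is actually allowed to do is a `meta=userinfo` query with `uiprop=rights`. The sketch below only builds the URL and interprets a reply; the helper names and the sample reply are hypothetical, and sending the request with the session's cookies is left out.

```python
import json
from urllib.parse import urlencode


def userinfo_url(api_url):
    """Build a meta=userinfo query URL asking for the caller's rights."""
    params = {
        "action": "query",
        "meta": "userinfo",
        "uiprop": "rights",
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def has_read_right(response_text):
    """Return True if the reply lists 'read' among the user's rights.

    An error payload (for example readapidenied) has no userinfo block,
    so it is treated as "no read right".
    """
    data = json.loads(response_text)
    info = data.get("query", {}).get("userinfo", {})
    return "read" in info.get("rights", [])


# Hypothetical reply for a logged-in user; real replies list more rights.
logged_in = (
    '{"batchcomplete":"","query":{"userinfo":'
    '{"id":1,"name":"Example","rights":["read","edit"]}}}'
)
refused = '{"error":{"code":"readapidenied","info":"You need read permission to use this module."}}'

print(userinfo_url("https://scruffy.miraheze.org/w/api.php"))
print(has_read_right(logged_in), has_read_right(refused))
```

If the reply for the cookie-based session lacks `read`, the session is effectively anonymous and the login step is what needs fixing.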

robkam commented 1 year ago

This should be using OAuth.

robkam commented 1 year ago

Issue raised at https://github.com/mwclient/mwclient/issues/278