mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Branch prepare-for-publication gives UnicodeEncodeError #37

Closed robkam closed 1 year ago

robkam commented 1 year ago

On Windows 10 with Python 3.10 (Microsoft Store app), the following fails in both Command Prompt and Bash:

$ dumpgenerator --xml --images --api=https://sdiywiki.miraheze.org/w/api.php

Checking API... https://sdiy.info/w/api.php
API is OK: https://sdiy.info/w/api.php

Checking index.php... https://sdiywiki.miraheze.org/w/index.php
index.php is OK

No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./sdiyinfo_w-20221212-wikidump

#########################################################################
# Welcome to DumpGenerator 3.0.0 by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://sdiy.info/w/api.php
Trying generating a new dump into a new directory...

(28 namespaces found)
Loading page titles from all namespaces

Retrieving titles from the API
(28 namespaces found)

(0)     Retrieving titles from root namespace
        (639 titles retrieved)

(1)     Retrieving titles from namespace "Talk"
        (19 titles retrieved)

(2)     Retrieving titles from namespace "User"
        (17 titles retrieved)

(3)     Retrieving titles from namespace "User talk"
        (6 titles retrieved)

(4)     Retrieving titles from namespace "Synth DIY Wiki"
        (10 titles retrieved)

(5)     Retrieving titles from namespace "Synth DIY Wiki talk"
        (0 titles retrieved)

(6)     Retrieving titles from namespace "File"
        (1111 titles retrieved)

(7)     Retrieving titles from namespace "File talk"
        (1 title retrieved)

(8)     Retrieving titles from namespace "MediaWiki"
        (40 titles retrieved)

(9)     Retrieving titles from namespace "MediaWiki talk"
        (4 titles retrieved)

(10)    Retrieving titles from namespace "Template"
        (618 titles retrieved)

(11)    Retrieving titles from namespace "Template talk"
        (0 titles retrieved)

(12)    Retrieving titles from namespace "Help"
        (15 titles retrieved)

(13)    Retrieving titles from namespace "Help talk"
        (0 titles retrieved)

(14)    Retrieving titles from namespace "Category"
        (263 titles retrieved)

(15)    Retrieving titles from namespace "Category talk"
        (0 titles retrieved)

(3000)  Retrieving titles from namespace "Draft"
        (0 titles retrieved)

(3001)  Retrieving titles from namespace "Draft talk"
        (0 titles retrieved)

(3002)  Retrieving titles from namespace "Boilerplate"
Probably a loop, switching to next namespace
        (8 titles retrieved)

(3003)  Retrieving titles from namespace "Boilerplate talk"
        (0 titles retrieved)

(828)   Retrieving titles from namespace "Module"
        (127 titles retrieved)

(829)   Retrieving titles from namespace "Module talk"
        (0 titles retrieved)

(710)   Retrieving titles from namespace "TimedText"
        (0 titles retrieved)

(711)   Retrieving titles from namespace "TimedText talk"
        (0 titles retrieved)

(2300)  Retrieving titles from namespace "Gadget"
        (0 titles retrieved)

(2301)  Retrieving titles from namespace "Gadget talk"
        (0 titles retrieved)

(2302)  Retrieving titles from namespace "Gadget definition"
        (0 titles retrieved)

(2303)  Retrieving titles from namespace "Gadget definition talk"
        (0 titles retrieved)

Titles saved at... sdiyinfo_w-20221212-titles.txt
2878 page titles loaded
https://sdiy.info/w/api.php

Retrieving the XML for every page from the beginning

19" rack
(1 edit)

19-inch rack
(21 edits)

3U
(12 edits)

4U
(12 edits)

555 timer
(2 edits)

5U
(21 edits)

ADSR
(1 edit)

ARP 2600
(11 edits)

AWG
(1 edit)

->  Downloaded 10 pages

Alan R. Pearlman
(15 edits)
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\wikiteam3\dumpgenerator\generator.py", line 86, in __init__
    DumpGenerator.createNewDump(config, other)
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\wikiteam3\dumpgenerator\generator.py", line 100, in createNewDump
    generateXMLDump(config, titles=titles)
  File "C:\Users\Rob\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\wikiteam3\dumpgenerator\xml_dump.py", line 91, in generateXMLDump
    xml_file.write(str(xml))
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u041a' in position 5231: character maps to <undefined>

$
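
The last frames of the traceback pass through `encodings\cp1252.py`, which suggests the output file was opened in text mode without an explicit encoding, so Python fell back to the Windows code page. The following is a minimal sketch of that failure and one common remedy, not the project's actual fix. The `write_xml` helper and the sample string are hypothetical and are not taken from wikiteam3; only the character U+041A comes from the error message.

```python
import os
import tempfile

# Sample text containing U+041A (CYRILLIC CAPITAL LETTER KA), the character
# named in the traceback. The surrounding markup is made up for illustration.
xml = "<title>\u041a</title>"


def write_xml(path: str, text: str, encoding: str) -> None:
    """Hypothetical helper: write text to path using an explicit encoding."""
    with open(path, "w", encoding=encoding) as xml_file:
        xml_file.write(str(text))


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dump.xml")

# cp1252 has no mapping for U+041A, so this raises the same error class
# that appears at the end of the traceback.
try:
    write_xml(path, xml, "cp1252")
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# UTF-8 can encode U+041A, so the same write succeeds.
write_xml(path, xml, "utf-8")
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print("cp1252 write failed:", cp1252_failed)
print("utf-8 round trip matches:", round_trip == xml)
```

Passing `encoding="utf-8"` to `open()` makes the result independent of the machine's locale. As a workaround that does not touch the code, setting the environment variable `PYTHONUTF8=1` before running the tool enables Python's UTF-8 mode, which changes the default encoding for `open()`; whether that is enough for this particular tool is an assumption that has not been verified here.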

yzqzss commented 1 year ago

The prepare-for-publication branch has been deprecated.