mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

'charmap' codec can't encode character '\u03bc' in position 10: character maps to <undefined> #58

Closed robkam closed 1 year ago

robkam commented 1 year ago

@yzqzss After PR #57 the XML dump fails

$ dumpgenerator  --stdout-log-file log.txt  --failfast --xml --xmlrevisions --api https://sdiywiki.miraheze.org/w/api.php
Checking API... https://sdiywiki.miraheze.org/w/api.php
API is OK: https://sdiywiki.miraheze.org/w/api.php
Checking index.php... https://sdiywiki.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./sdiywikimirahezeorg_w-20230108-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://sdiywiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
28 namespaces found
    646 titles retrieved in the namespace 0

snipped

    0 titles retrieved in the namespace 2302
    0 titles retrieved in the namespace 2303
Titles saved at... sdiywikimirahezeorg_w-20230108-titles.txt
2902 page titles loaded
https://sdiywiki.miraheze.org/w/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

28 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Learning to play a synth, 3 edits (--xmlrevisions)

snipped

CGS Serge resonant equalizer, 2 edits (--xmlrevisions)
R. A. Penfold, 1 edits (--xmlrevisions)
'charmap' codec can't encode character '\u03bc' in position 10: character maps to <undefined>
XML dump saved at... sdiywikimirahezeorg_w-20230108-history.xml
Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/elsiehupp/wikiteam3/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

yzqzss commented 1 year ago

This problem is not caused by #57; I will check it tomorrow.

yzqzss commented 1 year ago

https://github.com/elsiehupp/wikiteam3/blob/98518297cc0ae09a58266b5acf6f040d99743d4d/wikiteam3/dumpgenerator/generator.py#L42

Replace it with self.file = open(filename, 'w', encoding="utf-8"). If that works, I will open a PR.

This problem may only occur on Windows; I can't test it.
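The error message is consistent with Python's default text encoding on Windows: when open() is called without an encoding argument, it uses the locale's preferred encoding, which is often a legacy code page that has no mapping for U+03BC (μ). A minimal sketch of the failure and of the proposed fix follows. The sample title, the helper names, and the choice of cp1252 as the legacy code page are assumptions for illustration and are not taken from the project's code.

```python
# Hypothetical page title containing GREEK SMALL LETTER MU (U+03BC).
TITLE = "10 \u03bcF capacitor"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # This is the error class behind "'charmap' codec can't encode ...".
        return False


def write_log(path: str, text: str) -> None:
    """Write `text` to `path` as UTF-8, independent of the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


print(can_encode(TITLE, "cp1252"))  # legacy Windows code page (assumed)
print(can_encode(TITLE, "utf-8"))
```

Passing encoding="utf-8" explicitly makes the log file behave the same on Windows and Linux, which matches the later report that the dump works on Kubuntu.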

robkam commented 1 year ago

It's still broken, sorry.

robkam commented 1 year ago

Also, the log file looks a little different from stdout. Maybe revert the changes since PR #56?

yzqzss commented 1 year ago

Also, the log file looks a little different from stdout. Maybe revert the changes since PR #56?

Can you provide the .log file?

robkam commented 1 year ago

In the terminal:

$ dumpgenerator  --stdout-log-file log.txt  --failfast --xml --xmlrevisions --api https://sdiywiki.miraheze.org/w/api.php
Checking API... https://sdiywiki.miraheze.org/w/api.php
API is OK: https://sdiywiki.miraheze.org/w/api.php
Checking index.php... https://sdiywiki.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./sdiywikimirahezeorg_w-20230109-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://sdiywiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
28 namespaces found
    646 titles retrieved in the namespace 0
    20 titles retrieved in the namespace 1
    19 titles retrieved in the namespace 2
    7 titles retrieved in the namespace 3
    11 titles retrieved in the namespace 4
    0 titles retrieved in the namespace 5
    1121 titles retrieved in the namespace 6
    1 titles retrieved in the namespace 7
    40 titles retrieved in the namespace 8
    4 titles retrieved in the namespace 9
    618 titles retrieved in the namespace 10
    0 titles retrieved in the namespace 11
    15 titles retrieved in the namespace 12
    0 titles retrieved in the namespace 13
    265 titles retrieved in the namespace 14
    0 titles retrieved in the namespace 15
    0 titles retrieved in the namespace 3000
    0 titles retrieved in the namespace 3001
    Retrieving titles in the namespace 3002Probably a loop, switching to next namespace
    8 titles retrieved in the namespace 3002
    0 titles retrieved in the namespace 3003
    127 titles retrieved in the namespace 828
    0 titles retrieved in the namespace 829
    0 titles retrieved in the namespace 710
    0 titles retrieved in the namespace 711
    0 titles retrieved in the namespace 2300
    0 titles retrieved in the namespace 2301
    0 titles retrieved in the namespace 2302
    0 titles retrieved in the namespace 2303
Titles saved at... sdiywikimirahezeorg_w-20230109-titles.txt
2902 page titles loaded
https://sdiywiki.miraheze.org/w/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

28 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Learning to play a synth, 3 edits (--xmlrevisions)
Drawbacks to modular synthesis, 1 edits (--xmlrevisions)
Patch, 1 edits (--xmlrevisions)
Modular synthesizer, 1 edits (--xmlrevisions)
Circuit bending, 2 edits (--xmlrevisions)
Chronology of synth DIY, 1 edits (--xmlrevisions)
Chadacre Electronics Ltd, 1 edits (--xmlrevisions)
Main Page, 2 edits (--xmlrevisions)
List of synth DIY repositories, 10 edits (--xmlrevisions)
Yamaha YM3812, 1 edits (--xmlrevisions)
Dewtron, 3 edits (--xmlrevisions)
Digisound Ltd, 1 edits (--xmlrevisions)
Doug Curtis, 1 edits (--xmlrevisions)
CEM3340, 1 edits (--xmlrevisions)
Schematics and manuals, 4 edits (--xmlrevisions)
Electronic component suppliers, 2 edits (--xmlrevisions)
KiCad PCB EDA Suite, 1 edits (--xmlrevisions)
PCB layout and design, 1 edits (--xmlrevisions)
PCB and kit suppliers, 2 edits (--xmlrevisions)
Synthesizer do it yourself, 2 edits (--xmlrevisions)
Panel (outsourcing), 8 edits (--xmlrevisions)
Open source music hardware projects, 1 edits (--xmlrevisions)
Panel (outsourcing), 8 edits (--xmlrevisions)
LM13700, 2 edits (--xmlrevisions)
List of synth DIY repositories, 1 edits (--xmlrevisions)
WikiNode, 2 edits (--xmlrevisions)
Stocking up on components, 1 edits (--xmlrevisions)
Simple synth DIY, 1 edits (--xmlrevisions)
Operational transconductance amplifier, 1 edits (--xmlrevisions)
Linear IC, 1 edits (--xmlrevisions)
CGS pulse divider and Boolean logic, 1 edits (--xmlrevisions)
CGS MOTM distribution board, 1 edits (--xmlrevisions)
Arduinome, 2 edits (--xmlrevisions)
CGS voltage controlled slope, 1 edits (--xmlrevisions)
CGS voltage controlled slope Eurorack, 2 edits (--xmlrevisions)
Thomas Henry, 1 edits (--xmlrevisions)
Yamaha PSS-470, 1 edits (--xmlrevisions)
Panel (homebrew), 1 edits (--xmlrevisions)
Modular synthesizer, 1 edits (--xmlrevisions)
PPG Wave, 1 edits (--xmlrevisions)
Eurorack parts, 1 edits (--xmlrevisions)
Maplin Electronics Ltd., 1 edits (--xmlrevisions)
Eurorack panel components, 1 edits (--xmlrevisions)
Cyndustries, 1 edits (--xmlrevisions)
Connectors, 1 edits (--xmlrevisions)
Comparison of Eurorack DIY PSUs, 1 edits (--xmlrevisions)
CatGirl Synth, 1 edits (--xmlrevisions)
Aries System 300, 1 edits (--xmlrevisions)
Blacet Research, 1 edits (--xmlrevisions)
Vacuum tube, 1 edits (--xmlrevisions)
Frac rack, 1 edits (--xmlrevisions)
Audio synthesis via vacuum tubes, 1 edits (--xmlrevisions)
CGS Serge resonant equalizer, 2 edits (--xmlrevisions)
R. A. Penfold, 1 edits (--xmlrevisions)
'charmap' codec can't encode character '\u03bc' in position 10: character maps to <undefined>
XML dump saved at... sdiywikimirahezeorg_w-20230109-history.xml
Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/elsiehupp/wikiteam3/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

In the file:

#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://sdiywiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None

    .
    ..
    ...
    ....
    .....

28 namespaces found
    Retrieving titles in the namespace 0
    646 titles retrieved in the namespace 0

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 1
    20 titles retrieved in the namespace 1

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 2
    19 titles retrieved in the namespace 2

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 3
    7 titles retrieved in the namespace 3

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 4
    11 titles retrieved in the namespace 4

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 5
    0 titles retrieved in the namespace 5

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 6
    1121 titles retrieved in the namespace 6

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 7
    1 titles retrieved in the namespace 7

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 8
    40 titles retrieved in the namespace 8

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 9
    4 titles retrieved in the namespace 9

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 10
    618 titles retrieved in the namespace 10

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 11
    0 titles retrieved in the namespace 11

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 12
    15 titles retrieved in the namespace 12

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 13
    0 titles retrieved in the namespace 13

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 14
    265 titles retrieved in the namespace 14

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 15
    0 titles retrieved in the namespace 15

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 3000
    0 titles retrieved in the namespace 3000

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 3001
    0 titles retrieved in the namespace 3001

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 3002Probably a loop, switching to next namespace

    8 titles retrieved in the namespace 3002

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 3003
    0 titles retrieved in the namespace 3003

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 828
    127 titles retrieved in the namespace 828

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 829
    0 titles retrieved in the namespace 829

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 710
    0 titles retrieved in the namespace 710

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 711
    0 titles retrieved in the namespace 711

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 2300
    0 titles retrieved in the namespace 2300

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 2301
    0 titles retrieved in the namespace 2301

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 2302
    0 titles retrieved in the namespace 2302

    .
    ..
    ...
    ....
    .....

    Retrieving titles in the namespace 2303
    0 titles retrieved in the namespace 2303

    .
    ..
    ...
    ....
    .....

Titles saved at... sdiywikimirahezeorg_w-20230109-titles.txt
2902 page titles loaded
https://sdiywiki.miraheze.org/w/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

    .
    ..
    ...
    ....
    .....

28 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Learning to play a synth, 3 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Drawbacks to modular synthesis, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Patch, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Modular synthesizer, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Circuit bending, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Chronology of synth DIY, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Chadacre Electronics Ltd, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Main Page, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

List of synth DIY repositories, 10 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Yamaha YM3812, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Dewtron, 3 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Digisound Ltd, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Doug Curtis, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CEM3340, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Schematics and manuals, 4 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Electronic component suppliers, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

KiCad PCB EDA Suite, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

PCB layout and design, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

PCB and kit suppliers, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Synthesizer do it yourself, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Panel (outsourcing), 8 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Open source music hardware projects, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Panel (outsourcing), 8 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

LM13700, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

List of synth DIY repositories, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

WikiNode, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Stocking up on components, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Simple synth DIY, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Operational transconductance amplifier, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Linear IC, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CGS pulse divider and Boolean logic, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CGS MOTM distribution board, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Arduinome, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CGS voltage controlled slope, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CGS voltage controlled slope Eurorack, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Thomas Henry, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Yamaha PSS-470, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Panel (homebrew), 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Modular synthesizer, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

PPG Wave, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Eurorack parts, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Maplin Electronics Ltd., 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Eurorack panel components, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Cyndustries, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Connectors, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Comparison of Eurorack DIY PSUs, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CatGirl Synth, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Aries System 300, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Blacet Research, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Vacuum tube, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Frac rack, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

Audio synthesis via vacuum tubes, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

CGS Serge resonant equalizer, 2 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

R. A. Penfold, 1 edits (--xmlrevisions)

    .
    ..
    ...
    ....
    .....

'charmap' codec can't encode character '\u03bc' in position 10: character maps to <undefined>
XML dump saved at... sdiywikimirahezeorg_w-20230109-history.xml
Downloading index.php (Main Page) as index.html

    .
    ..
    ...
    ....
    .....

Downloading Special:Version with extensions and other related info

    .
    ..
    ...
    ....
    .....

Downloading site info as siteinfo.json

    .
    ..
    ...
    ....
    .....

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/elsiehupp/wikiteam3/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

yzqzss commented 1 year ago

    .
    ..
    ...
    ....
    .....

This is expected behavior, not a bug: tee() currently does nothing more than redirect stdout to a file, much like using >.
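As a rough illustration of what a tee of this kind does, here is a sketch of a wrapper that copies everything written to a stream into a UTF-8 log file as well. It is not the project's actual tee() implementation; the class name Tee and its methods are hypothetical. Because every write is copied verbatim, progress output such as the dots above ends up in the file too.

```python
class Tee:
    """Hypothetical sketch: duplicate writes to a stream and to a log file."""

    def __init__(self, stream, filename):
        self.stream = stream
        # Explicit UTF-8 so characters such as μ do not depend on the locale.
        self.file = open(filename, "w", encoding="utf-8")

    def write(self, data):
        # Send the same text to both destinations, unmodified.
        self.stream.write(data)
        self.file.write(data)
        return len(data)

    def flush(self):
        self.stream.flush()
        self.file.flush()

    def close(self):
        # Only the log file is owned by this object; the stream is left open.
        self.file.close()
```

Under these assumptions it could be installed with sys.stdout = Tee(sys.stdout, "log.txt"), after which ordinary print() calls reach both the terminal and the file.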

robkam commented 1 year ago

line 42 changed to self.file = open(filename, 'w', encoding="utf-8")

robkam commented 1 year ago

It works without a problem on Kubuntu. It also writes a file named Special:Version.html; Windows chokes on the colon in the filename. This isn't a new problem, I've only just noticed it.

robkam commented 1 year ago

dumpgenerator --failfast --xml --xmlrevisions --api https://sdiywiki.miraheze.org/w/api.php works fine on Windows 10, Git Bash and Python 3.11.1 without --stdout-log-file log.txt - it gets past the character μ.

yzqzss commented 1 year ago

dumpgenerator --failfast --xml --xmlrevisions --api https://sdiywiki.miraheze.org/w/api.php works fine on Windows 10, Git Bash and Python 3.11.1 without --stdout-log-file log.txt - it gets past the character μ.

Special:Version.html is still not created, due to stupid NTFS, right?

robkam commented 1 year ago

Yes, it makes an empty file named Special

The following characters are not allowed in NTFS file names:

< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)

yzqzss commented 1 year ago

NTFS: / \ : * " ? < > |

Most Linux filesystems: /

Illegal characters in page titles (MediaWiki):
# < > [ ] | { }

Illegal characters in file names (MediaWiki): : / \

Removing the / and combining all of these restrictions gives this strange file name: \:*"?<>|#<>[]|{}boom.png. After I uploaded it, MediaWiki changed it to File:--*"?---------boom.png.

http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/index.php?title=File:--*"%3F---------boom.png


*"? are still illegal characters on NTFS.

So, maybe it's not a good idea to use wikiteam to dump images on NTFS ...

yzqzss commented 1 year ago

Maybe we can rename Special:Version.html and also convert *, ? and " in all image filenames to -, but I need to do more research to make sure the final dump can still be imported properly by MediaWiki.
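The renaming idea can be sketched as a small helper that replaces each character from the NTFS list above with a substitute. This is only an illustration under stated assumptions: the function name ntfs_safe_name and the default replacement "-" are hypothetical, and it is not the change that was eventually made in the project.

```python
# Characters listed earlier in this thread as not allowed in NTFS file names.
NTFS_ILLEGAL = '<>:"/\\|?*'


def ntfs_safe_name(name: str, replacement: str = "-") -> str:
    """Replace every NTFS-illegal character in `name` with `replacement`."""
    table = str.maketrans({char: replacement for char in NTFS_ILLEGAL})
    return name.translate(table)


print(ntfs_safe_name("Special:Version.html"))
print(ntfs_safe_name("Special:Version.html", replacement="_"))
```

Note that a mapping like this is lossy: two different titles can collapse to the same file name, and the original title can no longer be recovered from the file name alone, which is one reason the import question raised above matters. It also does not cover other Windows restrictions such as reserved device names or trailing dots and spaces.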

robkam commented 1 year ago

If I extract Special:Version.html from an archive, NTFS renames it to Special_Version.html. Alternatively, rename it to SpecialVersion.html.

robkam commented 1 year ago

There's some other problem, without trying to get File:--*"?---------boom.png.

$ dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/api.php
Checking API... http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/api.php
API is OK: http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/api.php
Checking index.php... http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./group2mediawikidemosave_weborg_mediawiki_1391-20230111-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
22 namespaces found
    3 titles retrieved in the namespace 0
    0 titles retrieved in the namespace 1
    0 titles retrieved in the namespace 2
    0 titles retrieved in the namespace 3
    0 titles retrieved in the namespace 4
    0 titles retrieved in the namespace 5
    1 titles retrieved in the namespace 6
    0 titles retrieved in the namespace 7
    0 titles retrieved in the namespace 8
    0 titles retrieved in the namespace 9
    0 titles retrieved in the namespace 10
    0 titles retrieved in the namespace 11
    0 titles retrieved in the namespace 12
    0 titles retrieved in the namespace 13
    0 titles retrieved in the namespace 14
    0 titles retrieved in the namespace 15
    0 titles retrieved in the namespace 828
    0 titles retrieved in the namespace 829
    0 titles retrieved in the namespace 2300
    0 titles retrieved in the namespace 2301
    0 titles retrieved in the namespace 2302
    0 titles retrieved in the namespace 2303
Titles saved at... group2mediawikidemosave_weborg_mediawiki_1391-20230111-titles.txt
4 page titles loaded
http://group2.mediawiki.demo.save-web.org/mediawiki-1.39.1/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

22 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
'*'
Traceback (most recent call last):
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\page_xml.py", line 226, in makeXmlFromPage
    text_element = E.text(str(rev["*"]), bytes=str(size))
                              ~~~^^^^^
KeyError: '*'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python\Scripts\dumpgenerator.exe\__main__.py", line 7, in <module>
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\__init__.py", line 26, in main
    DumpGenerator()
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\generator.py", line 112, in __init__
    DumpGenerator.createNewDump(config=config, other=other)
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\generator.py", line 125, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other["session"])
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\xml_dump.py", line 52, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\xml_revisions.py", line 77, in getXMLRevisions
    yield makeXmlFromPage(page)
          ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python\Lib\site-packages\wikiteam3\dumpgenerator\page_xml.py", line 256, in makeXmlFromPage
    raise PageMissingError(page["title"], e)
wikiteam3.dumpgenerator.exceptions.PageMissingError: page 'TestRevisionDelete2' not found

yzqzss commented 1 year ago

There's some other problem, without trying to get File:--*"?---------boom.png.

I forgot to mention that this is caused by another bug in --xmlrevisions: wikiteam3 currently can't handle hidden/deleted revisions properly (the bug also exists upstream). There is a simple patch that skips crawling the whole page when the crawler sees a hidden revision, but I didn't include it in the 0.5.0-alpha PR because it's not a perfect solution, and I'm working on a proper fix.
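The traceback above shows rev["*"] raising KeyError when a revision carries no text. As an illustration of a narrower workaround than skipping the whole page, the sketch below drops only the revisions whose text is unavailable. It is not the patch mentioned here. The helper names are hypothetical, and the "texthidden" key is an assumption about how the MediaWiki API marks hidden text in its legacy JSON format.

```python
from typing import Optional


def revision_text(rev: dict) -> Optional[str]:
    """Return the wikitext of a revision dict, or None if it is unavailable."""
    # Assumption: hidden text is flagged with "texthidden" and has no "*" key.
    if "texthidden" in rev or "*" not in rev:
        return None
    return str(rev["*"])


def visible_revisions(page: dict) -> list:
    """Keep only the revisions of `page` whose text can be read."""
    return [rev for rev in page.get("revisions", []) if revision_text(rev) is not None]


# Hypothetical page shaped like an API result with one hidden revision.
page = {
    "title": "TestRevisionDelete2",
    "revisions": [
        {"revid": 1, "*": "first version"},
        {"revid": 2, "texthidden": ""},
    ],
}
print(len(visible_revisions(page)))
```

Whether a hidden revision should be dropped or written out as an empty placeholder is a separate decision that this sketch does not settle.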

Screenshot_20230112-053922.png

Screenshots_2023-01-12-05-50-52.png

yzqzss commented 1 year ago

Try this; there are no hidden revisions on this site:

http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/index.php?title=File:-_*_%3F_%22_-_-_-.png

robkam commented 1 year ago

I wouldn't want to lose a whole page because of hidden revisions. My wikis have some hidden revisions.

-_*_%3F_%22_-_-_-.png is on the server as -_%2A_%3F_"_-_-_-.png, and Windows saves it as -_______-_-_-.png
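The renaming described above can be reproduced with a small sketch that replaces every character Windows refuses in file names with an underscore. This is only an illustration of the idea, not what dumpgenerator does; sanitize_filename is a hypothetical helper.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_WINDOWS_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(name: str, replacement: str = "_") -> str:
    """Replace each Windows-illegal character in name with replacement."""
    return _WINDOWS_ILLEGAL.sub(replacement, name)


# The file name from this thread, with spaces already turned into underscores.
safe_name = sanitize_filename('-_*_?_"_-_-_-.png')
```

One caveat of this approach: two different titles can collapse to the same sanitized name, so a real implementation would also need some way to keep them apart.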

yzqzss commented 1 year ago

> I wouldn't want to lose a whole page because of hidden revisions? My wikis have some hidden revisions.

I did some tests, --xml (without --xmlrevisions) works fine with hidden revisions.

robkam commented 1 year ago

XML dump seems okay

$ dumpgenerator --delay 0.0 --failfast --xml --images --xmlrevisions http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/api.php
Checking API... http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/api.php
API is OK: http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/api.php
Checking index.php... http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./group1mediawikidemosave_weborg_mediawiki_1277-20230111-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
20 namespaces found
    1 titles retrieved in the namespace 0
    0 titles retrieved in the namespace 1
    0 titles retrieved in the namespace 2
    0 titles retrieved in the namespace 3
    0 titles retrieved in the namespace 4
    0 titles retrieved in the namespace 5
    2 titles retrieved in the namespace 6
    0 titles retrieved in the namespace 7
    0 titles retrieved in the namespace 8
    0 titles retrieved in the namespace 9
    0 titles retrieved in the namespace 10
    0 titles retrieved in the namespace 11
    0 titles retrieved in the namespace 12
    0 titles retrieved in the namespace 13
    0 titles retrieved in the namespace 14
    0 titles retrieved in the namespace 15
    0 titles retrieved in the namespace 2300
    0 titles retrieved in the namespace 2301
    0 titles retrieved in the namespace 2302
    0 titles retrieved in the namespace 2303
Titles saved at... group1mediawikidemosave_weborg_mediawiki_1277-20230111-titles.txt
3 page titles loaded
http://group1.mediawiki.demo.save-web.org/mediawiki-1.27.7/api.php
Getting the XML header from the API

Retrieving the XML for every page from the beginning

20 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Main Page, 1 edits (--xmlrevisions)
Trying to export all revisions from namespace 1
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 3
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 4
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 5
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 6
Trying to get wikitext from the allrevisions API and to build the XML
File:- * ? " - - -.png, 1 edits (--xmlrevisions)
File:BF76%1R FI4`GW3SDHTFD.png, 1 edits (--xmlrevisions)
Trying to export all revisions from namespace 7
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 8
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 9
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 10
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 11
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 12
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 13
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 14
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 15
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2300
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2301
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2302
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... group1mediawikidemosave_weborg_mediawiki_1277-20230111-history.xml
)Retrieving image filenames
Using API to retrieve image names...
Using API:Allimages to get the list of images
    Found 2 images
Sorting image filenames
2 image names loaded
Image filenames and URLs saved at... group1mediawikidemosave_weborg_mediawiki_1277-20230111-images.txt
Retrieving images from "start"
Creating "./group1mediawikidemosave_weborg_mediawiki_1277-20230111-wikidump/images" directory

->  Downloaded 2 images

Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/elsiehupp/wikiteam3/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

However, one file didn't download. The errors.log file contains:

2023-01-11 22:17:03: File ./group1mediawikidemosave_weborg_mediawiki_1277-20230111-wikidump/images/- * ? " - - -.png could not be created by OS
2023-01-11 22:17:05: File ./group1mediawikidemosave_weborg_mediawiki_1277-20230111-wikidump/images/- * ? " - - -.png.desc could not be created by OS
yzqzss commented 1 year ago

File naming involves many issues.

Also, I found that wikiteam has problems with truncating long filenames.
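As a rough illustration of the truncation problem, here is one way to shorten a long file name while keeping its extension and keeping shortened names distinct. It is a sketch under assumptions, not wikiteam's behaviour: truncate_filename and the 240-byte limit are hypothetical, and the extension is assumed to be short.

```python
import hashlib
import os


def truncate_filename(name: str, max_bytes: int = 240) -> str:
    """Shorten name to at most max_bytes UTF-8 bytes, keeping the extension.

    A short hash of the full name is appended so that two long names
    sharing a prefix do not end up with the same shortened name.
    """
    if len(name.encode("utf-8")) <= max_bytes:
        return name
    stem, ext = os.path.splitext(name)
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
    suffix = "_" + digest + ext
    budget = max(0, max_bytes - len(suffix.encode("utf-8")))
    # Cut on a byte budget, then drop any partial multi-byte character.
    short_stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return short_stem + suffix


long_ascii = truncate_filename("a" * 300 + ".png")
long_greek = truncate_filename("\u03bc" * 200 + ".png")
```

Measuring in bytes rather than characters matters because many file systems limit the encoded length of a name, and non-ASCII titles take more than one byte per character in UTF-8.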

Again, I will discuss and fix these problems later.

robkam commented 1 year ago

I tried it out on a wiki with hidden revisions and can confirm that dumpgenerator --failfast --xml --api URL/api.php dumps okay, losing only the hidden revisions. In contrast, dumpgenerator --failfast --xml --xmlrevisions --api URL/api.php fails with PageMissingError etc.

robkam commented 1 year ago

It now gets past where it previously failed on NTFS (over "Fairchild μA726").
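The error in this issue's title can be reproduced in isolation. The sketch below assumes the 'charmap' codec involved was cp1252, a common Windows default, which the thread does not confirm. It shows that the Greek letter in "Fairchild μA726" cannot be encoded that way, and that passing an explicit UTF-8 encoding when opening a file avoids the problem. The helpers can_encode and append_line are hypothetical.

```python
import os
import tempfile

title = "Fairchild \u03bcA726"  # \u03bc is GREEK SMALL LETTER MU


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def append_line(path: str, line: str) -> None:
    """Append a line using UTF-8 instead of the platform default encoding."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")


# Write the title to a temporary file and read it back.
tmp_dir = tempfile.mkdtemp()
log_path = os.path.join(tmp_dir, "log.txt")
append_line(log_path, title)
with open(log_path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

The mu sits at index 10 of the title, which matches the "position 10" reported in the error message.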

robkam commented 1 year ago

64 to change Special:Version.html to Special-Version.html