mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Dumpgenerator.py completes but dump fails the grep check. #39

Status: Closed (robkam closed this issue 1 year ago)

robkam commented 1 year ago

Environment: Windows 10, Git Bash, python3 branch. The XML dump consists mostly of a single, extremely long line. It fails the grep check described in the WikiTeam tutorial when generated with --xmlrevisions, but passes when --curonly is used instead.

$ dumpgenerator --failfast --xml --xmlrevisions --api=https://sdiywiki.miraheze.org/w/api.php
Checking API... https://sdiywiki.miraheze.org/w/api.php
API is OK: https://sdiywiki.miraheze.org/w/api.php
Checking index.php... https://sdiywiki.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
  [working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
  ./sdiywikimirahezeorg_w-20221215-wikidump
--delay is the default value of 0.5
There will be a 0.5 second delay between HTTP calls in order to keep the server from timing you out.
If you know that this is unnecessary, you can manually specify '--delay 0.0'.
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #
# More info at: https://github.com/elsiehupp/wikiteam3                  #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #
#                                                                       #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://sdiywiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
28 namespaces found
    642 titles retrieved in the namespace 0
    20 titles retrieved in the namespace 1
    18 titles retrieved in the namespace 2
    6 titles retrieved in the namespace 3
    10 titles retrieved in the namespace 4
    0 titles retrieved in the namespace 5

<snipped>

        1 more revisions exported
        1 more revisions exported
        1 more revisions exported
        16 more revisions exported
        25 more revisions exported
Trying to export all revisions from namespace 829
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 710
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 711
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2300
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2301
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2302
Trying to get wikitext from the allrevisions API and to build the XML
Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... sdiywikimirahezeorg_w-20221215-history.xml
Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/WikiTeam/wikiteam/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

$ cd sdiywikimirahezeorg_w-20221215-wikidump/

$ grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
1
1
1
1
1
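
A note on why every count above is 1: `grep -c` counts matching lines, not matching occurrences, so a dump squeezed onto one long line reports 1 for every tag no matter how many pages it holds. The sketch below is a hypothetical cross-check, not part of dumpgenerator; the function names and the sample string are assumptions made for illustration. It counts occurrences directly and also mimics the per-line count for comparison.

```python
# Hypothetical cross-check for the grep test; not part of dumpgenerator.
TAGS = ["<title>", "<page>", "</page>", "<revision>", "</revision>"]


def count_tags(xml_text, tags=TAGS):
    """Count every occurrence of each tag, regardless of line breaks."""
    return {tag: xml_text.count(tag) for tag in tags}


def count_matching_lines(xml_text, tags=TAGS):
    """Mimic `grep -c`: count the lines that contain each tag at least once."""
    lines = xml_text.splitlines()
    return {tag: sum(1 for line in lines if tag in line) for tag in tags}


# Made-up sample: two pages on a single physical line.
sample = (
    "<page><title>A</title><revision></revision></page>"
    "<page><title>B</title><revision></revision></page>"
)

occurrences = count_tags(sample)
per_line = count_matching_lines(sample)
print(occurrences)
print(per_line)
```

If the occurrence counts for `<page>` and `</page>` (and for `<revision>` and `</revision>`) agree with each other while the per-line counts are all 1, the dump may be complete and only its line breaks are missing.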

robkam commented 1 year ago

Replacing every literal \n sequence (a backslash followed by the letter n) in the XML with an actual newline seems to make the dump pass the grep check.
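
The workaround can be sketched in Python as follows. This is a minimal sketch, assuming the file contains the two-character sequence backslash + n where line breaks should be; `unescape_newlines` and `fix_file` are hypothetical helper names, not dumpgenerator functions. Note that it would also convert any backslash-n sequence that legitimately appears in page text, so it is a stopgap rather than a fix for the generator.

```python
# Hypothetical post-processing helpers; not part of dumpgenerator.


def unescape_newlines(xml_text):
    """Replace each literal backslash-n pair with a real newline character."""
    return xml_text.replace("\\n", "\n")


def fix_file(src_path, dst_path):
    """Write a copy of src_path with literal backslash-n pairs unescaped.

    Reads the whole file into memory, which is assumed to be acceptable
    for a small wiki dump.
    """
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        dst.write(unescape_newlines(text))


# Made-up sample: one physical line containing literal backslash-n pairs.
broken = "<page>\\n<title>A</title>\\n</page>"
fixed = unescape_newlines(broken)
print(fixed.splitlines())
```

For the dump in this report, a call such as `fix_file("sdiywikimirahezeorg_w-20221215-history.xml", "fixed-history.xml")` would write the result to a new file (the output name is an assumption) and leave the original untouched.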