Closed: yzqzss closed this issue 1 year ago
@elsiehupp
https://github.com/WikiTeam/wikiteam/pull/436
This bug has not been reproduced in the current saveweb:test branch; maybe we have already avoided or fixed it. I am not sure whether to merge the PR.
@Pokechu22 Sorry to bother you, but can you provide more information on how to reproduce this bug?
I tested using:
"--api", "https://fr.wikiversity.org/w/api.php",
"--xml",
"--namespace", "11",
"--force"
and interrupted with Ctrl+C after it had fetched the XML for some pages. Then I reran the command, and resuming worked well.
(Linux Mint 21.1, Python 3.10)
Those instructions are correct; if you e.g. interrupt after it outputs `Discussion modèle:Apprentissage, 6 edits`, then when you resume it should again download `Discussion modèle:Apprentissage, 6 edits`. Prior to my fix, it would instead give a `UnicodeWarning` and then assume all pages had been downloaded. It's possible/likely that this issue just can't exist like that in Python 3, so you don't need to make any changes to fix it.
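(Side note: a minimal sketch of where that `UnicodeWarning` presumably came from, assuming the old resume check compared a raw bytes title read from the dump file against a unicode title; this is just an illustration, not the wikiteam3 code.)

```python
# Illustration only (not the wikiteam3 code): why the old failure mode is a
# Python 2 thing. In Python 2, comparing a non-ASCII byte string (str) with
# a unicode string emits UnicodeWarning and evaluates to False, so the last
# downloaded title was never found and the dump looked "complete".
title_from_api = "Discussion modèle:Apprentissage"                    # unicode title
title_from_dump = "Discussion modèle:Apprentissage".encode("utf-8")   # raw bytes from the file

# Python 3: bytes never compare equal to str (silently), and code that
# decodes the file explicitly compares str with str, so the same mismatch
# cannot happen in the same way.
print(title_from_api == title_from_dump)                  # False
print(title_from_api == title_from_dump.decode("utf-8"))  # True
```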
To reproduce the second commit, though, you need to follow the same steps, and then get lucky with when you hit Ctrl+C so that the file ends with a `</revision>` instead of a `</page>`. When resuming, it previously added a new `<page>` tag immediately, even if that would result in `</revision><page>`; now, the whole previous `<page>` tag will be removed and a new one created. (If the file did end with a `</page>`, then you can get a second `<page>` with the same contents, but that's not really a problem IMO.)
@Pokechu22
Thanks for your reply. I tried editing the XML so that it ends with a `</revision>` (and at various other positions inside the last `<page>`). `truncateXMLDump()` correctly removes the last `<page>` block.
Then I checked the git blame, and found that https://github.com/elsiehupp/wikiteam3/pull/5 fixed the problem.
"If the file did end with a
</page>
, then you can get a second<page>
with the same contents"
I pushed a commit to fix this little bug: truncate the XML back to a `<page>` that has a `<title>`.
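For reference, a rough sketch of the truncation idea as I understand it from this thread (not the actual `truncateXMLDump()` implementation; the helper name and the in-memory string handling are just for illustration):

```python
# Rough illustration only, not wikiteam3's truncateXMLDump(): cut the dump
# back to just before the last <page> block that actually has a <title>,
# so resuming re-downloads that page once instead of appending after a
# dangling </revision> or duplicating an already complete </page>.
# Assumes the dump fits in memory.

def truncate_last_page(xml_path: str) -> str | None:
    """Truncate the dump in place and return the <title> of the removed
    page, i.e. the page to resume from (None if nothing usable is left)."""
    with open(xml_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Walk backwards over <page> blocks until one has a complete <title>.
    cut = len(text)
    while True:
        start = text.rfind("<page>", 0, cut)
        if start == -1:
            return None                 # no usable <page> block at all
        block = text[start:cut]
        if "<title>" in block and "</title>" in block:
            break
        cut = start                     # title-less stub, step back further

    t0 = block.find("<title>") + len("<title>")
    t1 = block.find("</title>", t0)
    title = block[t0:t1]

    with open(xml_path, "w", encoding="utf-8") as f:
        f.write(text[:start])           # drop the last titled <page> block

    return title
```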
With `dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions --images --api url_to_api.php`, saveweb:test worked fine for backing up a couple of smaller wikis:
```
$ dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions --images --api https://simpleelectronics.miraheze.org/w/api.php
Checking API... https://simpleelectronics.miraheze.org/w/api.php
API is OK: https://simpleelectronics.miraheze.org/w/api.php
Checking index.php... https://simpleelectronics.miraheze.org/w/index.php
index.php is OK
No --path argument provided. Defaulting to:
[working_directory]/[domain_prefix]-[date]-wikidump
Which expands to:
./simpleelectronicsmirahezeorg_w-20230104-wikidump
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/elsiehupp/wikiteam3 #
#########################################################################
#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers #
# #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see <http://www.gnu.org/licenses/>. #
#########################################################################
Analysing https://simpleelectronics.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
28 namespaces found
15 titles retrieved in the namespace 0
1 titles retrieved in the namespace 1
5 titles retrieved in the namespace 2
1 titles retrieved in the namespace 3
7 titles retrieved in the namespace 4
0 titles retrieved in the namespace 5
18 titles retrieved in the namespace 6
0 titles retrieved in the namespace 7
33 titles retrieved in the namespace 8
0 titles retrieved in the namespace 9
89 titles retrieved in the namespace 10
0 titles retrieved in the namespace 11
15 titles retrieved in the namespace 12
0 titles retrieved in the namespace 13
17 titles retrieved in the namespace 14
0 titles retrieved in the namespace 15
0 titles retrieved in the namespace 3000
0 titles retrieved in the namespace 3001
1 titles retrieved in the namespace 3002
0 titles retrieved in the namespace 3003
38 titles retrieved in the namespace 828
0 titles retrieved in the namespace 829
0 titles retrieved in the namespace 710
0 titles retrieved in the namespace 711
0 titles retrieved in the namespace 2300
0 titles retrieved in the namespace 2301
0 titles retrieved in the namespace 2302
0 titles retrieved in the namespace 2303
Titles saved at... simpleelectronicsmirahezeorg_w-20230104-titles.txt
240 page titles loaded
https://simpleelectronics.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
28 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Resistor, 7 edits (--xmlrevisions))
Capacitor, 4 edits (--xmlrevisions))
WikiNode, 3 edits (--xmlrevisions))
Main Page, 8 edits (--xmlrevisions))
Tools, 6 edits (--xmlrevisions))
snipped
Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... simpleelectronicsmirahezeorg_w-20230104-history.xml
)Retrieving image filenames
Using API to retrieve image names...
Using API:Allimages to get the list of images
Found 18 images
Sorting image filenames
18 image names loaded
Image filenames and URLs saved at... simpleelectronicsmirahezeorg_w-20230104-images.txt
Retrieving images from "start"
Creating "./simpleelectronicsmirahezeorg_w-20230104-wikidump/images" directory
-> Downloaded 10 images
-> Downloaded 18 images
Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json
---> Congratulations! Your dump is complete <---
If you found any bug, report a new issue here:
https://github.com/WikiTeam/wikiteam/issues
If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
https://github.com/WikiTeam/wikiteam
Good luck! Bye!
```
Found a bug: `--xmlrevisions` seems to ignore `--resume` and re-downloads pages from the beginning.
Draft PR; try it with:
poetry install && poetry run dumpgenerator
or:
poetry install && poetry build && pip install --force-reinstall dist/*.whl
I will push more commits to the PR.