mediawiki-client-tools / mediawiki-dump-generator

Python 3 tools for downloading and preserving wikis
https://github.com/mediawiki-client-tools/mediawiki-scraper
GNU General Public License v3.0

Try to keep up with upstream, and other improvements. (Part 1) #49

Closed: yzqzss closed this 1 year ago

yzqzss commented 1 year ago

This is a draft PR. To try it, run:

poetry install && poetry run dumpgenerator

or:

poetry install && poetry build && pip install --force-reinstall dist/*.whl


I will push more commits to the PR.

robkam commented 1 year ago

@elsiehupp

yzqzss commented 1 year ago

https://github.com/WikiTeam/wikiteam/pull/436

I have not been able to reproduce this bug on the current saveweb:test branch; we may have already avoided or fixed it. I am not sure whether to merge it.


@Pokechu22 Sorry to bother you, but can you provide more information on how to reproduce this bug?

I tested using:

"--api", "https://fr.wikiversity.org/w/api.php",
"--xml",
"--namespace", "11",
"--force"

and interrupted it with Ctrl+C after it had fetched the XML for some pages.

When I reran the command, resuming worked well.

(Linux Mint 21.1, Python 3.10)

Pokechu22 commented 1 year ago

Those instructions are correct; if you e.g. interrupt after it outputs Discussion modèle:Apprentissage, 6 edits, when you resume it should again download Discussion modèle:Apprentissage, 6 edits. Prior to my fix, it would instead give a UnicodeWarning and then assume all pages had been downloaded.

It's possible/likely that this issue just can't exist like that in python 3, so you don't need to make any changes to fix it.

To reproduce the second commit, though, you need to follow the same steps, and then get lucky with when you hit Ctrl+C so that the file ends with a </revision> instead of a </page>. When resuming, it previously added a new <page> tag immediately, even if that would result in </revision><page>; now, the whole previous <page> tag will be removed and a new one created. (If the file did end with a </page>, then you can get a second <page> with the same contents, but that's not really a problem IMO).
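The behaviour described above (dropping an unfinished `<page>` block before resuming) can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `truncate_to_last_complete_page` is hypothetical, and it assumes the whole partial dump fits in memory as a string and that literal `<page>` / `</page>` tags only appear as real element boundaries (page text in an export is XML-escaped).

```python
def truncate_to_last_complete_page(xml_text: str) -> str:
    """Drop a trailing, unfinished <page> block from a partial dump.

    Hypothetical helper for illustration; not the project's implementation.
    """
    last_open = xml_text.rfind("<page>")
    last_close = xml_text.rfind("</page>")

    if last_open == -1:
        # No page has been started yet (only the header was written).
        return xml_text

    if last_close > last_open:
        # The last page was closed properly: keep everything up to </page>.
        return xml_text[: last_close + len("</page>")] + "\n"

    # The last <page> was never closed (e.g. the file ends with </revision>):
    # remove the whole unfinished block so the page can be downloaded again.
    return xml_text[:last_open].rstrip() + "\n"


complete = (
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n    <revision>r1</revision>\n  </page>\n"
)
interrupted = complete + "  <page>\n    <title>B</title>\n    <revision>r1</revision>\n"

# The unfinished page B is removed; the finished page A is kept.
print(truncate_to_last_complete_page(interrupted) == complete)
```

A real implementation would more likely scan the file backwards in chunks rather than load it whole, since history dumps can be very large.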

yzqzss commented 1 year ago

@Pokechu22
Thanks for your reply. I edited the XML so that it ends with a </revision> (and also tried cutting it at other positions inside the last <page>). In each case, truncateXMLDump() correctly removes the last <page> block.

Then I checked the git blame, and found that https://github.com/elsiehupp/wikiteam3/pull/5 fixed the problem.


"If the file did end with a </page>, then you can get a second <page> with the same contents"

I pushed a commit to fix this small bug: it truncates the XML back to a <page> with a <title>.
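One possible reading of that fix, sketched under assumptions: cut the dump just before the last `<page>` that has a `<title>`, and remember that title so the resumed run downloads the page exactly once instead of appending a duplicate. The function name `cut_before_last_titled_page` is hypothetical, and the sketch assumes `<title>` is the first child of `<page>`, as in standard MediaWiki exports; the actual commit may work differently.

```python
import re
from html import unescape
from typing import Optional, Tuple

# Matches the start of a page block together with its title.
PAGE_WITH_TITLE = re.compile(r"<page>\s*<title>(.*?)</title>", re.DOTALL)


def cut_before_last_titled_page(xml_text: str) -> Tuple[str, Optional[str]]:
    """Return (text without the last titled page, title to resume from).

    Hypothetical helper for illustration; not the project's implementation.
    """
    matches = list(PAGE_WITH_TITLE.finditer(xml_text))
    if not matches:
        # Nothing to cut: no page with a title has been written yet.
        return xml_text, None

    last = matches[-1]
    kept = xml_text[: last.start()].rstrip() + "\n"
    # Titles are XML-escaped in the dump, so unescape before comparing
    # them with titles from the title list.
    return kept, unescape(last.group(1))


dump = (
    "<mediawiki>\n"
    "  <page>\n    <title>A</title>\n  </page>\n"
    "  <page>\n    <title>B &amp; C</title>\n  </page>\n"
)
kept, resume_title = cut_before_last_titled_page(dump)
print(resume_title)
```

Because the last page is always removed and fetched again, it does not matter whether the interrupted file ended with `</revision>` or `</page>`: the resumed dump contains that page once.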

robkam commented 1 year ago

With dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions --images --api url_to_api.php, saveweb:test worked fine to back up a couple of smaller wikis.

robkam commented 1 year ago

$ dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions --images --api https://simpleelectronics.miraheze.org/w/api.php
Checking API... https://simpleelectronics.miraheze.org/w/api.php                                                           
API is OK: https://simpleelectronics.miraheze.org/w/api.php                                                                
Checking index.php... https://simpleelectronics.miraheze.org/w/index.php                                                   
index.php is OK                                                                                                            
No --path argument provided. Defaulting to:                                                                                
  [working_directory]/[domain_prefix]-[date]-wikidump                                                                      
Which expands to:                                                                                                          
  ./simpleelectronicsmirahezeorg_w-20230104-wikidump                                                                       
#########################################################################                                                  
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)             #                                                  
# More info at: https://github.com/elsiehupp/wikiteam3                  #                                                  
#########################################################################                                                  

#########################################################################                                                  
# Copyright (C) 2011-2023 WikiTeam developers                           #                                                  
#                                                                       #                                                  
# This program is free software: you can redistribute it and/or modify  #                                                  
# it under the terms of the GNU General Public License as published by  #                                                  
# the Free Software Foundation, either version 3 of the License, or     #                                                  
# (at your option) any later version.                                   #                                                  
#                                                                       #                                                  
# This program is distributed in the hope that it will be useful,       #                                                  
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #                                                  
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #                                                  
# GNU General Public License for more details.                          #                                                  
#                                                                       #                                                  
# You should have received a copy of the GNU General Public License     #                                                  
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #                                                  
#########################################################################                                                  

Analysing https://simpleelectronics.miraheze.org/w/api.php                                                                 
Trying generating a new dump into a new directory...                                                                       
Loading page titles from namespaces = all                                                                                  
Excluding titles from namespaces = None                                                                                    
28 namespaces found                                                                                                        
    15 titles retrieved in the namespace 0                                                                                 
    1 titles retrieved in the namespace 1                                                                                  
    5 titles retrieved in the namespace 2                                                                                  
    1 titles retrieved in the namespace 3                                                                                  
    7 titles retrieved in the namespace 4                                                                                  
    0 titles retrieved in the namespace 5                                                                                  
    18 titles retrieved in the namespace 6                                                                                 
    0 titles retrieved in the namespace 7                                                                                  
    33 titles retrieved in the namespace 8                                                                                 
    0 titles retrieved in the namespace 9                                                                                  
    89 titles retrieved in the namespace 10                                                                                
    0 titles retrieved in the namespace 11                                                                                 
    15 titles retrieved in the namespace 12                                                                                
    0 titles retrieved in the namespace 13                                                                                 
    17 titles retrieved in the namespace 14                                                                                
    0 titles retrieved in the namespace 15                                                                                 
    0 titles retrieved in the namespace 3000                                                                               
    0 titles retrieved in the namespace 3001                                                                               
    1 titles retrieved in the namespace 3002                                                                               
    0 titles retrieved in the namespace 3003                                                                               
    38 titles retrieved in the namespace 828                                                                               
    0 titles retrieved in the namespace 829                                                                                
    0 titles retrieved in the namespace 710                                                                                
    0 titles retrieved in the namespace 711                                                                                
    0 titles retrieved in the namespace 2300                                                                               
    0 titles retrieved in the namespace 2301                                                                               
    0 titles retrieved in the namespace 2302                                                                               
    0 titles retrieved in the namespace 2303                                                                               
Titles saved at... simpleelectronicsmirahezeorg_w-20230104-titles.txt                                                      
240 page titles loaded                                                                                                     
https://simpleelectronics.miraheze.org/w/api.php                                                                           
Getting the XML header from the API                                 

Retrieving the XML for every page from the beginning                                                                       

28 namespaces found                                                                                                        
Trying to export all revisions from namespace 0                                                                            
Trying to get wikitext from the allrevisions API and to build the XML                                                      
Resistor, 7 edits (--xmlrevisions))                                                                                        
Capacitor, 4 edits (--xmlrevisions))                                                                                       
WikiNode, 3 edits (--xmlrevisions))                                                                                        
Main Page, 8 edits (--xmlrevisions))                                                                                       
Tools, 6 edits (--xmlrevisions))    

snipped

Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... simpleelectronicsmirahezeorg_w-20230104-history.xml
)Retrieving image filenames
Using API to retrieve image names...
Using API:Allimages to get the list of images
    Found 18 images
Sorting image filenames
18 image names loaded
Image filenames and URLs saved at... simpleelectronicsmirahezeorg_w-20230104-images.txt
Retrieving images from "start"
Creating "./simpleelectronicsmirahezeorg_w-20230104-wikidump/images" directory

->  Downloaded 10 images

->  Downloaded 18 images

Downloading index.php (Main Page) as index.html
Downloading Special:Version with extensions and other related info
Downloading site info as siteinfo.json

---> Congratulations! Your dump is complete <---

If you found any bug, report a new issue here:
  https://github.com/WikiTeam/wikiteam/issues

If this is a public wiki, please, consider publishing this dump.
Do it yourself as explained in:
  https://github.com/WikiTeam/wikiteam/wiki/Tutorial#Publishing_the_dump
Or contact us at:
  https://github.com/WikiTeam/wikiteam

Good luck! Bye!

yzqzss commented 1 year ago

"With dumpgenerator --delay 0.0 --failfast --xml --xmlrevisions --images --api url_to_api.php, saveweb:test worked fine to back up a couple of smaller wikis."

Found a bug: --xmlrevisions seems to ignore --resume and re-downloads the pages from the beginning.