trappedinspacetime / wikiteam

Automatically exported from code.google.com/p/wikiteam

Error loop "XML for ... is wrong" #26

Status: Open. GoogleCodeExporter opened this issue 8 years ago.

GoogleCodeExporter commented 8 years ago
Apparently this error is quite frequent with some characters. This starts a never-ending loop; see e.g. http://p.defau.lt/?vUHNXKoaCOfNkeor_0HmCg
I removed that title from the title list and resumed the dump; the following pages were not downloaded, perhaps because they were invalid: http://p.defau.lt/?KeDck2rQZqGlp9MWmYmB_Q
Could those (invalid?) titles be the actual problem behind the error?

Original issue reported on code.google.com by nemow...@gmail.com on 12 Jul 2011 at 6:21

GoogleCodeExporter commented 8 years ago
I don't think it is due to weird characters.

Look at the edit history for that page (is it a big history?) and try to export it by hand using Special:Export (then open the resulting XML file). Is everything OK?

Original comment by emi...@gmail.com on 12 Jul 2011 at 8:02
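
[Editor's note: the hand-export suggested above can be scripted. Below is a minimal sketch in modern Python; the parameter names follow the common Special:Export form and the exact index.php URL and accepted fields vary by wiki and MediaWiki version, so treat the details as assumptions rather than the project's actual code.]

```python
import urllib.parse
import urllib.request
import xml.dom.minidom

def export_page(index_url, title):
    # POST to index.php?title=Special:Export; history=1 asks for every
    # revision of the page, not just the latest one.
    data = urllib.parse.urlencode({
        'title': 'Special:Export',
        'pages': title,
        'history': '1',
        'action': 'submit',
    }).encode('utf-8')
    with urllib.request.urlopen(index_url, data) as resp:
        return resp.read()

xml_bytes = export_page('http://www.wikinfo.org/index.php', 'Être et Temps')
xml.dom.minidom.parseString(xml_bytes)  # raises ExpatError if the XML is broken
```

If the parse raises, the problem is on the server side rather than in dumpgenerator.py, which is exactly the distinction the comment above is drawing.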

GoogleCodeExporter commented 8 years ago
I tried an export with "Être et Temps", then with "Être et Temps" plus all the following titles, then with only the following titles. Whenever "Être et Temps" is included, the XML is invalid. The history has only one, very large revision (1,389,912 bytes): http://www.wikinfo.org/index.php?title=%C3%8Atre_et_Temps&action=history
Is that size really enough to cause problems?

Original comment by nemow...@gmail.com on 13 Jul 2011 at 7:17

GoogleCodeExporter commented 8 years ago
If you have problems when trying to export that article by hand using Special:Export, then it is not a dumpgenerator.py problem. That said, I have just exported it by hand without problems. Try again with --resume. The server may be overloaded from time to time (and the slower ones have problems exporting huge revisions; PHP errors).

Original comment by emi...@gmail.com on 14 Jul 2011 at 6:52

GoogleCodeExporter commented 8 years ago
So, repurposing this bug: the real problem is the infinite loop.
After 5 failed retries, getXMLPageCore calls itself again for the last revision only; that request fails as well, so getXMLPageCore is called again and again.

Original comment by nemow...@gmail.com on 28 May 2012 at 8:29
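
[Editor's note: for readers without the source at hand, the loop described above is roughly the following. This is a minimal sketch, not the actual dumpgenerator.py code; fetch_export is a hypothetical stand-in for the Special:Export request, and the completeness check is simplified.]

```python
def fetch_export(params):
    # Hypothetical stand-in for the real HTTP request; here it simulates
    # a download that keeps failing for this page.
    return ''

def getXMLPageCore(params, maxretries=5):
    xml = ''
    c = 0
    while '</page>' not in xml:  # simplified "XML for ... is wrong" test
        if c >= maxretries:
            # BUG: the fallback re-enters this same function. If the
            # single-revision request fails too, every call recurses
            # again; in the real script, with sleeps between network
            # retries, the loop effectively never ends.
            return getXMLPageCore(dict(params, limit=1), maxretries)
        xml = fetch_export(params)
        c += 1
    return xml
```

Against a page whose export keeps failing, getXMLPageCore({'pages': 'Être et Temps'}) just keeps re-entering itself; a bounded fix would try the single-revision fallback once and then give up.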

GoogleCodeExporter commented 8 years ago
Should be fixed by r675.
Example: http://p.defau.lt/?3quwoS3nepAPnje3ro4WQw now gives http://p.defau.lt/?ZB8tXAcnI8c178eseWMGWA

Original comment by nemow...@gmail.com on 28 May 2012 at 9:36

GoogleCodeExporter commented 8 years ago
Issue 51 has been merged into this issue.

Original comment by nemow...@gmail.com on 12 Jun 2012 at 7:00

GoogleCodeExporter commented 8 years ago
Reopened because the "fix" was reverted.

Original comment by nemow...@gmail.com on 22 Jun 2012 at 7:14

GoogleCodeExporter commented 8 years ago
Still happening, e.g. for http://www.editthis.info/4chan/api.php

Original comment by nemow...@gmail.com on 8 Dec 2013 at 9:11

GoogleCodeExporter commented 8 years ago
To put this very annoying issue behind me, I just added sys.exit() right after 
"if c >= maxretries:" in dumpgenerator.py. launcher.py then makes the run 
continue to the next wiki in the list. A brute-force workaround...

Original comment by nemow...@gmail.com on 31 Jan 2014 at 12:41
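
[Editor's note: in sketch form, reusing the simplified structure from the earlier example, the workaround replaces the recursive fallback with a hard exit. Again, this is an illustration, not the actual dumpgenerator.py code.]

```python
import sys

def fetch_export(params):
    return ''  # hypothetical stand-in for the failing HTTP request, as before

def getXMLPageCore(params, maxretries=5):
    xml = ''
    c = 0
    while '</page>' not in xml:
        if c >= maxretries:
            # Workaround: abort the whole dump instead of recursing.
            # launcher.py then picks up the next wiki in its list, so one
            # bad page no longer stalls an entire batch run.
            sys.exit()
        xml = fetch_export(params)
        c += 1
    return xml
```

The cost is that the current wiki's dump is left incomplete, which is why the comment above calls it brute force.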
