Open malfonsi opened 8 years ago
This seems like it should work. The Mediawiki API is returning some results (as shown by the version query and the "list of pages..." step not returning errors).
Maybe there is something in the older mediawiki install that causes it to fail. I don't know if updating it to a newer version is an option.
Unfortunately I'm not able to provide much support for yamdwe any more. As advertised on the front page, I'm looking for a new maintainer. Good luck finding the problem though!
Angus
I'm getting this same error on mediawiki 1.21.1.
I seem to have fixed it by changing the URL I give to the script. I was doing
python yamdwe.py http://localhost/mediawiki-1.21.1/ /var/www/dokuwiki/
by changing it to
python yamdwe.py http://localhost/mediawiki-1.21.1/api.php /var/www/dokuwiki/
It then worked. I'm not sure if I missed something in the documentation telling you to put api.php, if it wasn't documented or if newer versions of mediawiki don't need you to put api.php?
@colinsauze Thanks for following up. The docs do say to use /api.php (and in the original issue post from @malfonsi you'll see that /api.php was in the URL, so it's a different issue with a similar symptom.)
However, people seem to miss this point in the docs a lot (which is fair enough) so I just added a warning message if the URL doesn't end in api.php, and also a clearer error message if a non-JSON response comes back.
@malfonsi I don't think any of this will help fix the problem you're seeing, unfortunately. For some reason the api.php on your wiki sent back JSON for the first few requests, then returned some kind of non-JSON message.
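A minimal sketch of the kind of URL check described above. This is not yamdwe's actual code: the function name `check_api_url` and the warning text are hypothetical, and the check only looks at how the URL path ends.

```python
import sys


def check_api_url(url):
    """Return True if the URL path ends in api.php, otherwise warn and return False.

    Hypothetical helper for illustration; not taken from yamdwe.
    """
    # Drop any query string before looking at the end of the path.
    path = url.split("?", 1)[0]
    if path.endswith("api.php"):
        return True
    sys.stderr.write(
        "WARNING: %s does not end in api.php; "
        "the MediaWiki API URL is usually <wiki root>/api.php\n" % url
    )
    return False
```

With the two URLs from this thread, `http://localhost/mediawiki-1.21.1/` would trigger the warning and `http://localhost/mediawiki-1.21.1/api.php` would not.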
@malfonsi I just added a --verbose option that will print the exact invalid content if non-JSON content is returned. Can you rerun with this option and see what you see?
(It may be a lot of output depending on what the wiki is doing!)
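To illustrate the idea of a clearer error for non-JSON responses with an optional dump of the raw content, here is a rough sketch. The exception class, the function name and the `verbose` parameter are assumptions made for this example and do not reflect how yamdwe implements its --verbose option.

```python
import json


class NonJsonResponse(Exception):
    """Raised when the API body cannot be parsed as JSON (hypothetical name)."""


def parse_api_response(body, verbose=False):
    """Parse an API response body, raising a descriptive error if it is not JSON."""
    try:
        return json.loads(body)
    except ValueError:
        # json raises ValueError (or a subclass of it) on invalid input.
        message = "API returned a non-JSON response (%d characters)" % len(body)
        if verbose:
            # Only include the raw content on request, since it can be very long.
            message += ":\n" + body
        raise NonJsonResponse(message)
```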
@projectgus Thanks for the extra command option. It helped to narrow down the source of the problem.
Basically the script crashes on the first page containing this special sequence of characters (the extra information on the use of these characters comes from googling):
UTF-8 bytes (hex) | Code point | Name
0xE2,0x80,0x8E | U+200E (invisible) | Left-to-right mark (HTML entity &lrm;)
(sorry if the table above does not show right)
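For anyone who wants to locate such invisible characters without switching editors, the following sketch lists Unicode format characters (category Cf, which includes the left-to-right mark) in a piece of text. The sample file name is made up for the example, and the code is written for Python 3 rather than the Python 2 environment yamdwe runs in.

```python
import unicodedata


def describe_invisible(text):
    """Return (index, code point, name) for every format character (category Cf)."""
    return [
        (index, "U+%04X" % ord(char), unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Cf"
    ]


# Hypothetical media name ending in the byte sequence from the table above.
raw = b"File:photo.jpg\xe2\x80\x8e"
text = raw.decode("utf-8")
print(describe_invisible(text))
```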
As a funny aside, this sequence really is present in the wiki source (annoyingly invisible in simple text editors, but I "unveiled" it by copying and pasting the text into emacs and switching to ASCII encoding). I have no clue how we inserted this special sequence into our pages; maybe by copying and pasting from other web pages. End of the aside.
Anyway, is there any option to catch and digest this or similar sequences (ignoring them would be fine for me)? I have only basic knowledge of Python syntax and I am not familiar with any of the modules used, but maybe you can point me to the module that does the conversion and I can try to go through the code (I have found a few suggestions on Stack Overflow by googling the error message).
Thanks in advance for any additional hint.
P.S. I attach below the error message in case my interpretation was completely wrong:
Converting 22 revisions of page '02 April 2012'...
Traceback (most recent call last):
  File "yamdwe.py", line 93, in <module>
    main()
  File "yamdwe.py", line 65, in main
    exporter.write_pages(pages)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/dokuwiki.py", line 97, in _convert_page
    content = wikicontent.convert_pagecontent(full_title, revision["*"])
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 68, in convert_pagecontent
    result = convert(root, context, False)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 90, in convert
    return convert_children(node, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 94, in convert
    return convert_children(node, context) + "\n"
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 204, in convert
    converted_list = convert_children(itemlist, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 210, in convert
    item_content = convert_children(item, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 281, in convert
    return convert_children(node, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 193, in convert
    return "{{%s%s}}" % (filename, caption)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
Hi @malfonsi,
That's a very helpful update. Weird that the error message seems to have changed totally now, but well done tracking down the specific problem with that Unicode sequence.
I've just made an update that may solve this problem by allowing link filenames and captions to be unicode instead of just ASCII (Python 2 is really painful in its ASCII vs Unicode handling, I wish we could use Python 3 for yamdwe but the Mediawiki library mwlib doesn't support it yet.)
Please let me know if that update improves things.
Angus
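The general idea of letting link filenames and captions be Unicode can be sketched as follows. This is not the change that was made in yamdwe: `to_text` and `format_media_link` are hypothetical names, the sample filename is invented, and the code is written for Python 3, where the bytes/text split is explicit. It decodes any byte strings up front so that the `"{{%s%s}}"` formatting from the traceback never mixes bytes and text.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_media_link(filename, caption=""):
    """Build a {{filename caption}} media link from text or UTF-8 bytes."""
    # Normalise both parts to text before formatting, so a non-ASCII byte
    # such as 0xe2 is decoded as UTF-8 instead of tripping an ASCII codec.
    return "{{%s%s}}" % (to_text(filename), to_text(caption))
```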
Dear @projectgus,
sorry for the long silence, but I tried a few things by myself. The patch does not work, but I am now convinced that there is something wrong in my virtualenv (or in the Debian 8 environment as a whole).
In fact I added these two lines to the main script "yamdwe.py", just after the main def:
def main():
    print("Start of the program")
    print("Stupid sequence \342\200\216")
and I get the error:
  File "yamdwe.py", line 21, in main
    print("Stupid sequence \342\200\216")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)
I made a shorter script "test.py":
#from __future__ import print_function, unicode_literals, absolute_import, division
#import argparse, sys, codecs, locale, getpass, datetime
#from pprint import pprint
#import mediawiki, dokuwiki, wikicontent
def main():
    print("Start")
    print(u"Stupid string\u200e")
    stupid = "Stupid string\342\200\216"
    stupid2 = stupid.decode("utf-8")
    print(stupid2[13])
    print(len(stupid2.replace(u"\u200e", "")))
main()
which works AS LONG AS ALL THE MODULE IMPORTS STAY COMMENTED OUT. You can see there my attempt to get rid of this special HTML entity, because it cannot be part of a filename (if you go back to my previous post you will see that the initial problem occurred while getting the name of the uploaded media).
The error seems to come from one of the modules (again, I am not a python expert, I can only guess):
  File "test.py", line 14, in main
    stupid2 = stupid.decode("utf-8")
  File "/localscratch/TestMediawikiToDokuwiki/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)
Narrowing down the problem, it seems connected to the import of the symbol unicode_literals, i.e. this script works as expected if I do not import it.
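A likely explanation for the behaviour above: with unicode_literals, the literal "Stupid string\342\200\216" is a text string holding three separate code points (U+00E2, U+0080, U+008E) rather than three bytes, and calling .decode() on a text string in Python 2 first encodes it implicitly with the ASCII codec, which fails. Python 3 literals behave like Python 2 literals under unicode_literals, so the mechanism can be reproduced there. The sketch below makes the implicit ASCII step explicit; `try_ascii` is a name invented for this example.

```python
# Text literal: the octal escapes become three code points, not three bytes.
as_text = "Stupid string\342\200\216"
# Bytes literal: the same escapes are the UTF-8 encoding of U+200E.
as_bytes = b"Stupid string\342\200\216"


def try_ascii(text):
    """Attempt the ASCII encoding that Python 2 performs implicitly on text."""
    try:
        text.encode("ascii")
        return "ok"
    except UnicodeEncodeError as error:
        # error.end is exclusive, so subtract one to get the last bad index.
        return "UnicodeEncodeError at positions %d-%d" % (error.start, error.end - 1)


print(len(as_text), [hex(ord(c)) for c in as_text[13:]])
print(try_ascii(as_text))
print(repr(as_bytes.decode("utf-8")))
```

Decoding only makes sense on the bytes form; the text form already contains the wrong characters, so the fix is to start from bytes (or from a `u"\u200e"` style literal) rather than to decode text.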
Anyway:
OPTION A: do you suggest recreating the virtualenv? Please take into account that I created the virtualenv following the "Alternative Installation for Debian/Ubuntu Linux" directions (the system is Debian 8). I can try the full virtualenv installation (the paragraph just before), but not before next year.
OPTION B: do you have any suggestion for working around this problem with the current virtualenv? In the end this conversion is a one-off, so a quick and dirty solution would also be fine.
Thanks again, Matteo
Something to add:
I found this page that can maybe be useful: http://python-future.org/unicode_literals.html
but with my python experience I need some time to digest it...
I managed to solve the issue. Basically:
1) I removed the import of "unicode_literals" everywhere. I am not sure this is strictly needed, but it made the workaround in (2) easier for me to implement.
2) For each page's content, I remove the special character '\u200e' using the string replace method, immediately after reading the content and before mwlib starts parsing it.
I think that this is something very specific to my case, and I admit that I have not really understood the real source of the problem; I have just found a workaround for a tool that I need only once. Therefore @projectgus, you are probably not interested in including any modification from my side, but just let me know if you think otherwise.
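The workaround in step (2) could look roughly like this. It is a sketch, not the code that was actually used: the function name `strip_marks` and its `marks` parameter are invented here, and the example is written for Python 3 text strings (in Python 2 the page content would need to be a unicode object first).

```python
LEFT_TO_RIGHT_MARK = "\u200e"


def strip_marks(content, marks=LEFT_TO_RIGHT_MARK):
    """Remove every character listed in `marks` from the page content."""
    # Map each unwanted code point to None so that translate() drops it.
    table = {ord(mark): None for mark in marks}
    return content.translate(table)


# Apply right after reading a revision and before handing it to the parser.
page_text = "[[File:photo.jpg\u200e|thumb]]"
print(strip_marks(page_text))
```

Passing a longer `marks` string would cover similar invisible characters, but only the left-to-right mark is known to occur in this wiki.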
Hey. I had the same error with the ascii codec stuff.
Executing
export LC_ALL=C
pip install --upgrade setuptools
helped. I don't know what this command does to python, so be careful.
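The locale variables influence which encodings Python picks for its output streams, which is one place where "ascii codec" errors can originate. The snippet below does not explain why the commands above helped; it is only a small sketch for inspecting what Python has picked in the current shell, and the function name `report_encodings` is made up for this example.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python derives from the current environment."""
    return {
        # Encoding suggested by the locale settings (LC_ALL, LANG, ...).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing; may be None if stdout is replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%s: %s" % (name, value))
```

Running it before and after changing LC_ALL shows whether the change actually affects the Python process.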
(This is more a "help request" rather than an issue report, but I don't know how to communicate with the authors otherwise)
I am trying to convert an old Mediawiki (detected as v. 1.15.5-1) to the latest version of Dokuwiki, "Detritus", on a Debian 8 system. I am not sure if I correctly followed the directions in the README.
Dokuwiki was freshly installed, using the "install.php" script as suggested. I kept it publicly open and I checked that content can be added without problems (I basically edited the start page without logging in). I have write access (I mean as a unix user) to these directories.
To run yamdwe, I set up a virtual environment as described. However, the script crashes after the message "Query page revisions (this may take a while)..."
You can see below the output:
Please let me know if you need additional information or you want to perform other tests