projectgus / yamdwe

Yet Another Mediawiki to DokuWiki Exporter

Crash when converting a MediaWiki 1.15.5-1 to a fresh "Detritus" Dokuwiki #38

Open malfonsi opened 8 years ago

malfonsi commented 8 years ago

(This is more of a "help request" than an issue report, but I don't know how else to contact the authors.)

I am trying to convert an old MediaWiki (detected as v. 1.15.5-1) to the latest DokuWiki release, "Detritus", on a Debian 8 system. I am not sure whether I followed the directions in the README correctly.

DokuWiki was freshly installed using the "install.php" script, as suggested. I kept it publicly open and checked that content can be added without problems (basically, I edited the start page without logging in). I have write access (as a Unix user) to these directories.

To run yamdwe, I set up a virtual environment as described. However, the script crashes after the message "Query page revisions (this may take a while)...".

You can see the output below:

(env) /localscratch/TestMediawikiToDokuwiki/yamdwe-master$ ./yamdwe.py --wiki_user "xxx xxx" http://yyyy/zzz/api.php /var/www/html/dokuwiki/
Enter password for Wiki login (xxx xxx):
Logging in as xxx xxx...
MediaWiki 1.15.5-1 meets version requirements.
Getting list of pages...
Query page revisions (this may take a while)...
Traceback (most recent call last):
  File "./yamdwe.py", line 89, in <module>
    main()
  File "./yamdwe.py", line 46, in main
    pages = importer.get_all_pages()
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe-master/mediawiki.py", line 42, in get_all_pages
    page["revisions"] = self._get_revisions(page)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe-master/mediawiki.py", line 52, in _get_revisions
    revisions = self._query(query, [ 'pages', str(pageid), 'revisions' ])
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe-master/mediawiki.py", line 81, in _query
    response = self.mw.call(query)
  File "/localscratch/TestMediawikiToDokuwiki/env/local/lib/python2.7/site-packages/simplemediawiki.py", line 184, in call
    return json.loads(self._fetch_http(self._api_url, params))
  File "/localscratch/TestMediawikiToDokuwiki/env/local/lib/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/localscratch/TestMediawikiToDokuwiki/env/local/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/localscratch/TestMediawikiToDokuwiki/env/local/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Please let me know if you need additional information or want me to perform other tests.

projectgus commented 8 years ago

This seems like it should work. The MediaWiki API is returning some results (as shown by the version query and the "list of pages..." step not returning errors).

Maybe something in the older MediaWiki install causes it to fail. I don't know if updating it to a newer version is an option.

Unfortunately I'm not able to provide much support for yamdwe any more. As advertised on the front page, I'm looking for a new maintainer. Good luck finding the problem though!

Angus

colinsauze commented 8 years ago

I'm getting this same error on mediawiki 1.21.1.

colinsauze commented 8 years ago

I seem to have fixed it by changing the URL I give to the script. I was doing

python yamdwe.py http://localhost/mediawiki-1.21.1/ /var/www/dokuwiki/

and changed it to

python yamdwe.py http://localhost/mediawiki-1.21.1/api.php /var/www/dokuwiki/

It then worked. I'm not sure whether I missed something in the documentation telling you to add api.php, whether it wasn't documented, or whether newer versions of MediaWiki don't need api.php in the URL.

projectgus commented 8 years ago

@colinsauze Thanks for following up. The docs do say to use /api.php (and in the original issue post from @malfonsi you'll see that /api.php was in the URL, so it's a different issue with a similar symptom.)

However, people seem to miss this point in the docs a lot (which is fair enough), so I just added a warning message if the URL doesn't end in api.php, and also a clearer error message if a non-JSON response comes back.

@malfonsi I don't think any of this will help fix the problem you're seeing, unfortunately. For some reason the api.php on your wiki sent back JSON for the first few requests, then returned some kind of non-JSON message.
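
For readers following along, the two checks described above could look roughly like the sketch below. This is not yamdwe's actual code: the function names `check_api_url` and `parse_api_response` and the message wording are made up for illustration.

```python
import json


def check_api_url(url):
    """Return a warning string if the URL does not end in api.php, else None.

    Hypothetical helper, not taken from yamdwe.
    """
    if not url.endswith("api.php"):
        return "Warning: %r does not end in api.php; is this the MediaWiki API URL?" % url
    return None


def parse_api_response(body):
    """Decode a response body as JSON, with a readable error for non-JSON content.

    Hypothetical helper, not taken from yamdwe.
    """
    try:
        return json.loads(body)
    except ValueError:
        # JSON decoding errors are ValueError subclasses; show the start of
        # the offending body instead of a bare decoder traceback.
        raise RuntimeError(
            "API returned non-JSON content, starting with: %r" % body[:200]
        )


print(check_api_url("http://localhost/mediawiki-1.21.1/"))
print(check_api_url("http://localhost/mediawiki-1.21.1/api.php"))
```

A body such as an HTML error page would then fail with a message showing the first part of that page, rather than "Expecting value: line 1 column 1 (char 0)".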

projectgus commented 8 years ago

@malfonsi I just added a --verbose option that will print the exact invalid content if non-JSON content is returned. Can you rerun with this option and see what it prints?

(It may be a lot of output depending on what the wiki is doing!)

malfonsi commented 8 years ago

@projectgus Thanks for the extra command option. It helped to narrow down the source of the problem.

Basically the script crashes on the first page containing this special sequence of characters (the extra information on the use of these characters comes from googling):

UTF-8 encoding (hex) | HTML entity | Character name
0xE2,0x80,0x8E | &lrm; | Left-to-right mark

(sorry if the table above does not show right)

As an aside, the funny thing is that this sequence is really present in the wiki source (annoyingly invisible in simple text editors; I revealed it by copying and pasting the text into Emacs and switching to ASCII encoding). I have no clue how we inserted this special sequence into our pages, maybe by copying and pasting from other web pages. End of the aside.

Anyway, is there any option to catch and digest this or similar sequences (ignoring them would be fine for me)? I have only basic knowledge of Python syntax and I am not familiar with any of the modules used, but maybe you can point me to the module that does the conversion so that I can try to go through the code (I have found a few suggestions on Stack Overflow by googling the error message).
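
As an alternative to the Emacs trick, invisible characters like this one can be located with a few lines of Python. The sketch below is only an illustration (written for Python 3): `find_format_chars` is a made-up helper name and the sample string is invented. It relies on the fact that U+200E belongs to the Unicode "Cf" (format) category.

```python
import unicodedata

LRM = u"\u200e"  # left-to-right mark

# The three raw bytes and the official character name.
print(LRM.encode("utf-8"))
print(unicodedata.name(LRM))


def find_format_chars(text):
    """List (index, code point, name) for each Unicode 'Cf' (format) character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Invented sample: a file name with a trailing left-to-right mark.
sample = u"File:report.pdf\u200e"
for hit in find_format_chars(sample):
    print(hit)
```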

Thanks in advance for any additional hint.

P.S. I attach below the error message in case my interpretation was completely wrong:

Converting 22 revisions of page '02 April 2012'...
Traceback (most recent call last):
  File "yamdwe.py", line 93, in <module>
    main()
  File "yamdwe.py", line 65, in main
    exporter.write_pages(pages)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/dokuwiki.py", line 97, in _convert_page
    content = wikicontent.convert_pagecontent(full_title, revision["*"])
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 68, in convert_pagecontent
    result = convert(root, context, False)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 90, in convert
    return convert_children(node, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 94, in convert
    return convert_children(node, context) + "\n"
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 204, in convert
    converted_list = convert_children(itemlist, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 210, in convert
    item_content = convert_children(item, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 281, in convert
    return convert_children(node, context)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/localscratch/TestMediawikiToDokuwiki/yamdwe/wikicontent.py", line 193, in convert
    return "{{%s%s}}" % (filename, caption)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)

projectgus commented 8 years ago

Hi @malfonsi,

That's a very helpful update. Weird that the error message seems to have changed totally now, but well done tracking down the specific problem with that Unicode sequence.

I've just made an update that may solve this problem by allowing link filenames and captions to be unicode instead of just ASCII (Python 2 is really painful in its ASCII vs Unicode handling, I wish we could use Python 3 for yamdwe but the Mediawiki library mwlib doesn't support it yet.)
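
To illustrate the ASCII vs Unicode pain: in Python 2, mixing a byte string with a unicode string in one formatting expression makes the interpreter decode the bytes as ASCII behind the scenes, which fails on any byte above 0x7F. The sketch below reproduces the same codec error explicitly so that it also runs on Python 3. The file name and caption are invented, and this is not the code from wikicontent.py.

```python
# Invented example: a file name followed by U+200E, as UTF-8 bytes.
raw_filename = b"File:report.pdf\xe2\x80\x8e"

# What Python 2 effectively attempts when bytes meet unicode text.
try:
    raw_filename.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
    print(ascii_error)

# Decoding explicitly as UTF-8 before formatting avoids the implicit step.
filename = raw_filename.decode("utf-8")
caption = u"|A caption"
link = u"{{%s%s}}" % (filename, caption)
print(repr(link))
```

The general rule is to decode bytes to text once, at the point where they enter the program, and to work with text everywhere after that.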

Please let me know if that update improves things.

Angus

malfonsi commented 8 years ago

Dear @projectgus,

Sorry for the long silence, but I tried something by myself. The patch does not work, but I am now convinced that there is something wrong in my virtualenv (or in the Debian 8 environment altogether).

In fact I added these two lines to the main script "yamdwe.py", just after the main def:

def main():
    print("Start of the program")
    print("Stupid sequence \342\200\216")

and I get the error:

File "yamdwe.py", line 21, in main
    print("Stupid sequence \342\200\216")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-18: ordinal not in range(128)

I made a shorter script "test.py":

#from __future__ import print_function, unicode_literals, absolute_import, division
#import argparse, sys, codecs, locale, getpass, datetime
#from pprint import pprint
#import mediawiki, dokuwiki, wikicontent

def main():
    print("Start")
    print(u"Stupid string\u200e")

    stupid = "Stupid string\342\200\216"
    stupid2 = stupid.decode("utf-8")
    print ( stupid2[13] )
    print( len (stupid2.replace(u"\u200e","") ) )

main()

which works AS LONG AS I KEEP THE IMPORTS OF ALL THE MODULES COMMENTED OUT. You can see there my attempt to get rid of this special HTML entity, because it cannot be part of a filename (if you go back to my previous post, you will see that the initial problem occurred while getting the name of the uploaded media).

The error seems to come from one of the modules (again, I am not a Python expert, I can only guess):

  File "test.py", line 14, in main
    stupid2 = stupid.decode("utf-8")
  File "/localscratch/TestMediawikiToDokuwiki/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-15: ordinal not in range(128)

Narrowing down the problem, it seems connected to the import of unicode_literals, i.e. this script works as expected if I do not import it.

Anyway:

OPTION A: Do you suggest recreating the virtualenv? Please take into account that I created it following the "Alternative Installation for Debian/Ubuntu Linux" directions (the system is Debian 8). I can try the full virtualenv installation (the paragraph just before), but not before next year.

OPTION B: Do you have any suggestion for working around this problem with the current virtualenv? In the end this conversion is a one-off, so a quick and dirty solution would also be fine.
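
One likely explanation, consistent with the positions 13-15 reported in the error: with unicode_literals, the literal "Stupid string\342\200\216" is no longer a byte string. Each octal escape becomes a separate character (U+00E2, U+0080, U+008E), and in Python 2 calling .decode() on unicode text first encodes it with the ASCII codec, which fails on exactly those three characters. The sketch below shows the difference using Python 3, where plain string literals behave like Python 2 literals under unicode_literals. It illustrates the mechanism and is not a patch for yamdwe.

```python
# Text literal: every octal escape is one character, not one byte.
as_text = "Stupid string\342\200\216"
print(len(as_text))
print([hex(ord(c)) for c in as_text[13:]])

# Byte-string literal: the escapes stay raw bytes, so UTF-8 decoding works
# and yields a single U+200E character at index 13.
as_bytes = b"Stupid string\342\200\216"
decoded = as_bytes.decode("utf-8")
print(len(decoded), hex(ord(decoded[13])))

# Text that was built the wrong way can be repaired by mapping each
# character back to one byte (Latin-1) and decoding those bytes as UTF-8.
repaired = as_text.encode("latin-1").decode("utf-8")
print(repaired == decoded)
```

So under unicode_literals the test script would need a b"..." prefix on that literal, or the u"\u200e" escape used directly, for the decode step to make sense.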

Thanks again, Matteo

malfonsi commented 8 years ago

Something to add:

I found this page that may be useful: http://python-future.org/unicode_literals.html

but with my python experience I need some time to digest it...

malfonsi commented 8 years ago

I managed to solve the issue. Basically:

1) I removed the "unicode_literals" import everywhere. I am not sure this is strictly needed, but it made the workaround in (2) easier for me to implement.
2) For each page's content, I remove the special character '\u200e' using the string replace method, immediately after reading the content and before mwlib starts interpreting it.

I think this is very specific to my case, and I admit that I have not really understood the real source of the problem; I have just found a workaround for a tool that I need only once. Therefore, @projectgus, you are probably not interested in including any modification from my side, but let me know if you think otherwise.
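
A minimal sketch of the replace step described above, assuming the page content is already available as text. The function name `strip_lrm` and the sample content are invented, and where exactly to call it inside yamdwe is not shown here.

```python
LRM = u"\u200e"  # left-to-right mark, the invisible character in question


def strip_lrm(content):
    """Return the page content with every left-to-right mark removed."""
    return content.replace(LRM, u"")


# Invented sample of wiki source containing the invisible character.
page_source = u"[[File:report.pdf\u200e|A caption]]"
cleaned = strip_lrm(page_source)
print(len(page_source), len(cleaned))
```

Other invisible marks, such as the right-to-left mark U+200F, could be removed the same way if they turn up.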

pascalgross commented 8 years ago

Hey. I had the same error with the ascii codec stuff. Executing the following helped:

export LC_ALL=C
pip install --upgrade setuptools

I don't know what these commands do to Python, so be careful.
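
Since locale variables such as LC_ALL influence which encodings Python picks up, it can help to print them before and after changing the environment. The sketch below only reports what the interpreter sees. The function name `describe_encodings` is made up, and how a given setting affects yamdwe on a particular system is not something this snippet can tell.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings the running interpreter has picked up."""
    return {
        # May be None when output is redirected or captured.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LC_ALL, LANG, ...).
        "preferred": locale.getpreferredencoding(False),
        # Codec used for implicit conversions ('ascii' on Python 2).
        "default": sys.getdefaultencoding(),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-10s %s" % (name, value))
```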