Wrong TOC title encoding in some DjVu files

koreader / kindlepdfviewer

(DEPRECATED, please use KOReader instead) A PDF (plus DJVU, ePub, TXT, CHM, FB2, HTML...) viewer made for e-ink framebuffer devices, using muPDF, djvulibre, crengine

GNU General Public License v3.0


Wrong TOC title encoding in some DjVu files #87

Open houqp opened 12 years ago

houqp commented 12 years ago

eLiNK reports that the TOC of one of his DjVu books is not displayed correctly. I uploaded the test file to our Dropbox share point and renamed it to toc_title_encoding.djvu.

Some TOC titles in this book contain German umlauts like ä, which are not displayed in our reader. After some debugging, I noticed that those umlauts are not encoded in UTF-8. For instance, ä is represented as 0xE4, not 0xC3,0xA4. To double check, I opened this book with Okular (which also uses libdjvulibre) and it behaves the same way.

However, I tried other books with German umlauts in the TOC and those umlauts are encoded correctly. So I suspect the book is corrupted.

But eLiNK says that when he opens this book in WinDjView, those umlauts are handled correctly. WinDjView is also free software based on libdjvulibre, but I haven't had time to look into it yet.
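To make the byte-level difference concrete, here is a small sketch in Python (chosen only for illustration; the reader itself is written in Lua and C). It shows that ä is the single byte 0xE4 in ISO-8859-1 but the byte pair 0xC3 0xA4 in UTF-8, and that a title containing a bare 0xE4 fails strict UTF-8 decoding. The helper name `is_valid_utf8` and the sample word are made up for this example and do not come from the test file.

```python
# Illustration only: how "ä" (U+00E4) is encoded in Latin-1 vs. UTF-8.
a_umlaut = "\u00e4"

latin1_bytes = a_umlaut.encode("latin-1")  # one byte: 0xE4
utf8_bytes = a_umlaut.encode("utf-8")      # two bytes: 0xC3 0xA4

print(latin1_bytes.hex(), utf8_bytes.hex())


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample title "Käse", once as Latin-1 bytes, once as UTF-8.
latin1_title = b"K\xe4se"
utf8_title = "K\u00e4se".encode("utf-8")

print(is_valid_utf8(latin1_title), is_valid_utf8(utf8_title))
```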

traycold commented 12 years ago

Hi, here is a quick survey of how various readers render umlauts in the TOC:

JavaDjVu: [pure java] umlauts displayed correctly;
WinDjView: [based on djvulibre] umlauts displayed correctly;
djview: [based on djvulibre, actually official djvulibre reader] umlauts NOT displayed correctly;

Hope this helps.

houqp commented 12 years ago

Wow, thanks for the survey :)

Now it is very likely that the bug lies in ddjvuapi.cpp. WinDjView uses low-level APIs from djvulibre, not those defined in ddjvuapi.cpp.

dpavlin commented 12 years ago

In Perl, we usually turn on the magic UTF-8 flag and strings become UTF-8. It seems that Lua doesn't care about encodings, according to http://lua-users.org/wiki/LuaUnicode, and judging from the byte values, 0xE4 is ISO-8859-1 (Latin-1) encoding.

We are using ddjvu_document_get_outline from libdjvu/ddjvuapi.cpp, but the only encoding-specific call in that API is ddjvu_document_create_by_filename_utf8, so we can assume that libdjvu always returns latin1 (or can we?).

It seems to me that this is a DjVu encoder bug. I would guess that the encoder was running on a machine with a Latin-1 locale and didn't do any conversion.

There are several UTF-8 encoding references in DjVu3Spec.djvu (which I read using kindlepdfviewer on my laptop, since it's the only reader with DjVu support that I have installed :-) so I would suggest that all text returned from DjVu should be in UTF-8 encoding.
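The suspected encoder behaviour can be simulated in a few lines of Python. This only sketches the hypothesis and is not a reproduction of any real DjVu encoder: the function names and the sample title are invented for the example. It writes a title using the system encoding without converting it, then shows what a reader that assumes UTF-8 would display.

```python
def write_title_without_conversion(title: str, system_encoding: str) -> bytes:
    """Hypothetical encoder step: store the title in whatever
    encoding the host system uses, with no conversion to UTF-8."""
    return title.encode(system_encoding)


def display_as_utf8(raw: bytes) -> str:
    """Hypothetical reader step: assume the stored bytes are UTF-8 and
    substitute U+FFFD for anything that does not decode."""
    return raw.decode("utf-8", errors="replace")


title = "\u00dcberblick"  # "Überblick", an invented sample title

# Encoder running under a Latin-1 locale: the umlaut becomes one invalid byte.
from_latin1_host = write_title_without_conversion(title, "latin-1")
# Encoder running under a UTF-8 locale: the bytes are already what the spec expects.
from_utf8_host = write_title_without_conversion(title, "utf-8")

print(display_as_utf8(from_latin1_host))
print(display_as_utf8(from_utf8_host))
```

If this hypothesis holds, it would also explain why a TOC created on a system with a UTF-8 locale comes out fine.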

houqp commented 12 years ago

> We are using ddjvu_document_get_outline from libdjvu/ddjvuapi.cpp but only encoding specific call in that API is ddjvu_document_create_by_filename_utf8 so we can assume that libdjvu always returns latin1 (or can we?).

I tried to create a TOC with ä (using djvused) and ddjvu_document_get_outline does return correct UTF-8. I also tried to create a TOC with Asian characters and the encoding is correct too. But my system uses a UTF-8 locale.

The spec says TOC text should be stored in UTF-8. Probably the TOC text in this file is stored in latin1?

> (which I read using kindlepdfview on my laptop, since it's only reader with djvu support which I have installed :-)

LOL

> so I would suggest that all encoding returned from djvu should be in utf-8 encoding.

We might need to hack ddjvu_document_get_outline a little bit.
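One possible shape for such a hack, sketched in Python rather than in the C++ of ddjvuapi.cpp: try strict UTF-8 first and fall back to Latin-1 only when the bytes are not valid UTF-8. The function name `repair_toc_title` is hypothetical, and choosing Latin-1 as the fallback is an assumption based on the 0xE4 byte observed in this one file.

```python
def repair_toc_title(raw: bytes) -> str:
    """Decode a TOC title, preferring UTF-8 as the spec requires.

    Assumption: titles that are not valid UTF-8 were written as
    ISO-8859-1 (Latin-1). Latin-1 maps every byte to a code point,
    so the fallback never fails, but it is a guess for files that
    actually use some other legacy encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Invented sample titles, not taken from the test file.
print(repair_toc_title(b"Einf\xfchrung"))             # Latin-1 bytes
print(repair_toc_title("Einführung".encode("utf-8"))) # already UTF-8
print(repair_toc_title("目录".encode("utf-8")))        # non-Latin UTF-8 is untouched
```

The check is a heuristic: a Latin-1 string whose bytes happen to form valid UTF-8 would be decoded as UTF-8. That is rare for real titles, and well-formed files keep their current behaviour because valid UTF-8 is never touched.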