rupa / epub

python/curses cli epub reader

Title finding for TOC does not always work correctly #22

Open keithstellyes opened 7 years ago

keithstellyes commented 7 years ago

Sometimes, when using epub.py to open some .epub files, the TOC view displays only numbers instead of the chapter titles.

Does anybody have a file it works with?

I'm still working on isolating the cause of this.

Boruch-Baum commented 7 years ago

A first step is to see how the TOC appears in another epub reader, e.g. fbreader. This is because some epub files don't have named TOC entries. In the file that I'm using to debug other issues in epub, some chapters are named and some aren't! If that's the case for your epub file, please let us know the result, and close the issue.

Boruch-Baum commented 7 years ago

I've been looking at the code just now and am pretty sure I see where the problem is. The snippet currently beginning on line 152 defines a variable 'y' (please use helpful variable names), which is built up into a list of document nodes based on the manifest in the file content.opf. However, the list of document nodes is a superset of the set of table of contents nodes, i.e. some nodes are intentionally only reachable 'laterally' from their preceding or subsequent node, not from the table of contents. You can see this by comparing the elements of 'y' to the elements of the file toc.ncx.
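To make the superset point concrete, here is a minimal sketch in Python 3 that lists the documents an epub declares but its toc.ncx never names. It is not taken from epub.py: the sample XML, the function names, and the choice to order documents by the spine are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = {"ncx": NCX_URI}

# Hypothetical sample files: three documents, only one named in the TOC.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="cover" href="cover.html" media-type="application/xhtml+xml"/>
    <item id="ch1" href="ch1.html" media-type="application/xhtml+xml"/>
    <item id="ch1b" href="ch1b.html" media-type="application/xhtml+xml"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover"/>
    <itemref idref="ch1"/>
    <itemref idref="ch1b"/>
  </spine>
</package>"""

SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="n1" playOrder="1">
      <navLabel><text>Chapter One</text></navLabel>
      <content src="ch1.html"/>
    </navPoint>
  </navMap>
</ncx>"""


def spine_hrefs(opf_xml):
    """Return document hrefs in reading order (manifest ids resolved via the spine)."""
    root = ET.fromstring(opf_xml)
    manifest = {item.get("id"): item.get("href")
                for item in root.findall("opf:manifest/opf:item", OPF_NS)}
    return [manifest[ref.get("idref")]
            for ref in root.findall("opf:spine/opf:itemref", OPF_NS)]


def toc_titles(ncx_xml):
    """Map each document href named in the NCX to its first title."""
    root = ET.fromstring(ncx_xml)
    titles = {}
    for point in root.iter("{%s}navPoint" % NCX_URI):
        label = point.find("ncx:navLabel/ncx:text", NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        if label is None or content is None:
            continue
        src = content.get("src").split("#")[0]  # drop any fragment identifier
        titles.setdefault(src, label.text)
    return titles


def untitled_documents(opf_xml, ncx_xml):
    """Documents that are in the reading order but have no TOC entry."""
    titles = toc_titles(ncx_xml)
    return [href for href in spine_hrefs(opf_xml) if href not in titles]
```

Running `untitled_documents(SAMPLE_OPF, SAMPLE_NCX)` on the sample reports the cover and the second half of chapter one, which are exactly the kind of entries that would show up without a title.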

The first part of the solution is trivial. Comment out these lines (currently 155-156):

    else:
        yield (u'', section.encode('utf-8').strip())

However, you can't do that until a method to laterally navigate the nodes is coded, because you'll lose all access to nodes not directly linked from the toc. For that, it seems some re-arranging of the code will be necessary, because what I see is an ordered variable 'chaps' for nodes of the toc, but not the superset of all nodes.
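One possible shape for that re-arrangement, as a rough Python 3 sketch: keep the full ordered list of documents for previous/next navigation, and derive the TOC as a filtered view that points back into it. The names `build_toc` and `step` are hypothetical and do not correspond to anything in epub.py.

```python
def build_toc(documents, titles):
    """Return (title, index) pairs for the documents that have a TOC title.

    documents: every document href, in reading order (the superset).
    titles:    mapping of href -> title for the TOC entries only.
    """
    return [(titles[href], index)
            for index, href in enumerate(documents)
            if href in titles]


def step(documents, index, delta):
    """Move laterally through all documents, clamped to the valid range."""
    return max(0, min(len(documents) - 1, index + delta))
```

With this split, the TOC view only ever shows titled entries, while paging forward from a TOC entry still reaches the untitled documents that follow it.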

s-victor commented 6 years ago

Hi, it seems that the problem is related to the path/folder structure of the epub file. Type 1: if both ".opf" and ".ncx" are located in the epub file's root dir, the TOC displays correctly. Type 2: if ".opf" and ".ncx" are located in a subfolder (usually named "OEBPS"), the TOC does not display and only numbers are shown. (For reference, epub files from "Project Gutenberg" are created with this kind of folder structure, and thus suffer from this issue.)

I found a way to make the TOC display correctly for both types. Change this line (147):

    x[d['id']] = '{0}{1}'.format(basedir, d['href'])

To:

    x[d['id']] = '{1}'.format(basedir, d['href'])

It seems the original line gives the correct path for href link for both types, but not a correct ".ncx" path for Type 2 epub, and thus couldn't display TOC on Type 2 epub.

However, after the change, while TOC now displays correctly for both types, it breaks the links inside Type 2 epub, with an error msg KeyError: "There is no item named 'Text/index_01.html' in the archive". (Where the correct path should be 'OEBPS/Text/index_01.html')
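The KeyError can be reproduced outside epub.py. The following Python 3 sketch builds a tiny Type 2 archive in memory and shows that the href from the manifest has to be joined with the directory of the ".opf" file before it can be read from the zip. The archive contents and the helper names (`make_sample_epub`, `archive_path`, `read_document`) are made up for illustration.

```python
import io
import posixpath
import zipfile


def make_sample_epub():
    """Build an in-memory archive laid out like a Type 2 epub (files under OEBPS/)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("OEBPS/content.opf", "<package/>")
        zf.writestr("OEBPS/Text/index_01.html", "<html>chapter</html>")
    buf.seek(0)
    return zipfile.ZipFile(buf)


def archive_path(opf_path, href):
    """Resolve a manifest href relative to the directory holding the .opf file."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(opf_path), href))


def read_document(zf, opf_path, href):
    """Read a document from the archive using its resolved path."""
    return zf.read(archive_path(opf_path, href))
```

Calling `zf.read('Text/index_01.html')` on the sample archive raises the same KeyError as above, while `read_document(zf, 'OEBPS/content.opf', 'Text/index_01.html')` succeeds. For a Type 1 epub the directory part is empty, so the href passes through unchanged.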

So far I was unable to solve this. Hope someone will find a solution.

s-victor commented 6 years ago

Good news! I finally found a solution to this issue. The fix involves the 3 changes below:

Change line 147:

    x[d['id']] = '{0}{1}'.format(basedir, d['href'])

To:

    x[d['id']] = d['href']

Change line 171:

    yield (z[section].encode('utf-8'), section.encode('utf-8'))

To:

    yield (z[section].encode('utf-8'), '{0}{1}'.format(basedir, section.encode('utf-8')))

Change line 173:

    yield (u'', section.encode('utf-8').strip())

To:

    yield (u'', '{0}{1}'.format(basedir, section.encode('utf-8').strip()))

Both types of epub files (as described in my previous reply) now show the TOC correctly. Please let me know if it works. Edit: added the change to line 173 (which has the same path problem); now all TOC and page navigation should work correctly.
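The idea behind the three changes, as I understand it, is to use the bare href as the key for the title lookup and to add the basedir prefix only when producing the path that is read from the archive. Here is a standalone Python 3 sketch of that idea. It is not the patched epub.py code, and `toc_entries`, `spine`, and `titles` are hypothetical names.

```python
def toc_entries(basedir, spine, titles):
    """Yield (title, archive_path) pairs for every document in reading order.

    basedir: directory of the .opf file inside the archive, e.g. "OEBPS/" or "".
    spine:   document hrefs exactly as written in the manifest.
    titles:  mapping of href -> title taken from toc.ncx.
    """
    for href in spine:
        # Look the title up by the unprefixed href, then prefix for the archive.
        yield titles.get(href, ""), "{0}{1}".format(basedir, href)
```

If the prefix were applied before the lookup, a key such as "OEBPS/Text/index_01.html" would not match the "Text/index_01.html" entry in `titles`, which would be consistent with the missing titles on Type 2 files.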

s-victor commented 6 years ago

Just found out that someone in 2014 already fixed this issue in his fork, which is far better... so please ignore my previous replies.

Link to his fix: https://github.com/xuxiaodong/epub/commit/0469ac1153b357578a56f0493f3bd9f57acf82ba