psiegman / epublib

a java library for reading and writing epub files
http://www.siegmann.nl/epublib
1.05k stars 316 forks source link

cannot read epub with ncx having as dtd "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd" #2

Closed petrul closed 13 years ago

petrul commented 13 years ago

Hello,

trying to read an epub, whose fb.ncx resource has the following dtd:

<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN" "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">

The defined EntityResolver tries to use an internal resource instead of the remote url, which is good, but the given dtd from daisy.org is not cached. I don't know the epub conventions, what is compulsory and what not, but I suppose that dtd should be cached too.

The Epub is The Man Who Was Thursday: a Nightmare, by Chesterton, Gilbert Keith from feedbooks.

Regards; Petru

psiegman commented 13 years ago

Thanks for noticing this. Fixed in the new release. Could you see if it works for you now ?

petrul commented 13 years ago

Hmm, I'm trying to see how can I use and modify epublib without keeping modifications in my own world. Actually I had done a couple of minor modifications, I would hate to lose them, I wonder if you would consider them worth merging. I am currently trying to use epublib in an online epub reader so reading functionality interests me most. I may contribute some more if I need to add functionalities etc. The problem is I never used git before, I have no idea for now how to make a patch or to commit in case you want to merge my modifications. I'll paste them here for now:

diff --git a/src/main/java/nl/siegmann/epublib/epub/EpubProcessor.java b/src/main/java/nl/siegmann/epublib/epub/EpubProcessor.java
index 7a26d78..b94469c 100644
--- a/src/main/java/nl/siegmann/epublib/epub/EpubProcessor.java
+++ b/src/main/java/nl/siegmann/epublib/epub/EpubProcessor.java
@@ -34,8 +34,12 @@ public class EpubProcessor {
            } else {
                resourcePath = previousLocation + systemId.substring(systemId.lastIndexOf('/'));
            }
-           InputStream in = EpubProcessor.class.getClassLoader().getResourceAsStream(resourcePath);
-           return new InputSource(in);
+           if (this.getClass().getClassLoader().getResource(resourcePath) == null) {
+               throw new RuntimeException("remote resource is not cached : [" + systemId + "] cannot continue");
+           } else {
+               InputStream in = EpubProcessor.class.getClassLoader().getResourceAsStream(resourcePath);
+               return new InputSource(in);
+           }
        }
    };

diff --git a/src/main/java/nl/siegmann/epublib/epub/NCXDocument.java b/src/main/java/nl/siegmann/epublib/epub/NCXDocument.java
index e01491b..a7525cc 100644
--- a/src/main/java/nl/siegmann/epublib/epub/NCXDocument.java
+++ b/src/main/java/nl/siegmann/epublib/epub/NCXDocument.java
@@ -110,7 +110,7 @@ public class NCXDocument {
            List
sections = readSections(navmapNodes, xPath, book); book.setTocSections(sections); } catch (Exception e) { - log.error(e); + log.error(e, e); } } diff --git a/src/test/java/nl/siegmann/epublib/epub/EpubReaderTest.java b/src/test/java/nl/siegmann/epublib/epub/EpubReaderTest.java index 9160a40..8e88db8 100644 --- a/src/test/java/nl/siegmann/epublib/epub/EpubReaderTest.java +++ b/src/test/java/nl/siegmann/epublib/epub/EpubReaderTest.java @@ -2,10 +2,14 @@ package nl.siegmann.epublib.epub; import java.io.ByteArrayInputStream; import java.io.ByteArrayOutputStream; +import java.util.List; import junit.framework.TestCase; import nl.siegmann.epublib.domain.Book; import nl.siegmann.epublib.domain.InputStreamResource; +import nl.siegmann.epublib.domain.Section; + +import org.apache.log4j.Logger; public class EpubReaderTest extends TestCase { @@ -48,4 +52,30 @@ public class EpubReaderTest extends TestCase { assertTrue(false); } } + + public void testReadChesterton() throws Exception { + Book epub = new EpubReader().readEpub(this.getClass().getResourceAsStream("chesterton-thursday-nightmare.epub")); + { + List
tocSections = epub.getTocSections(); + assertNotNull(tocSections); + assertEquals(19, tocSections.size()); + for (Section s : tocSections) { + assertNotNull(s.getTitle()); + assertTrue(s.getTitle().length() > 0); + } + } + + { + // is this intended behaviour? (22 spine sections as opposed to 19 toc sections, and all spine titles are null) + List
spineSections = epub.getSpineSections(); + assertNotNull(spineSections); + assertEquals(22, spineSections.size()); + for (Section s : spineSections) { + assertNull(s.getTitle()); + } + } + + } + + static Logger LOG = Logger.getLogger(EpubReaderTest.class); }
psiegman commented 13 years ago

Hi Petrul,

Thanks for the patch.Patches are always welcome. Tomorrow I will have some time to look at it.

In the meantime: About the 22 spine sections vs. the 19 toc sections: It could be a bug, but it is also possible that this book does have a different number of spine sections then toc sections. The spine is the reading order of the book. It has to contain every single page of the book in the right order. The table of contents is added for fast access to content. In theory you can have many table of content entries pointing at the same section, or spine sections not in the table of contents. An example of the latter would be a very long html file that is split up in manageable pieces. (on the epublib TODO list) All these pieces have to be in the spine, but the would not appear in the table of contents.

About the spine sections having null title: this is correct. As the spine is only the reading order, they don't need a title. The table of content items do. But if this leads to confusion this is at least a documentation bug :) Will think of something clever to fix this.

regards, Paul