orthros / dart-epub

Epub Reader and Writer for Dart
MIT License

EpubReader crashes when reading a book with the TOC referencing a file with unicode/utf8 characters #31

Closed ShadowJonathan closed 6 years ago

ShadowJonathan commented 6 years ago

This is probably a fault in the archive package, but I'm reporting it here first because it also affects this package:

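  for (var file in files) {
    if (verbose) print("Getting file ${file.path}");
    // Read the whole .epub into memory and parse it.
    EpubBook book =
        await EpubReader.readBook(new File.fromUri(file.uri).readAsBytesSync());
    // Interpolation avoids a String + null error when the title is missing.
    if (verbose) print("  ${book?.Title}");
    ...
  }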

After a while, this throws the following exception:

Getting file D:/Google Drive/Books\Complete Turnabout - Nenilein.epub
Unhandled exception:
Exception: EPUB parsing error: file OEBPS/Mentor-and-Protégé.html not found in archive.
#0      EpubContentFileRef.getContentFileEntry (package:epub/src/ref_entities/epub_content_file_ref.dart:47:7)
#1      EpubContentFileRef.getContentStream (package:epub/src/ref_entities/epub_content_file_ref.dart:34:30)
#2      EpubContentFileRef.readContentAsText (package:epub/src/ref_entities/epub_content_file_ref.dart:28:31)
<asynchronous suspension>
#3      EpubReader.readTextContentFiles.<anonymous closure> (package:epub/src/epub_reader.dart:100:45)
<asynchronous suspension>
#4      Future.forEach.<anonymous closure> (dart:async/future.dart:484)
#5      Future.doWhile.<anonymous closure> (dart:async/future.dart:526)
#6      _RootZone.runUnaryGuarded (dart:async/zone.dart:1316)
#7      _RootZone.bindUnaryCallbackGuarded.<anonymous closure> (dart:async/zone.dart:1355)
#8      _RootZone.runUnary (dart:async/zone.dart:1381)
#9      _FutureListener.handleValue (dart:async/future_impl.dart:129)
#10     Future._propagateToListeners.handleValueCallback (dart:async/future_impl.dart:638)
#11     Future._propagateToListeners (dart:async/future_impl.dart:667)
#12     Future._complete (dart:async/future_impl.dart:472)
#13     _SyncCompleter.complete (dart:async/future_impl.dart:51)
#14     _completeOnAsyncReturn (dart:async-patch/dart:async/async_patch.dart:292)
#15     EpubReader.readTextContentFiles.<anonymous closure> (package:epub/src/epub_reader.dart:102:5)
#16     _RootZone.runUnary (dart:async/zone.dart:1381)
#17     _FutureListener.handleValue (dart:async/future_impl.dart:129)
#18     Future._propagateToListeners.handleValueCallback (dart:async/future_impl.dart:638)
#19     Future._propagateToListeners (dart:async/future_impl.dart:667)
#20     Future._complete (dart:async/future_impl.dart:472)
#21     _SyncCompleter.complete (dart:async/future_impl.dart:51)
#22     _completeOnAsyncReturn (dart:async-patch/dart:async/async_patch.dart:292)
#23     EpubContentFileRef.readContentAsText (package:epub/src/ref_entities/epub_content_file_ref.dart:30:5)
#24     new Future.microtask.<anonymous closure> (dart:async/future.dart:200)
#25     _microtaskLoop (dart:async/schedule_microtask.dart:41)
#26     _startMicrotaskLoop (dart:async/schedule_microtask.dart:50)
#27     _runPendingImmediateCallback (dart:isolate-patch/dart:isolate/isolate_patch.dart:113)
#28     _RawReceivePortImpl._handleMessage (dart:isolate-patch/dart:isolate/isolate_patch.dart:166)

Changing the lookup at around line 40 of lib/src/ref_entities/epub_content_file_ref.dart to this:

ArchiveFile contentFileEntry = epubBookRef.EpubArchive().files.firstWhere(
    (ArchiveFile x) {
      print(x.name); // debug: log every entry name in the archive
      return x.name == contentFilePath;
    },
    orElse: () => null);

Reveals that the archive files are named like this:

OEBPS/Miseries.html
OEBPS/First-Investigations.html
OEBPS/Mentor-and-Protégé.html
OEBPS/Mayas-Objection.html
OEBPS/Alone.html

Here is the epub file in question (zipped, as GitHub doesn't accept .epub files): Complete Turnabout - Nenilein.zip

It was downloaded from http://ficsave.xyz

orthros commented 6 years ago

Yeah, looking at this, it appears that Archive is treating the file names as ASCII strings rather than as UTF-8 encoded strings.

OEBPS/Mentor-and-ProtÃ©gÃ©.html as opposed to OEBPS/Mentor-and-Protégé.html

I'm not sure if this is a problem with archive itself or the software that originally created the epub.

My understanding of zip files is that they write the file names in the headers as raw byte arrays, and it is up to the unzip library to handle the different encodings properly (but it has been a while since I've looked at the zip specification).
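
The mismatch described above can be reproduced without any archive at all. The following is a minimal sketch using only dart:convert; `misreadUtf8AsLatin1` is a hypothetical helper written for this illustration, not something from epub or archive. It assumes the epub's creator stored the name as UTF-8 and that the reader turned each byte into one character.

```dart
import 'dart:convert';

/// Hypothetical helper: encodes [name] as UTF-8, then misreads every byte
/// as one Latin-1 character, the way a byte-per-character reader would.
String misreadUtf8AsLatin1(String name) => latin1.decode(utf8.encode(name));

void main() {
  // '\u00E9' is 'é'; UTF-8 stores it as the two bytes 0xC3 0xA9.
  const name = 'OEBPS/Mentor-and-Prot\u00E9g\u00E9.html';

  final garbled = misreadUtf8AsLatin1(name);
  print(garbled); // OEBPS/Mentor-and-ProtÃ©gÃ©.html
  print(garbled == name); // false, so a lookup by the real name fails

  // Decoding the same bytes as UTF-8 round-trips the name.
  print(utf8.decode(utf8.encode(name)) == name); // true
}
```

Because the garbled string is a different string, an exact comparison against the path taken from the TOC cannot match, which is consistent with the "not found in archive" exception.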

ShadowJonathan commented 6 years ago

I just looked at the latest archive version number: it's above 2, while epub uses 1.0.33. Could this be a version incompatibility issue?

orthros commented 6 years ago

I'll upgrade locally and see if it fixes the issue.

Edit: What version were you seeing this in? The latest version of epub uses version ^2.0.0 of archive.

Edit2: Tried locally with archive 2.0.2 and the error persists.

brendan-duncan commented 6 years ago

I'll take a look at fixing archive to handle these filenames.

brendan-duncan commented 6 years ago

I pushed archive 2.0.3, which should fix this problem by reading filename strings as UTF-8.
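
For readers wondering what "reading filename strings as UTF-8" can look like, here is a rough sketch. It is not the actual change in archive 2.0.3; `decodeEntryName` is a hypothetical helper, and the Latin-1 fallback for names that are not valid UTF-8 is an assumption made for this example.

```dart
import 'dart:convert';

/// Hypothetical helper, not part of the archive package's API.
/// Decodes the raw bytes of a ZIP entry name.
String decodeEntryName(List<int> rawBytes) {
  try {
    // Strict decoding throws FormatException on bytes that are not valid UTF-8.
    return utf8.decode(rawBytes);
  } on FormatException {
    // Assumed fallback: treat each byte as one Latin-1 character.
    return latin1.decode(rawBytes);
  }
}

void main() {
  const name = 'OEBPS/Mentor-and-Prot\u00E9g\u00E9.html';

  // A name stored as UTF-8 decodes to the original string.
  print(decodeEntryName(utf8.encode(name)) == name); // true

  // A name stored as Latin-1 is not valid UTF-8, so the fallback is used.
  print(decodeEntryName(latin1.encode(name)) == name); // true
}
```

This is only a heuristic: a few Latin-1 byte sequences happen to be valid UTF-8 as well, so the try-then-fall-back order can still guess wrong for unusual names.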

ShadowJonathan commented 6 years ago

Making sure that the EpubReader class always reads as UTF-8 should fix this as well.

I have a small feeling it'll cause some other issues (like even weirder characters that only appear in Unicode), but it'll work for now, I guess.
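
One way to be less reliant on "always UTF-8" is the hint the ZIP format itself provides: bit 11 of an entry's general-purpose bit flag (the language encoding flag) says that the name and comment are UTF-8. The sketch below only shows the bit test; `utf8NameFlag` and `declaresUtf8Name` are hypothetical names for this example, and how the flags value would be obtained from archive is not shown because that depends on the library.

```dart
/// Bit 11 of a ZIP entry's general-purpose bit flag (language encoding flag).
const int utf8NameFlag = 1 << 11; // 0x0800

/// Hypothetical helper: true when the entry declares a UTF-8 name and comment.
bool declaresUtf8Name(int generalPurposeFlags) =>
    (generalPurposeFlags & utf8NameFlag) != 0;

void main() {
  print(declaresUtf8Name(0x0800)); // true
  print(declaresUtf8Name(0x0000)); // false
  print(declaresUtf8Name(0x0808)); // true (bit 3 is also set)
}
```

Not every tool that writes UTF-8 names sets this flag, so an unset bit does not prove the name is in a legacy encoding; a reader would probably still want some fallback guess for unflagged entries.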

orthros commented 6 years ago

Thanks @brendan-duncan for the new version! I just pub upgrade'd and the epub @ShadowJonathan attached opens now.

I'll add a few more tests to the library to put the attached epub through its paces. Thanks!