readium / readium-sdk

A C++ ePub renderer SDK
BSD 3-Clause "New" or "Revised" License
383 stars 164 forks source link

Treating XHTML as HTML breaks cross-platform CFI consistency #315

Closed winniequinn closed 1 year ago

winniequinn commented 6 years ago

Due to what I believe to be a behavior of libxml2, XHTML content from within an EPUB has the following meta tag added if no http-equiv="Content-Type" meta tag exists:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This causes browsers to interpret XHTML documents as HTML which in turn triggers browser fallback behavior that results in radically different DOMs across browsers. As such, the logic for calculating CFIs produces CFIs that are not portable across platforms.

Thank you to @JCCR for helping me debug this issue.

This issue is a bug.

Expected Behaviour

Readium produces CFIs that are portable across Android and iOS.

Observed behaviour

Readium does not.

Test file(s)

Here is a single XHTML file from an EPUB (source.xhtml) along with the DOMs produced for Android and iOS:

https://gist.github.com/winniequinn/b0865f110ffc821e2e583bfa51c30f17

Product

Native application (Readium SDK C++)

danielweck commented 6 years ago

If I remember correctly , the application/xhtml+xml HTTP content-type is correctly included in the response created by the Android and iOS apps' respective HTTP servers. Let me find the source code references...

danielweck commented 6 years ago

Android:

https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86 https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L85-L86

tmpMap.put("html", "application/xhtml+xml"); // FORCE
tmpMap.put("xhtml", "application/xhtml+xml"); // FORCE

https://github.com/readium/SDKLauncher-Android/blob/master/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229 https://github.com/readium/SDKLauncher-Android/blob/0bd7eb1ac88616a1eba3dfe27dd5d93c17d8ef44/SDKLauncher-Android/app/src/main/java/org/readium/sdk/android/launcher/util/EpubServer.java#L209-L229

        String mime = null;
        int dot = uri.lastIndexOf('.');
        if (dot >= 0) {
            mime = MIME_TYPES.get(uri.substring(dot + 1).toLowerCase());
        }
        if (mime == null) {
            mime = "application/octet-stream";
        }

        ManifestItem item = pckg.getManifestItem(uri);
        String contentType = item != null ? item.getMediaType() : null;
        if (!mime.equals("application/xhtml+xml")
                && !mime.equals("application/xml") // FORCE
                && contentType != null && contentType.length() > 0) {
            mime = contentType;
        }

        PackageResource packageResource = pckg.getResourceAtRelativePath(uri);

        boolean isHTML = mime.equals("text/html")
|| mime.equals("application/xhtml+xml");
danielweck commented 6 years ago

iOS:

https://github.com/readium/readium-sdk/blob/master/Platform/Apple/RDServices/Main/RDPackageResourceResponse.m#L57-L66 https://github.com/readium/readium-sdk/blob/3ba644f7dee27d29a5c45c4dee6cfabcca4df52d/Platform/Apple/RDServices/Main/RDPackageResourceResponse.m#L57-L66

- (NSDictionary *)httpHeaders {
    if(m_resource.relativePath) {
        NSString* ext = [[m_resource.relativePath pathExtension] lowercaseString];
        if([ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"]) {
            return [NSDictionary dictionaryWithObject:@"application/xhtml+xml" forKey:@"Content-Type"]; // FORCE
        }
        else if([ext isEqualToString:@"xml"]) {
            return [NSDictionary dictionaryWithObject:@"application/xml" forKey:@"Content-Type"]; // FORCE
        }
}

https://github.com/readium/readium-sdk/blob/master/Platform/Apple/RDServices/Main/RDPackageResourceConnection.m#L275-L302 https://github.com/readium/readium-sdk/blob/3ba644f7dee27d29a5c45c4dee6cfabcca4df52d/Platform/Apple/RDServices/Main/RDPackageResourceConnection.m#L275-L302

        NSString* ext = [[path pathExtension] lowercaseString];
        bool isHTML = [ext isEqualToString:@"xhtml"] || [ext isEqualToString:@"html"] || [resource.mimeType isEqualToString:@"application/xhtml+xml"] || [resource.mimeType isEqualToString:@"text/html"];

        if (isHTML) {
            NSString * FALLBACK_HTML = @"<?xml version=\"1.0\" encoding=\"UTF-8\"?><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>HTML READ ERROR</title></head><body>ERROR READING HTML BYTES!</body></html>";

            NSData *data = [resource readDataFull];
            if (data == nil || data.length == 0)
            {
                data = [FALLBACK_HTML dataUsingEncoding:NSUTF8StringEncoding];
            }

            NSXMLParser *xmlparser = [[NSXMLParser alloc] initWithData:data];
            [xmlparser setShouldResolveExternalEntities:NO];
            BOOL isXhtmlWellFormed = [xmlparser parse];

            if (isXhtmlWellFormed == NO)
            {
                NSLog(@"XHTML PARSE ERROR: %@", xmlparser.parserError);

                // Can be used to check / debug encoding issues
                NSString * dataStr = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
                //NSLog(@"XHTML SOURCE: %@", dataStr);

                // FORCE HTML WebView parser
                //@"application/xhtml+xml"
                resource.mimeType = @"text/html";
}
danielweck commented 6 years ago

As you can see, in iOS there is a fallback to HTML for malformed XHTML documents ... yes, there exists EPUBs authored with not-well-formed XML-HTML :( But otherwise, the application/xhtml+xml content type should be correctly passed into HTTP responses via the appropriate headers ... well, at least in the ReadiumSDK "launcher apps".

I am not sure about libxml's responsibility here, are you sure it isn't an implementation trick specific to UIWebview / WKWebView? (on iOS)

jccr commented 6 years ago

The mimetype served through HTTP is fine. The problematic bit is the html http equiv meta tag that has the incorrect mimetype. It appears that for this case it is always having the value of text/html.

The XML lib used by the ReadiumSDK logic seems to be responsible.

On Thu, Jun 21, 2018, 8:12 AM Daniel Weck, notifications@github.com wrote:

As you can see, in iOS there is a fallback to HTML for malformed XHTML documents ... yes, there exists EPUBs authored with not-well-formed XML-HTML :( But otherwise, the application/xhtml+xml content type should be correctly passed into HTTP responses via the appropriate headers ... well, at least in the ReadiumSDK "launcher apps".

I am not sure about libxml's responsibility here, are you sure it isn't an implementation trick specific to UIWebview / WKWebView? (on iOS)

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/readium/readium-sdk/issues/315#issuecomment-399138928, or mute the thread https://github.com/notifications/unsubscribe-auth/AE5RbFqVt2u7V18wl9s9zfA-rtxIMtWkks5t-7fXgaJpZM4UwC6B .

danielweck commented 6 years ago

Could you please share the EPUB you are using to reproduce the issue? I would like to run a quick test with SDKLauncher-OSX.

danielweck commented 6 years ago

There is no DOM parsing (libxml or not) involved when the iOS / Android / OSX platform-native code (not C++, but Java or ObjectiveC) pre-processed the HTML documents in order to inject the navigator.epubReadingSystem stuff (it is all regular expression / plain text) ... so I do no understand why / how the <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> markup would be added.

IMO, the next best course of action is to reproduce (if possible) the bug with the SDKLauncher-iOS/Android/OSX apps, because it may be that your app is built with a different logic.

jccr commented 6 years ago

@winniequinn Let's try to reproduce this using the vanilla launchers. Could you share the EPUB that was used in your testing?

jccr commented 6 years ago

@danielweck I couldn't reproduce this using various books, including a similar Alice book like winnie's sample. I am using the Android launcher.

winniequinn commented 1 year ago

This repository seems dormant, so I hope no one minds me doing a little housecleaning and closing out this stale issue I reported. @danielweck, feel free to reopen if you think it's worth keeping this active!