messense / mupdf-rs

Rust binding to mupdf
GNU Affero General Public License v3.0

How to include image in `Page`'s `to_html` or `to_xhtml` method? #69

Open LazyGeniusMan opened 1 year ago

LazyGeniusMan commented 1 year ago

When I try converting a page that contains an image to HTML or XHTML, the image is not included. With this code:

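use mupdf::{Document, Page};
use std::fs;

fn main() {
    // Open the EPUB and load the page that contains the figure.
    let doc: Document = Document::open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub").unwrap();
    let page: Page = doc.load_page(341).unwrap();
    // Convert the page to HTML; the image is missing from this output.
    let html: String = page.to_html().unwrap();

    // Write the result to disk and fail loudly if the write does not succeed.
    fs::write("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\rs-test.html", html).unwrap();
}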

I got this result: image

There should be an image above the "Figure 10.3" text.

I tried to do the same thing in PyMuPDF with this code:

import fitz

doc = fitz.Document('C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub')
page = doc[331]  # for some reason, the same page has a different index here
html = page.get_text("html")

with open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\py-test.html", "w") as file:
    file.write(html)

I got this result: image

The image is included in Base64 format.
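To compare the two outputs without opening them in a browser, a rough check like the one below might help. It is only a heuristic sketch: it assumes the image is embedded as an `<img>` tag with a `data:image/...;base64,` source, and `has_inline_image` is a hypothetical helper name, not part of mupdf-rs or PyMuPDF.

```rust
/// Heuristic: returns true if the HTML appears to contain an <img> tag whose
/// source is an inline base64 data URI. This is a plain substring check, not
/// an HTML parser, so it can be fooled by unusual markup.
fn has_inline_image(html: &str) -> bool {
    html.contains("<img")
        && html.contains("src=\"data:image/")
        && html.contains(";base64,")
}
```

Running this on the contents of `rs-test.html` and `py-test.html` should show quickly whether either file carries an embedded image.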

I also tried the same thing with the `mutool convert` CLI and got the same result as with PyMuPDF, but only after enabling an option. I can't find any way to set this option in the `to_html` method of this crate. The options in mutool look like this:

Text output options:
        inhibit-spaces: don't add spaces between gaps in the text
        preserve-images: keep images in output
        preserve-ligatures: do not expand ligatures into constituent characters
        preserve-whitespace: do not convert all whitespace into space characters
        preserve-spans: do not merge spans on the same line
        dehyphenate: attempt to join up hyphenated words
        mediabox-clip=no: include characters outside mediabox
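As a possible workaround until the crate exposes these options, one could shell out to `mutool convert` from Rust. The sketch below assumes that `mutool` is on the `PATH` and that `preserve-images` is the option in question. The helpers `mutool_html_args` and `convert_with_images` are hypothetical names for illustration, not part of mupdf-rs.

```rust
use std::process::Command;

/// Build the argument list for `mutool convert`, requesting HTML output
/// with the `preserve-images` text option enabled.
/// `pages` is passed through unchanged as mutool's page selection.
fn mutool_html_args(input: &str, output: &str, pages: &str) -> Vec<String> {
    vec![
        "convert".to_string(),
        "-F".to_string(),
        "html".to_string(),
        "-O".to_string(),
        "preserve-images".to_string(),
        "-o".to_string(),
        output.to_string(),
        input.to_string(),
        pages.to_string(),
    ]
}

/// Run `mutool convert` and report whether it exited successfully.
/// Returns an error if the `mutool` binary cannot be started at all.
fn convert_with_images(input: &str, output: &str, pages: &str) -> std::io::Result<bool> {
    let status = Command::new("mutool")
        .args(mutool_html_args(input, output, pages))
        .status()?;
    Ok(status.success())
}
```

A call such as `convert_with_images("test.epub", "out.html", "1")` would then write the HTML for the selected page to `out.html`. The page number here is only a placeholder. This trades the in-process binding for an external process, so it is a stopgap rather than a fix for `to_html` itself.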
messense commented 1 year ago

Sorry, this project is not actively maintained at the moment, but I'm happy to accept pull requests to fix this if anyone is up for it.