lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!

https://lychee.cli.rs

Apache License 2.0

Support epub format #202

Open mre opened 3 years ago

mre commented 3 years ago

It would be nice to help out creators of ebooks by checking for broken links. That seems to be a real problem. There is a tool called epub-linkchecker, but development seems to have stalled. I don't know much about epubs, but it looks like it's some XML-like format with normal <a> tags, so it might not be too hard to parse (either with our HTML link extractor or the plaintext extractor).
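To make the idea concrete, here is a rough sketch in Python (not lychee's Rust code, and not how lychee actually extracts links). It assumes an EPUB is a ZIP archive whose content documents end in .xhtml, .html or .htm, and it only collects href values from <a> tags. The names LinkCollector and extract_epub_links are made up for illustration.

```python
import zipfile
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <a href="..."/>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_epub_links(epub):
    """Return all <a href> values from the (X)HTML files inside an EPUB.

    `epub` may be a filesystem path or a binary file-like object.
    Assumption: content documents are identified by file extension only.
    """
    links = []
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = LinkCollector()
                text = archive.read(name).decode("utf-8", errors="replace")
                parser.feed(text)
                parser.close()
                links.extend(parser.links)
    return links
```

The returned list would still contain relative and fragment-only links (for example chapter cross-references), so a real checker would need to decide how to treat those.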

If there are any people interested in trying to add support, add a comment here. We can provide some guidance if needed.

lebensterben commented 3 years ago

Well, there's a nuclear option...

We can always convert anything to HTML with pandoc. The downside is that the input is parsed twice: once by pandoc and once by html5ever (if the output is HTML).

atomotic commented 3 years ago

It's easy: you can use acat (extract files to standard out) from atool and pipe the output to lychee's stdin:

acat -F zip {file.epub} "*.xhtml" "*.html" | lychee -

I wrote about this in the past here: https://literarymachin.es/epub-linkrot/. Also, epub-linkchecker was a quick experiment of mine. I discovered lychee today, and it replaces my previous hacks.

mre commented 3 years ago

That's cool! I will add that to the README.md. Thanks for the hint and for epub-linkchecker, which served as an inspiration.

mre commented 3 years ago

Done. I like the approach of using acat in combination with lychee. Each tool is responsible for exactly one task. However, I'm still considering supporting epub natively in the end. acat is written in Perl, and it would be nice to have a pure-Rust version at some point. There is an epub crate which we could use. I'll keep this issue open for future discussion.

mre commented 2 years ago

We want to use pandoc for another use case (https://github.com/lycheeverse/lychee/issues/291), so that's probably the way to go, as it supports many different formats. I still don't know whether we want to bundle the pandoc binary with lychee or just check whether it's in the PATH.

The way it could work is:

lychee book.epub

This would "just work" if pandoc is in the PATH and if not, it would throw an error:

Cannot read `book.epub`: Please install `pandoc` for handling epub files.
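
The behaviour described above could be sketched roughly like this. It is written in Python for brevity rather than lychee's Rust, and it is only an illustration of the idea, not lychee's implementation. The names epub_to_html and MissingPandocError are hypothetical, and the `which` parameter exists only so the lookup can be swapped out in tests. The pandoc options --from and --to are real pandoc flags.

```python
import shutil
import subprocess


class MissingPandocError(RuntimeError):
    """Raised when pandoc is required but not found in the PATH."""


def epub_to_html(path, which=shutil.which):
    """Convert an EPUB to HTML by calling pandoc as a subprocess.

    Raises MissingPandocError with the message proposed in the issue
    if no `pandoc` executable can be found in the PATH.
    """
    pandoc = which("pandoc")
    if pandoc is None:
        raise MissingPandocError(
            f"Cannot read `{path}`: "
            "Please install `pandoc` for handling epub files."
        )
    result = subprocess.run(
        [pandoc, "--from=epub", "--to=html", path],
        check=True,           # raise CalledProcessError if pandoc fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,
    )
    return result.stdout
```

The HTML returned here would then be handed to the regular HTML link extractor, which is where the "parsed twice" cost mentioned earlier comes from.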

mre commented 2 years ago

In addition, if lychee doesn't support a file type itself, we could call pandoc --from=FORMAT to see whether pandoc supports it.
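
One possible way to do that probe, sketched in Python: instead of trying a conversion with --from=FORMAT and watching for an error, this variant asks pandoc for its reader list via --list-input-formats (a real pandoc flag) and checks membership. That is a swapped-in technique, not what the comment above literally proposes. The function names and the `listing` parameter are hypothetical; `listing` exists so the check can be exercised without pandoc installed.

```python
import subprocess


def parse_input_formats(output):
    """Turn the output of `pandoc --list-input-formats` into a set of names."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def pandoc_supports(fmt, listing=None):
    """Return True if pandoc reports `fmt` as a supported input format.

    If `listing` is None, pandoc is invoked to obtain the format list;
    this assumes a `pandoc` executable is available in the PATH.
    """
    if listing is None:
        listing = subprocess.run(
            ["pandoc", "--list-input-formats"],
            check=True,
            capture_output=True,
            text=True,
        ).stdout
    return fmt in parse_input_formats(listing)
```

Mapping a file extension such as .epub to a pandoc format name is a separate step that is left out here.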

lebensterben commented 2 years ago

See https://github.com/phiresky/ripgrep-all/blob/2d63efd3156a2eca633b1b43d67931fe1cb0df6e/src/adapters/custom.rs for how ripgrep-all calls pandoc as a subprocess.

They require pandoc as a dependency and don't ship it directly. We can do the same and throw an error when pandoc is needed but not found.

As for lychee-action, we can bundle pandoc there for convenience, since it's not very big.