Open mre opened 3 years ago
well there's a nuclear option...
We can always convert anything to HTML with pandoc. The downside is that the document gets parsed twice: once by pandoc and once by html5ever (if the output is HTML).
it's easy: you can use acat (extract files to standard out) from atool and pipe to lychee stdin
acat -F zip {file.epub} "*.xhtml" "*.html" | lychee -
i wrote in the past here: https://literarymachin.es/epub-linkrot/ also epub-linkchecker was a quick experiment of mine. i discovered lychee today, this replaces my previous hacks
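For anyone without atool installed, the extraction step can be approximated with a few lines of standard-library Python, since an EPUB is a ZIP archive. This is only a sketch under that assumption, not what acat or lychee do internally; the function name `iter_epub_documents` and the script name in the comment are made up for illustration.

```python
import sys
import zipfile


def iter_epub_documents(epub, suffixes=(".xhtml", ".html")):
    """Yield (member_name, text) for each HTML-like file inside an EPUB.

    `epub` may be a path or a binary file-like object. Members are matched
    by file extension only, mirroring the "*.xhtml" "*.html" patterns above.
    """
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith(suffixes):
                # Assumption: content documents are UTF-8; bad bytes are replaced.
                yield name, archive.read(name).decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python epub_cat.py book.epub | lychee -
    for _name, text in iter_epub_documents(sys.argv[1]):
        sys.stdout.write(text)
```

The output is the concatenated markup of all matching files, which is the same shape of input the acat pipeline hands to `lychee -`.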
That's cool! I will add that to the README.md. Thanks for the hint and for epub-linkchecker, which served as an inspiration.
Done. I like the approach of using acat in combination with lychee: each tool is responsible for exactly one task. However, I'm still considering supporting epub natively in the end. acat is written in Perl, and it would be nice to have a pure-Rust version at some point. There is an epub crate which we could use. I'll keep this issue open for future discussion.
We want to use pandoc for another use-case (https://github.com/lycheeverse/lychee/issues/291), so that's probably the way to go as it supports many different formats. I still don't know if we want to bundle the pandoc binary with lychee or just check whether it's in the PATH.
The way it could work is
lychee book.epub
This would "just work" if pandoc is in the PATH
and if not, it would throw an error:
Cannot read `book.epub`: Please install `pandoc` for handling epub files.
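The lookup-then-fail behaviour described above could look roughly like this. It is a Python sketch of the idea, not lychee code; `require_pandoc` and `MissingToolError` are hypothetical names, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil


class MissingToolError(RuntimeError):
    """Raised when an external converter is needed but not installed."""


def require_pandoc(input_path, which=shutil.which):
    """Return the path to the pandoc executable, or fail with a clear message."""
    executable = which("pandoc")  # searches the directories listed in PATH
    if executable is None:
        raise MissingToolError(
            f"Cannot read `{input_path}`: "
            "Please install `pandoc` for handling epub files."
        )
    return executable
```

Doing the lookup lazily, only when an epub is actually passed in, keeps pandoc an optional dependency for everyone who checks plain HTML or Markdown.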
In addition, we could call `pandoc --from=FORMAT` to see if a file type is supported when lychee doesn't support it itself.
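One way to sketch that support check is shown below. Note that it swaps in a different technique: instead of probing with `--from=FORMAT`, it asks pandoc for its reader list via `pandoc --list-input-formats`, on the assumption that a probe with empty input could fail for container formats such as epub even when they are supported. The function names are hypothetical, and `run` is injectable only for testing.

```python
import subprocess


def parse_formats(listing):
    """Turn pandoc's one-format-per-line listing into a set of names."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def pandoc_input_formats(run=subprocess.run):
    """Ask the installed pandoc which input formats it can read."""
    result = run(
        ["pandoc", "--list-input-formats"],
        capture_output=True,
        text=True,
        check=True,  # raise if pandoc exits with an error
    )
    return parse_formats(result.stdout)


def pandoc_supports(fmt, formats):
    """Return True if `fmt` (e.g. "epub") is among pandoc's input formats."""
    return fmt in formats
```

The set can be computed once per run and reused for every file lychee does not handle itself.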
See https://github.com/phiresky/ripgrep-all/blob/2d63efd3156a2eca633b1b43d67931fe1cb0df6e/src/adapters/custom.rs for how ripgrep-all calls pandoc as a subprocess.
They require pandoc as a dependency and don't ship it directly. We can do this as well, and throw an error when it's needed but not found.
As for the lychee-action, we can bundle one for convenience since that's not very big.
Would be nice to help out creators of ebooks and check for broken links. That seems to be a real problem. There is a tool called epub-linkchecker, but development seems to have stalled. I don't know much about epubs, but it looks like it's some XML-like format with normal `<a>` tags, so it might not be too hard to parse (either with our html link extractor or the plaintext extractor). If there are any people interested in trying to add support, add a comment here. We can provide some guidance if needed.
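To illustrate why ordinary `<a>` tags make this tractable, here is a small stand-in for a link extractor built on Python's standard `html.parser`. It is not lychee's extractor; `LinkCollector` and `extract_links` are hypothetical names, and only `href` attributes on `<a>` elements are collected.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing <a .../>
        # elements are routed here too by the default handle_startendtag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return all <a href> targets found in an (X)HTML string, in order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links
```

Relative targets such as `ch2.xhtml#s1` are returned unchanged, so a real checker would still have to decide whether to resolve them inside the book or skip them.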