Offer EPUB download

Throne3d commented 7 years ago

Rather than having site data be scraped and then processed into an EPUB format, actually offer these EPUBs from the Constellation itself.

Issues:

Downloading the images could pose a security risk, or could be rather large in size; linking to online images means the EPUBs would be smaller but would not work properly offline.
Certain replies would need to be sanitized, especially if they currently involve (hacky) quoting

teceler commented 5 years ago

HalCanary commented 3 years ago

I once scraped a thread to make an ebook for reading on a Kindle. I removed all images, as well as the timestamp to make it look right. I also removed all extra boxes and margins.

I would love to take a look at implementing this feature server-side, maybe just outputting a simple Markdown document for a given thread.

HalCanary commented 3 years ago

Here's my pass at this: https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab

Throne3d commented 3 years ago

Hey @HalCanary! Thanks for the interest. While that's an interesting proof of concept, I don't think it'll help too much for a few different reasons – I've done some work on EPUB generation before, and the issues mostly aren't in the bare-bones generation. In particular, though:

Our stack mostly runs on Ruby, and due to hosting constraints, we're likely to have trouble executing Python at the same time (especially needing to communicate between the processes, and given your snippet is in Python 2, EOL January last year)
It looks like you're relying on an external process, ebook-convert, for most of the heavy lifting – we might be able to install that, but we'd need to investigate it more thoroughly (and the same hosting constraints probably get in the way here)
It doesn't really consider the main concerns listed in the post body, around security and sanitization; but we have additional constraints in recent development around performance, which this is likely to hit into (loading all the pages of a post can be quite memory intensive)
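
To illustrate the memory concern: one common way to avoid holding every page of a post at once is to stream replies page by page and write them out as they arrive. This is a minimal Python sketch, not anything from the Glowfic codebase; `fetch_page` is a hypothetical callable that returns the replies for one page and an empty list past the last page.

```python
from typing import Callable, Iterator, List, TextIO


def iter_replies(fetch_page: Callable[[int], List[str]]) -> Iterator[str]:
    """Yield replies one at a time, fetching a single page per step."""
    page = 1
    while True:
        replies = fetch_page(page)
        if not replies:
            return  # an empty page marks the end of the thread
        yield from replies
        page += 1


def write_thread(fetch_page: Callable[[int], List[str]], out: TextIO) -> int:
    """Write each reply as it arrives and return how many were written."""
    count = 0
    for reply in iter_replies(fetch_page):
        out.write(reply + "\n")
        count += 1
    return count
```

With this shape, peak memory is bounded by the size of one page rather than the whole thread, though the total work stays the same.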

This issue also isn't super high priority right now, and our dev efforts have been quite limited recently.

This could work for individuals looking to scrape ebooks from the site – maybe it would be useful shared in one of the communities focused around Glowfic and the Constellation (e.g. the Discord servers)?

l1n commented 3 years ago

Hal, feel free to contact me on Discord ([removed]) if you'd like to set this up on glowficrss.com (which provides some auxiliary utilities and has some extra hosting resources).

HalCanary commented 3 years ago

ebook-convert comes with Calibre. Providing a single-page simple HTML document for download would allow anyone with Calibre to do the conversion themselves.

I don't know Ruby and don't have the bandwidth to learn it at the moment.

One question: the post.content field is sometimes formatted with <p>, sometimes with <br>, and sometimes with newlines. Are you just looking for those strings to determine which it is? Is it ever a mixture? I ask because I like indented paragraphs in ebooks.

l1n commented 3 years ago

Flat view for posts exists and might do what you want for this. I haven't tried it with Calibre specifically, since you can retrieve replies using the API instead.

Throne3d commented 3 years ago

I don't remember exactly which format we expose in the API (if we do anything to it), but in the HTML we have three different representations:

If we see <p> or <br> tags (matches /<p( [^>]*)?>/ or the equivalent pattern for <br>), we treat it as HTML that wants to handle its own linebreaks
Otherwise, if we see a <blockquote> tag (/<blockquote( |>)/), we just replace linebreaks with <br>, because <blockquote> and other block elements interact poorly with our simplistic auto-paragraphing below
Otherwise, we chunk it by linebreaks: splitting by \n\n, we wrap each section in paragraphs, filling empty paragraphs with a non-breaking space (&nbsp;) to ensure they display, and turning any remaining single linebreaks within paragraphs into <br>s
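
The three rules above can be sketched roughly as follows. This is an illustrative Python version, not the site's Ruby code: the function name `format_content` is hypothetical, the `<br>` pattern is an assumption (only the `<p>` and `<blockquote>` patterns appear in the comment), and it assumes the tags in question are `<p>`, `<br>`, and `&nbsp;`.

```python
import re

P_TAG = re.compile(r"<p( [^>]*)?>")                # pattern quoted in the thread
BR_TAG = re.compile(r"<br ?/?>")                   # assumed pattern for <br>
BLOCKQUOTE_TAG = re.compile(r"<blockquote( |>)")   # pattern quoted in the thread


def format_content(content: str) -> str:
    """Apply the three-way linebreak handling described above."""
    # Rule 1: content with its own <p> or <br> tags is left alone.
    if P_TAG.search(content) or BR_TAG.search(content):
        return content
    # Rule 2: with a <blockquote>, only convert linebreaks to <br>.
    if BLOCKQUOTE_TAG.search(content):
        return content.replace("\n", "<br>")
    # Rule 3: split on blank lines and wrap each chunk in a paragraph.
    paragraphs = []
    for chunk in content.split("\n\n"):
        # Empty chunks get a non-breaking space so they still display.
        body = chunk.replace("\n", "<br>") if chunk.strip() else "&nbsp;"
        paragraphs.append(f"<p>{body}</p>")
    return "".join(paragraphs)
```

For example, `format_content("a\nb\n\nc")` gives `<p>a<br>b</p><p>c</p>`. The sketch ignores `\r\n` line endings, which a real implementation would likely need to normalise first.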

HalCanary commented 3 years ago

I hacked on my Python script last night:

I split it into three scripts:
1. One downloads the entire thread and saves the data structure as a single JSON file.
2. The second converts the JSON file to HTML. This is where all of my opinions on how to style ebooks so they look nice on a Kindle or on a computer screen come in.
3. The final script calls into Calibre's ebook-convert to convert the HTML into EPUB and MOBI, setting the --title and --authors attributes correctly, as well as putting the source URL into the --comments section.
The scripts run fine on either Python 2 or Python 3.
I send the post content through the lxml package to clean it up, resulting in well-formed HTML.
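
As a rough illustration of the third step, the ebook-convert call might be assembled like this. It is a sketch only and not taken from the gist: `build_convert_command` and its parameters are hypothetical names, and it uses only the `--title`, `--authors`, and `--comments` options mentioned above.

```python
from typing import List


def build_convert_command(html_path: str, out_path: str, title: str,
                          authors: List[str], source_url: str) -> List[str]:
    """Build an ebook-convert argument list; the output format follows out_path's extension."""
    return [
        "ebook-convert", html_path, out_path,
        "--title", title,
        # Calibre expects multiple authors separated by ampersands.
        "--authors", " & ".join(authors),
        "--comments", source_url,
    ]
```

Passing the resulting list to `subprocess.run(..., check=True)` would run the conversion, assuming Calibre is installed and `ebook-convert` is on the PATH. Building the command as a list rather than a shell string avoids quoting problems with titles that contain spaces or punctuation.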

https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab

I did not know about flat view.

QuartzLibrary commented 2 years ago

Hi,

Just wanted to log here that I've written an alternative converter.

Not really looking for this to be integrated in Glowfic, but it does address a few of the concerns raised, in case you decide to prioritise EPUBs in the future:

> Downloading the images could pose a security risk

This isn't really addressed, though linking to images is easy of course.

> Certain replies would need to be sanitized, especially if they currently involve (hacky) quoting

Currently the content is parsed, sanitised, and then serialised. A full run of the entire database would be needed to find less stringent rules that work well.
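
For readers unfamiliar with the parse, sanitise, serialise approach, here is a minimal Python sketch of the general idea. It is not the converter's actual (Rust) implementation; the allowlist and the `sanitize` name are assumptions chosen for illustration.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "em", "strong", "blockquote"}  # illustrative allowlist (assumption)
VOID_TAGS = {"br"}                                        # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}                        # tags whose text is discarded too


class Sanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # all attributes are discarded
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)
        # any other tag is dropped, but its text is kept

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Close anything left open inside this tag so output stays well-formed.
            while True:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(content: str) -> str:
    """Parse content, keep only allowlisted tags, and serialise well-formed HTML."""
    parser = Sanitizer()
    parser.feed(content)
    parser.close()
    while parser.open_tags:  # close tags the input never closed
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

A strict allowlist like this is the "stringent" end of the trade-off: it is safe by construction but throws away formatting, which is why surveying real content would be needed before loosening it.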

> Our stack mostly runs on Ruby, and due to hosting constraints, we're likely to have trouble executing Python at the same time (especially needing to communicate between the processes [...])

Not sure how Rust programs are usually called from Ruby, though executing it as a static binary or separate service might work well. This could also in theory be adapted to work directly in the browser (though possibly a performance no-no given no way of caching) since CORS wouldn't be a problem for you.

> relying on an external process

The program is standalone and can be compiled to a static binary.

> [...] we have additional constraints in recent development around performance, which this is likely to hit into (loading all the pages of a post can be quite memory intensive)

Presumably caching could take up a lot of the load, but that would require more work unless a hacky file cache like the one I already implemented was used.
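
A simple file cache of the kind mentioned here could look something like the sketch below. It is not the cache from the converter itself; `cached_epub`, its parameters, and the idea of keying on a post ID plus a last-updated value are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_epub(cache_dir: Path, post_id: int, updated_at: str,
                build: Callable[[], bytes]) -> bytes:
    """Return the cached file for this post version, building it only on a miss."""
    # Including updated_at in the key means an edited post gets a fresh entry.
    key = hashlib.sha256(f"{post_id}:{updated_at}".encode()).hexdigest()
    path = cache_dir / f"{key}.epub"
    if path.exists():
        return path.read_bytes()
    data = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

The expensive `build` step then runs once per post version. This sketch never evicts stale entries, so a real deployment would need some cleanup policy.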

As a side note, one thing I missed in the API was a way to get a list of posts in a board or board_section. Not really asking for it, I can parse the website if needed in the future, but saying so in case there turns out to be something I missed.

That's it. Thank you for working on Glowfic!

glowfic-constellation / glowfic

Offer EPUB download #9