glowfic-constellation / glowfic

The Glowfic Constellation
https://www.glowfic.com
MIT License

Offer EPUB download #9

Open Throne3d opened 7 years ago

Throne3d commented 7 years ago

Rather than having site data be scraped and then processed into an EPUB format, actually offer these EPUBs from the Constellation itself.

Issues:

teceler commented 5 years ago

Old todo related to this: https://github.com/Marri/glowfic/blob/ad00696b4d2eeb0a3e19aec9fe6adf270a45fd79/app/views/posts/show.haml#L34-L38

HalCanary commented 3 years ago

I once scraped a thread to make an ebook for reading on a Kindle. To make it look right, I removed all images and the timestamps, as well as all extra boxes and margins.

I would love to take a look at implementing this feature server-side, maybe by just outputting a simple Markdown document for a given thread.

HalCanary commented 3 years ago

Here's my pass at this: https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab

Throne3d commented 3 years ago

Hey @HalCanary! Thanks for the interest. While that's an interesting proof of concept, I don't think it'll help too much for a few different reasons – I've done some work on EPUB generation before, and the issues mostly aren't in the bare-bones generation. In particular, though:

This issue also isn't super high priority right now, and our dev efforts have been quite limited recently.

This could work for individuals looking to scrape ebooks from the site – maybe it would be useful shared in one of the communities focused around Glowfic and the Constellation (e.g. the Discord servers)?

l1n commented 3 years ago

Hal, feel free to contact me on Discord ([removed]) if you'd like to set this up on glowficrss.com (which provides some auxiliary utilities and has some extra hosting resources).

HalCanary commented 3 years ago

ebook-convert comes with Calibre. Providing a single-page simple HTML document for download would allow anyone with Calibre to do the conversion themselves.

I don't know Ruby and don't have the bandwidth to learn it at the moment.

One question: the post.content field is sometimes formatted with <p>, sometimes with <br>, and sometimes with newlines. Are you just looking for those strings to determine which it is? Is it ever a mixture? I ask because I like indented paragraphs in ebooks.
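
One way to sidestep the question of which delimiter a given reply uses is to treat all three as paragraph breaks, which also copes with mixtures. The following is a minimal Python sketch under that assumption; the function names `detect_format` and `split_paragraphs` are hypothetical, and this is not a description of how the site itself decides.

```python
import re


def detect_format(content):
    """Guess which paragraph delimiter a reply's content uses."""
    lowered = content.lower()
    if re.search(r"<p[\s>]", lowered):
        return "p"
    if re.search(r"<br\s*/?>", lowered):
        return "br"
    return "newline"


def split_paragraphs(content):
    """Split content into paragraphs, whichever delimiter (or mixture) it uses."""
    # Closing </p> tags, <br> tags, and newlines all become paragraph breaks.
    text = re.sub(r"(?i)</p\s*>", "\n", content)
    # Opening <p ...> tags carry no text of their own, so drop them.
    text = re.sub(r"(?i)<p(\s[^>]*)?>", "", text)
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    # Empty pieces come from consecutive breaks; discard them.
    return [part.strip() for part in text.split("\n") if part.strip()]
```

Each returned paragraph can then be wrapped in its own `<p>` element, which makes indented paragraphs a matter of one CSS rule in the generated HTML.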

l1n commented 3 years ago

A flat view for posts exists and might do what you want here. I haven't tried it with Calibre specifically, since you can retrieve replies using the API instead.

Throne3d commented 3 years ago

I don't remember exactly which format we expose in the API (if we do anything to it), but in the HTML we have three different representations:

HalCanary commented 3 years ago

I hacked on my Python script last night:

  1. I split it into three scripts:
    1. One downloads the entire thread and saves the data structure as a single JSON file.
    2. The second converts the JSON file to HTML. This is where all of my opinions on how to style ebooks so they look nice on a Kindle or on a computer screen come in.
    3. The final script calls into Calibre's ebook-convert to convert the HTML into EPUB and MOBI, setting the --title and --authors options correctly and putting the source URL into --comments.
  2. The scripts run fine on either Python 2 or Python 3.
  3. I send the post content through the lxml package to clean it up, resulting in well-formed HTML.
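
The third script described above could look roughly like the sketch below. It is not the code from the gist: the function names and arguments are hypothetical, it assumes Python 3 and that Calibre's `ebook-convert` is on the `PATH`, and it uses only the `--title`, `--authors`, and `--comments` options mentioned above.

```python
import os
import subprocess


def build_convert_command(html_path, out_path, title, authors, source_url):
    """Return the argument list for one ebook-convert run."""
    return [
        "ebook-convert", html_path, out_path,
        "--title", title,
        # Calibre expects multiple authors separated by ampersands.
        "--authors", " & ".join(authors),
        "--comments", source_url,
    ]


def convert(html_path, title, authors, source_url, formats=("epub", "mobi")):
    """Convert one HTML file into each requested format next to the input."""
    base, _ = os.path.splitext(html_path)
    for ext in formats:
        command = build_convert_command(
            html_path, base + "." + ext, title, authors, source_url)
        # check=True raises CalledProcessError if ebook-convert fails.
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to inspect or log the exact invocation without having Calibre installed.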

https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab


I did not know about flat view.

QuartzLibrary commented 2 years ago

Hi,

Just wanted to log here that I've written an alternative converter.


I'm not really looking for this to be integrated into Glowfic, but it does address a few of the concerns raised, in case you decide to prioritise EPUBs in the future:

  • Downloading the images could pose a security risk

This isn't really addressed, though linking to images is easy of course.

  • Certain replies would need to be sanitized, especially if they currently involve (hacky) quoting

Currently the content is parsed, sanitised, and then serialised. A full run of the entire database would be needed to find less stringent rules that work well.
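
To illustrate the parse, sanitise, serialise approach in general terms, here is a small Python sketch using only the standard library. It is not the converter's actual code (which is a separate program), and the allowlist in `ALLOWED_TAGS` is an assumption chosen for the example; real rules would have to come from checking actual site content.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist for illustration only.
ALLOWED_TAGS = {"p", "br", "em", "strong", "i", "b", "blockquote"}
VOID_TAGS = {"br"}


class Sanitiser(HTMLParser):
    """Re-serialise HTML, keeping only allowlisted tags and no attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still kept
        if tag in VOID_TAGS:
            self.out.append("<%s/>" % tag)
        else:
            self.out.append("<%s>" % tag)
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.open_tags:
            return  # ignore stray or disallowed closing tags
        # Close anything left open inside this element so output stays well-formed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitise(content):
    parser = Sanitiser()
    parser.feed(content)
    parser.close()
    # Close any elements the input never closed.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)
```

Note that this sketch keeps the text inside dropped elements, which is usually what a reader wants for unknown wrappers but would need an exception for elements such as `script`.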

  • Our stack mostly runs on Ruby, and due to hosting constraints, we're likely to have trouble executing Python at the same time (especially needing to communicate between the processes [...])

I'm not sure how Rust programs are usually called from Ruby, though executing it as a static binary or as a separate service might work well. In theory this could also be adapted to run directly in the browser, since CORS wouldn't be a problem for you (though that may be a performance no-no, given that there would be no way of caching).

  • relying on an external process

The program is standalone and can be compiled to a static binary.

  • [...] we have additional constraints in recent development around performance, which this is likely to hit into (loading all the pages of a post can be quite memory intensive)

Presumably caching could take up a lot of the load, but that would require more work unless a hacky file cache like the one I already implemented was used.
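
For readers unfamiliar with the idea, a "hacky file cache" of this kind can be very small. The sketch below is a generic Python illustration, not the converter's implementation; the class name `FileCache`, the one-hour default expiry, and the `fetch` callback are all assumptions.

```python
import hashlib
import os
import time


class FileCache:
    """Cache fetched pages on disk, keyed by a hash of the URL."""

    def __init__(self, directory, max_age_seconds=3600):
        self.directory = directory
        self.max_age_seconds = max_age_seconds
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        # Hashing avoids characters that are not valid in file names.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest)

    def get(self, url, fetch):
        """Return cached bytes for url, calling fetch(url) on a miss or stale entry."""
        path = self._path(url)
        if os.path.exists(path):
            age = time.time() - os.path.getmtime(path)
            if age < self.max_age_seconds:
                with open(path, "rb") as handle:
                    return handle.read()
        body = fetch(url)
        with open(path, "wb") as handle:
            handle.write(body)
        return body
```

Expiry by file age is the simplest policy; a thread that is still being written to would need either a short expiry or invalidation based on the post's last-updated time.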


As a side note, one thing I missed in the API was a way to get a list of posts in a board or board_section. Not really asking for it, I can parse the website if needed in the future, but saying so in case there turns out to be something I missed.

That's it. Thank you for working on Glowfic!