Open Throne3d opened 7 years ago
I once scraped a thread to make an ebook for reading on a Kindle. I removed all images, as well as the timestamp to make it look right. I also removed all extra boxes and margins.
I would love to take a look at implementing this feature server-side; maybe just output a simple markdown document for a given thread.
Here's my pass at this: https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab
Hey @HalCanary! Thanks for the interest. While that's an interesting proof of concept, I don't think it'll help too much for a few different reasons – I've done some work on EPUB generation before, and the issues mostly aren't in the bare-bones generation. In particular, though:
- `ebook-convert`, for most of the heavy lifting – we might be able to install that, but we'd need to investigate it more thoroughly (and the same hosting constraints probably get in the way here)

This issue also isn't super high priority right now, and our dev efforts have been quite limited recently.
This could work for individuals looking to scrape ebooks from the site – maybe it would be useful shared in one of the communities focused around Glowfic and the Constellation (e.g. the Discord servers)?
Hal, feel free to contact me on Discord ([removed]) if you'd like to set this up on glowficrss.com (which provides some auxiliary utilities and has some extra hosting resources).
`ebook-convert` comes with Calibre. Providing a single-page simple HTML document for download would allow anyone with Calibre to do the conversion themselves.
I don't know Ruby and don't have the bandwidth to learn it at the moment.
One question: the `post.content` field is sometimes formatted with `<p>`, sometimes with `<br>`, and sometimes with newlines. Are you just looking for those strings to determine which it is? Is it ever a mixture? I ask because I like indented paragraphs in ebooks.
Flat view for posts exists and might do what you want for this. I haven't tried it with Calibre specifically, since you can retrieve replies using the API instead.
I don't remember exactly which format we expose in the API (if we do anything to it), but in the HTML we have three different representations:
- If the content contains `<p>` or `<br>` tags (matches `/<p( [^>]*)?>/` or `/<br *\/?>/`), we treat it as HTML that wants to handle its own linebreaks
- If it contains a `<blockquote>` tag (`/<blockquote( |>)/`), we just replace linebreaks with `<br>` - `<blockquote>` and other block elements interact poorly with our simplistic auto-paragraphing below
- Otherwise, splitting on `\n\n`, we wrap each section in paragraphs, filling empty paragraphs with `&nbsp;` to ensure they display, and turning any remaining single linebreaks within paragraphs into `<br />`s
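The three cases above can be sketched in Python roughly as follows. This is only an illustration of the described behaviour, not the site's actual Ruby code: the function name `format_content` is hypothetical, the regexes are copied from the description, and the use of `&nbsp;` as the filler for empty paragraphs is an assumption.

```python
import re

# Patterns taken from the description above.
P_TAG = re.compile(r"<p( [^>]*)?>")
BR_TAG = re.compile(r"<br *\/?>")
BLOCKQUOTE_TAG = re.compile(r"<blockquote( |>)")
LINEBREAK = re.compile(r"\r\n|\r|\n")


def format_content(content):
    """Hypothetical sketch of the three-way formatting decision."""
    # Case 1: explicit <p> or <br> tags mean the author handles linebreaks.
    if P_TAG.search(content) or BR_TAG.search(content):
        return content
    # Case 2: a <blockquote> is present, so only swap linebreaks for <br>.
    if BLOCKQUOTE_TAG.search(content):
        return LINEBREAK.sub("<br>", content)
    # Case 3: auto-paragraph. Split on blank lines and wrap each section in <p>.
    normalized = LINEBREAK.sub("\n", content)
    paragraphs = []
    for section in normalized.split("\n\n"):
        # Assumption: empty paragraphs are filled with &nbsp; so they display.
        body = section.replace("\n", "<br />") if section else "&nbsp;"
        paragraphs.append("<p>" + body + "</p>")
    return "".join(paragraphs)
```

A scraper that wants indented paragraphs could apply the same checks to decide whether a reply already carries its own markup before adding any.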
I hacked on my Python script last night:
I split it into three scripts:
- `ebook-convert` to convert the HTML into EPUB and MOBI, setting the `--title` and `--authors` attributes correctly, as well as putting the source URL into the `--comments` section.

The scripts run fine on either Python 2 or Python 3.
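For illustration, the `ebook-convert` call described above might look like the sketch below. It is not taken from the gist: the helper names `build_convert_command` and `convert` are hypothetical, the file names are placeholders, and it uses `subprocess.run`, so it assumes Python 3 and a Calibre installation on the `PATH`.

```python
import subprocess


def build_convert_command(html_path, out_path, title, authors, source_url):
    """Build the argument list for Calibre's ebook-convert (hypothetical helper).

    The output format (EPUB, MOBI, ...) is chosen by the extension of out_path.
    """
    return [
        "ebook-convert",
        html_path,
        out_path,
        "--title", title,
        # Calibre expects multiple authors separated by ampersands.
        "--authors", " & ".join(authors),
        # Store the source URL in the book's comments metadata.
        "--comments", source_url,
    ]


def convert(html_path, out_path, title, authors, source_url):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    command = build_convert_command(html_path, out_path, title, authors, source_url)
    subprocess.run(command, check=True)


# Example (placeholder values, requires Calibre):
# convert("thread.html", "thread.epub", "Some Thread", ["Author One"],
#         "https://example.com/posts/1")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with titles that contain spaces or punctuation.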
I send the post content through the `lxml` package to clean it up, resulting in well-formed HTML.
https://gist.github.com/HalCanary/f9291e75122e60b10ed46dbfd1f272ab
I did not know about flat view.
Hi,
Just wanted to log here that I've written an alternative converter.
Not really looking for this to be integrated in Glowfic, but it does address a few of the concerns raised, in case you decide to prioritise EPUBs in the future:
- Downloading the images could pose a security risk
This isn't really addressed, though linking to images is easy of course.
- Certain replies would need to be sanitized, especially if they currently involve (hacky) quoting
Currently the content is parsed, sanitised, and then serialised. A full run of the entire database would be needed to find less stringent rules that work well.
- Our stack mostly runs on Ruby, and due to hosting constraints, we're likely to have trouble executing Python at the same time (especially needing to communicate between the processes [...])
Not sure how Rust programs are usually called from Ruby, though executing it as a static binary or separate service might work well. This could also in theory be adapted to work directly in the browser (though possibly a performance no-no given no way of caching) since CORS wouldn't be a problem for you.
- relying on an external process
The program is standalone and can be compiled to a static binary.
- [...] we have additional constraints in recent development around performance, which this is likely to hit into (loading all the pages of a post can be quite memory intensive)
Presumably caching could take up a lot of the load, but that would require more work unless a hacky file cache like the one I already implemented was used.
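As a rough idea of what a simple file cache for fetched pages could look like, here is a minimal Python sketch. It is not the converter's actual implementation (which is a separate Rust program); the names `cache_path`, `fetch_cached`, and the injectable `fetch` parameter are hypothetical, and it does no expiry or invalidation.

```python
import hashlib
import os
import urllib.request


def cache_path(cache_dir, url):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".html")


def _http_get(url):
    """Default fetcher: download the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir, fetch=None):
    """Return the body for url, reading from the on-disk cache when possible."""
    path = cache_path(cache_dir, url)
    if os.path.exists(path):
        with open(path, "rb") as cached:
            return cached.read()
    # Cache miss: fetch, then store the bytes for the next call.
    body = (fetch or _http_get)(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as out:
        out.write(body)
    return body
```

A cache like this only saves repeated downloads; it would not reduce the memory needed to assemble a long thread in one go.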
As a side note, one thing I missed in the API was a way to get a list of `post`s in a `board` or `board_section`.
Not really asking for it, I can parse the website if needed in the future, but saying so in case there turns out to be something I missed.
That's it. Thank you for working on Glowfic!
Rather than having site data be scraped and then processed into an EPUB format, actually offer these EPUBs from the Constellation itself.
Issues: