standardebooks / web

The source code for the Standard Ebooks website.
https://standardebooks.org
Creative Commons Zero v1.0 Universal

Add a script to easily generate the required ebook structure from a corpus of repos #8

Closed robinwhittleton closed 4 years ago

robinwhittleton commented 5 years ago

I guess you’ve got some GitHub API integration to rebuild the library for every release, but without it I’m finding it difficult (even with the README documentation on expected formats) to build the expected hierarchy of ebooks and ebook data. Would it be possible to add a script that, given a folder containing a set of cloned SE repos, could build the expected tree of data and potentially copy it into place?

acabal commented 5 years ago

I'll email you the script I use; it is not yet ready for public consumption :)

PyroLagus commented 5 years ago

Could I have the script too, please? I've managed to get a server set up, but I'm lost on how to set up the ebooks directory. I've tried to place one of the repos in www/ebooks with the built epubs in the dist folder, but 127.0.0.1:8080/ebooks/ just displays the 404 page.

acabal commented 5 years ago

If 127.0.0.1:8080/ebooks/ is a 404, are you certain it's actually hitting 127.0.0.1:8080/ebooks/index.php? Perhaps your server is misconfigured.

You can set up an example ebook in the website like so:

mkdir -p /path/to/se/root/www/ebooks/g-k-chesterton/the-man-who-was-thursday/
git clone https://github.com/standardebooks/g-k-chesterton_the-man-who-was-thursday /path/to/se/root/www/ebooks/g-k-chesterton/the-man-who-was-thursday/
mkdir -p /path/to/se/root/www/ebooks/g-k-chesterton/the-man-who-was-thursday/dist

I think that should be it.

PyroLagus commented 5 years ago

127.0.0.1:8080/ebooks/index.php is also a 404, even with the directory tree set up like that. It definitely executes ebooks.php, but I assume it triggers this exception.

This is what I have right now. The server starts up, php files get served correctly, but the ebook listing doesn't work. I'm probably doing something wrong, but I can't see what that would be.

acabal commented 5 years ago

Unfortunately, I'm not familiar with Vagrant. I suggest removing the catch block in index.php to see exactly what exception is being thrown. However, if you follow the bash script I included above, I don't see why it wouldn't work. (I haven't tested it though.)

PyroLagus commented 5 years ago

Removing the try/catch gives me this error:

Fatal error: Uncaught InvalidEbookException: Invalid repo filesystem path: /standardebooks.org/ebooks/g-k-chesterton_the-man-who-was-thursday_g-k-chesterton_the-man-who-was-thursday in /vagrant/lib/Ebook.php:66 Stack trace: #0 /vagrant/lib/Library.php(65): Ebook->__construct('/standardebooks...') #1 /vagrant/lib/Library.php(33): Library::GetEbooks() #2 /vagrant/www/ebooks/index.php(58): Library::GetEbooks('newest') #3 {main} thrown in /vagrant/lib/Ebook.php on line 66

It seems to be looking for the bare git repo but without the .git suffix? I'm not sure.

acabal commented 5 years ago

Right. You did not follow my shell script above. Try running it and it will create a nested directory hierarchy, not the flat one that you created.

PyroLagus commented 5 years ago

Okay, I made a mistake when setting the target directory for the git clone. I apparently ran git clone https://github.com/standardebooks/g-k-chesterton_the-man-who-was-thursday g-k-chesterton_the-man-who-was-thursday_g-k-chesterton_the-man-who-was-thursday for whatever reason. Sorry for wasting your time. After re-running the lines correctly and doing a --bare clone of the book repo in /standardebooks.org/ebooks/, it works like it should. It's only missing the images, which are easy enough to get from the website, but is there a nice way to generate them from the book repositories?

acabal commented 5 years ago

Not at the moment. It is automated, but the code that does it is rough and part of a larger system. For now you can copy them from the website, or tweak the website code to hard-code a single image for all the books.

PyroLagus commented 5 years ago

Could you take a look at this before I do a pull request?

https://github.com/PyroLagus/standardebooks-web/commit/8b685a65b54d5a8bbf4f7885196b0490e0f75fe1

acabal commented 5 years ago

Right now I don't think I want to script pulling the compiled SE corpus from the SE web server. People are sometimes not responsible, and bandwidth and server resources are a concern.

Instead, for the person interested in building the entire SE corpus on their local machine, the process should mimic what occurs on the server:

This is more time consuming for the local user, but less resource consuming for SE. And, most people won't be doing this anyway. Even for web development it's not necessary to have the entire corpus on the local host, maybe just 2 or 3 pages of books.

So the script to write would be one that pulls the raw corpus from GitHub onto the local machine. I would also make sure to be polite about the timing of sequential requests, maybe space them out by one or two seconds. There are only three non-ebook repos in the SE account: tools, manual, and web.
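A minimal sketch of such a polite clone loop, assuming Bash and a list of clone URLs supplied on stdin. The function name clone_politely, its arguments, and the two-second default are illustrative assumptions, not part of any existing SE script.

```bash
#!/bin/bash

# Clone each URL read from stdin into DEST_DIR, pausing between requests.
# Usage: clone_politely DEST_DIR [DELAY_SECONDS] < urls.txt
# (Hypothetical helper; the name and arguments are assumptions.)
clone_politely() {
	local dest_dir="$1"
	local delay="${2:-2}"
	local url name

	mkdir -p "${dest_dir}"

	while IFS= read -r url; do
		# Ignore blank lines in the input.
		if [ -z "${url}" ]; then
			continue
		fi

		name="$(basename "${url}" .git)"

		# Skip repos that already exist locally; no request is made for them.
		if [ -e "${dest_dir}/${name}" ]; then
			continue
		fi

		# Redirect stdin so git cannot consume the rest of the URL list.
		git clone --quiet "${url}" "${dest_dir}/${name}" < /dev/null \
			|| echo "Failed to clone ${url}" >&2

		# Space out sequential requests.
		sleep "${delay}"
	done
}
```

Calling `clone_politely /standardebooks.org/ebooks 2 < urls.txt` would clone every listed repo that is not yet present, waiting two seconds after each one.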

PyroLagus commented 5 years ago

Well, the script only pulls the cover and banner images from standardebooks.org (I actually know how to generate the cover images from the sources, but the "hero"/banner images are a mystery to me), not the ebook files. It gets the ebook sources from GitHub (or any other git repository including local), so the website shouldn't have any issues. It doesn't actually add compiled ebooks (epub, mobi, etc.) since I didn't want to assume how the user uses SE tools; whether it's in the path, from the cloned repo, used via venv, etc.

I thought I made that clear in the readme and the shell script. Is there anything I could do to clarify further?

acabal commented 5 years ago

OK, I've cleaned up the script I use to deploy an ebook to the website: https://github.com/standardebooks/web/blob/master/scripts/deploy-ebook-to-www

That script is what you should use to take a raw ebook source from /standardebooks.org/ebooks/ and deploy it to the local website. It builds all the necessary images and also the distributable files, and it also updates the OPDS feed and new releases RSS feed.

So, what we need your script to do is merely scrape GitHub for SE ebooks and clone them all to /standardebooks.org/ebooks. Then, we can add a section in the readme for how to initialize the SE website locally. It would basically be two steps: 1. Run the script to clone all ebook sources locally, and 2. run deploy-ebook-to-www /standardebooks.org/ebooks/* to deploy everything locally.

I prefer this approach because it does not depend on the remote SE server. If it blows up in the future, or is for whatever reason inaccessible, then we can rebuild everything locally.

PyroLagus commented 5 years ago

I haven't tested them yet, but something like this? https://github.com/PyroLagus/standardebooks-web/commit/71ef62ba9bfd73801b13f087068c97177707b85d

acabal commented 5 years ago

I think pull_ebooks.sh and update_ebooks.sh can be merged into one script, sync-ebooks. This script would accept one parameter, the directory where SE ebook sources are stored (i.e. /standardebooks.org/ebooks). Then, it would first iterate over all direct children of that folder and do git pull --rebase, and subsequently it would download any new ebooks from GitHub.
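The update pass described here could look roughly like the following sketch. It assumes Bash, a git new enough to support `-C`, and non-bare clones as direct children of the target directory; the function name update_existing is hypothetical.

```bash
#!/bin/bash

# Run `git pull --rebase` in every direct child of EBOOKS_DIR that is a git clone.
# Usage: update_existing EBOOKS_DIR
# (Hypothetical helper; not taken from the actual sync-ebooks script.)
update_existing() {
	local ebooks_dir="$1"
	local repo

	for repo in "${ebooks_dir}"/*/; do
		# Skip children that are not git working trees. This also covers the
		# case where the glob matched nothing and was left unexpanded.
		if [ ! -d "${repo}.git" ]; then
			continue
		fi

		# Stay quiet on success; report failures on stderr and keep going.
		git -C "${repo}" pull --rebase --quiet \
			|| echo "Failed to update ${repo}" >&2
	done
}
```

Downloading ebooks that are not yet present would then run as a second pass after this one.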

Later, we can add an option to sync-ebooks to only pull, or only download new ebooks. But we can put that aside for now.

I don't think all_ebook_urls.sh is strictly necessary. You can simply subsume that functionality into sync-ebooks.

I see you're using jq to parse GitHub output, but that's not a core utility. Since the use case here is simple can you use sed or grep instead, so that we don't add another dependency to the toolset?
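As a rough illustration of parsing without jq, the sketch below pulls clone URLs out of GitHub API JSON with grep and sed only. It assumes the response contains `"clone_url": "..."` pairs, as GitHub's repository listings do, and that `grep -o` is available (GNU and BSD grep both provide it). The function name is hypothetical, and pattern matching like this is only reasonable because the input is simple and predictable.

```bash
#!/bin/bash

# Print the clone URL of every repo found in GitHub API JSON read from stdin,
# leaving out the three non-ebook repos (tools, manual, web).
# (Hypothetical helper; the name is an assumption.)
list_ebook_clone_urls() {
	grep -o '"clone_url": *"[^"]*"' \
		| sed -e 's/^"clone_url": *"//' -e 's/"$//' \
		| grep -v -E '/(tools|manual|web)\.git$'
}
```

It would typically be fed from curl, for example `curl --silent "https://api.github.com/users/standardebooks/repos?per_page=100" | list_ebook_clone_urls`. That URL is an assumption about which listing endpoint to use, and a full listing would need to request further pages.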

PyroLagus commented 5 years ago

Sure, no problem. That said, it would still be nice to offer a way to clone only a few books, but I suppose that could just be mentioned in the readme as a one-liner (or nicely formatted three-liner) since that's just a 'while read...' loop.

acabal commented 5 years ago

I guess, but if the user just wants to clone a few then they can just do a few git clones... :)

PyroLagus commented 5 years ago

How's this? https://github.com/PyroLagus/standardebooks-web/commit/40e5658455ac6641d9f85f2f6535a2dda27138a3

acabal commented 5 years ago

Yes, that looks good. Take a look at deploy-ebook-to-www for an idea of Bash code style to apply. In particular you can copy and paste the top boilerplate for the script's help text.

In general we want to follow the Unix philosophy of "quiet unless there are problems." So by default the script should output nothing unless there are errors (and then to stderr). We can add a --verbose flag later if the user wants more details as to what's happening.

Also, we should not use environment variables to pass parameters to the script. Instead we should read them as flags. deploy-ebook-to-www has an example of basic flag parsing and you can adapt that to handle multiple different flags.

I'm curious as to why you trap ctrl + c. Doesn't ctrl + c just exit the script anyway? Why trap it, only to do what it already does by default?

PyroLagus commented 5 years ago

When there's a sub-process running in the foreground, ctrl+c will just kill that process and continue, which is particularly annoying in a loop. Trapping ensures that the script ends.
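A minimal sketch of such a trap, assuming Bash. The function name is illustrative, and 130 is the conventional exit status for termination by SIGINT (128 + 2).

```bash
#!/bin/bash

# Make ctrl+c end the whole script rather than only the current child process.
# (Hypothetical helper; call it once near the top of the script.)
install_interrupt_trap() {
	# A trap set inside a function still applies to the whole shell.
	trap 'exit 130' INT
}
```

With the trap installed before the loop starts, an interrupt that the child process handles itself still ends the script once that child returns.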

About the flags, getopts should work fine if we don't need POSIX compatibility anyway, but then the directory to be operated on would also have to be passed via an option argument. I suppose that's fine.

Do you really not want any progress output by default? The script can take quite a while to finish.

Edit: Seems like you can specify a required argument without a flag when using getopts by using shift.
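The getopts-plus-shift pattern mentioned in the edit can be sketched as follows, assuming Bash. The `-v` and `-h` flags, the usage text, and the variable names are hypothetical; getopts itself handles only short options.

```bash
#!/bin/bash

# Parse optional flags with getopts, then take one required positional argument.
# Sets the globals `verbose` and `target_dir`.
# (Hypothetical sketch; flag names and usage text are assumptions.)
parse_args() {
	local opt
	local OPTIND=1
	verbose="false"

	# The leading ":" selects silent error reporting, so we print our own message.
	while getopts ":vh" opt; do
		case "${opt}" in
			v) verbose="true" ;;
			h) echo "Usage: sync-ebooks [-v] DIRECTORY"; return 0 ;;
			*) echo "Unknown option: -${OPTARG}" >&2; return 1 ;;
		esac
	done

	# Discard the parsed flags; what remains are the positional arguments.
	shift $((OPTIND - 1))

	if [ $# -ne 1 ]; then
		echo "Expected exactly one directory argument." >&2
		return 1
	fi

	target_dir="$1"
}
```

A script would call `parse_args "$@"` early on, then read `${target_dir}` and `${verbose}`.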

PyroLagus commented 5 years ago

Is this fine? https://github.com/PyroLagus/standardebooks-web/commit/bd0befaaa0f6a0518c70060bd199877b12947dfc