Open andrewheiss opened 1 month ago
In the meantime, based on this comment https://github.com/ropensci/gutenbergr/issues/22#issuecomment-1807167043, the http://mirror.csclub.uwaterloo.ca/gutenberg mirror does have Romeo and Juliet and the other three books with the original IDs. It's the only mirror in the list at https://www.gutenberg.org/MIRRORS.ALL that does.
It's super bizarre that all the other mirrors have stopped working with these books 🤷‍♂️
library(gutenbergr)

tragedy_ids <- c(
  1524, # Hamlet
  1532, # King Lear
  1533, # Macbeth
  1513  # Romeo and Juliet
)

# This works! But only the uwaterloo.ca mirror!?!
tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg"
)
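For anyone who wants to check the mirrors systematically instead of by hand, here is a minimal sketch. It assumes that a failed `gutenberg_download()` either throws an error or returns zero rows, which I have not verified for every failure mode. `first_working_mirror()` and its `download` argument are hypothetical names for illustration, not part of gutenbergr.

```r
# Hypothetical helper (not part of gutenbergr): try each mirror in turn and
# return the first one for which the download yields any rows.
# Assumption: a failed download either errors or returns a zero-row data frame.
first_working_mirror <- function(ids, mirrors,
                                 download = gutenbergr::gutenberg_download) {
  for (mirror in mirrors) {
    result <- tryCatch(
      suppressWarnings(download(ids, mirror = mirror)),
      error = function(e) NULL  # treat an error as "this mirror failed"
    )
    if (!is.null(result) && nrow(result) > 0) {
      return(mirror)
    }
  }
  NA_character_  # no mirror in the list worked
}
```

Called as `first_working_mirror(tragedy_ids, mirrors)`, where `mirrors` is a character vector of base URLs taken from MIRRORS.ALL, it returns the first mirror that served anything, or `NA` if none did. It only checks that some rows came back, not that every ID was found, and each attempt is a real download, so it should be used sparingly.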
I'm seeing the same thing, and I wonder if Project Gutenberg has stopped including the .zip files by default. Since gutenbergr's linked list of robot rules on Project Gutenberg doesn't specify using .zip files—and in fact details how to go about using wget to request different file formats—would it be appropriate to default to .txt or .html instead?
For what it's worth, Project Gutenberg's page detailing policies for robot access looks unchanged since 2016, when gutenbergr was first published. An earlier version used until about 2012 does mention zipped files, along with a rule of waiting 2 seconds between requests, but it still mentions using wget to request other file types.
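To make the suggestion concrete, here is a rough sketch of how candidate plain-text URLs could be built for an ID and then tried in order. The directory layout (every digit of the ID except the last, one per directory level, then the ID itself) and the `-0.txt`, `-8.txt`, and `.txt` suffixes are my assumptions about how the mirrors are organized. `candidate_txt_urls()` is a hypothetical name, not something gutenbergr exports.

```r
# Hypothetical helper: build the plain-text URLs worth trying for one ID.
# Assumed layout: ID 1513 lives under <mirror>/1/5/1/1513/, and single-digit
# IDs live under <mirror>/0/<id>/. The suffixes are also assumptions.
candidate_txt_urls <- function(id, mirror,
                               suffixes = c("-0.txt", "-8.txt", ".txt")) {
  id_chr <- as.character(as.integer(id))  # avoid "1e+05"-style formatting
  digits <- strsplit(id_chr, "")[[1]]
  dirs <- if (length(digits) > 1) digits[-length(digits)] else "0"
  mirror <- sub("/+$", "", mirror)        # drop any trailing slash
  base <- paste(c(mirror, dirs, id_chr), collapse = "/")
  paste0(base, "/", id_chr, suffixes)
}
```

A downloader could walk this vector and stop at the first URL that responds, falling back to `.zip` only if none of the text files exist.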
I ran into this issue too. It appears that the titles I was trying to download did not have .zip files, only .txt files, which fits with @jmclawson's comments. I made a patch that works for my own use case: https://github.com/jrdnbradford/gutenbergr/tree/download-txt-fix.
@andrewheiss and @jmclawson if you want to try it you can install using:
devtools::install_github(
  repo = "jrdnbradford/gutenbergr",
  ref = "e3b804e9be0cd7bc7be87f8d6e5ed9d6d222cb87"
)
I'd be happy to put in a PR if a maintainer can let me know if my current solution works for them. 🚀
@jrdnbradford I haven't had a chance to look at this at all, but we'd love a PR! Do you have specific questions, or just a general "will a PR matter?"
@jonthegeek just wanted to make sure my solution wasn't horribly wrong before putting in a PR. 😄 I went ahead and put in PR https://github.com/ropensci/gutenbergr/pull/57. Happy to make changes if needed, especially if there's some Gutenberg scraping policy I missed.
For some reason, many books that worked fine a few months ago have stopped working with gutenberg_download(), regardless of mirror settings. For example, here are 4 common Shakespearean tragedies:
That's typically a sign that there are issues with the mirror (see #28), so we can specify a different mirror. Every mirror, however, leads to the same error:
Visiting the mirror site in a browser and hunting through the file system shows that the corresponding .zip files don't exist there either:
That page was last edited on June 27, 2023, so I wonder if something changed on Project Gutenberg's end?
Some of these books have alternative IDs (found with gutenberg_works()) that do work, but not all. Romeo and Juliet (1513), for instance, does not, which makes it inaccessible