ropensci / gutenbergr

Search and download public domain texts from Project Gutenberg
https://docs.ropensci.org/gutenbergr

Many common titles cannot be found on any mirror #55

Open andrewheiss opened 1 month ago

andrewheiss commented 1 month ago

For some reason, many books that worked fine a few months ago have stopped working with gutenberg_download(), regardless of mirror settings.

For example, here are 4 common Shakespearean tragedies:

library(gutenbergr)

tragedy_ids <- c(
  1524,  # Hamlet
  1532,  # King Lear
  1533,  # Macbeth
  1513   # Romeo and Juliet
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title"
)
#> Error in `dplyr::mutate()`:
#> ℹ In argument: `gutenberg_id = as.integer(gutenberg_id)`.
#> Caused by error:
#> ! `gutenberg_id` must be size 0 or 1, not 4.
#> Run `rlang::last_trace()` to see where the error occurred.
#> Warning messages:
#> 1: ! Could not download a book at http://aleph.gutenberg.org/1/5/2/1524/1524.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 2: ! Could not download a book at http://aleph.gutenberg.org/1/5/3/1532/1532.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 3: ! Could not download a book at http://aleph.gutenberg.org/1/5/3/1533/1533.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 4: ! Could not download a book at http://aleph.gutenberg.org/1/5/1/1513/1513.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 

That's typically a sign that there are issues with the mirror (see #28), so we can specify a different mirror. Every mirror, however, leads to the same error:

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://mirrors.xmission.com/gutenberg"
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://gutenberg.pglaf.org"
)

tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "https://gutenberg.nabasny.com"
)

#> 1: ! Could not download a book at https://mirrors.xmission.com/gutenberg/1/5/2/1524/1524.zip.
#> 1: ! Could not download a book at https://gutenberg.pglaf.org/1/5/2/1524/1524.zip.
#> 1: ! Could not download a book at https://gutenberg.nabasny.com/1/5/2/1524/1524.zip.

Visiting the mirror site in a browser and hunting through the file system shows that the corresponding .zip files don't exist there either:

[Screenshot: directory listing on the mirror for one of these books]

That page was last edited on June 27, 2023, so I wonder if something changed on Project Gutenberg's end?
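For anyone checking other IDs by hand, the URLs in the warnings above appear to follow a simple pattern: every digit of the ID except the last becomes a nested directory, followed by a directory named after the full ID. A minimal sketch of that pattern, where `mirror_zip_url()` is a hypothetical helper and not part of gutenbergr, and which only assumes IDs with two or more digits:

```r
# Hypothetical helper (not a gutenbergr function): rebuild the .zip URL
# seen in the warnings from a mirror root and a single book ID.
# Assumes the ID has at least two digits; not vectorized.
mirror_zip_url <- function(id, mirror = "http://aleph.gutenberg.org") {
  id_chr <- as.character(as.integer(id))
  # All digits except the last become nested directories: 1524 -> 1/5/2
  leading <- substr(id_chr, 1, nchar(id_chr) - 1)
  dir_path <- paste(strsplit(leading, "")[[1]], collapse = "/")
  paste0(mirror, "/", dir_path, "/", id_chr, "/", id_chr, ".zip")
}

mirror_zip_url(1524)
#> [1] "http://aleph.gutenberg.org/1/5/2/1524/1524.zip"
mirror_zip_url(1513, mirror = "https://gutenberg.pglaf.org")
#> [1] "https://gutenberg.pglaf.org/1/5/1/1513/1513.zip"
```

Pasting the result into a browser is a quick way to confirm whether a given mirror still hosts the file.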

Some of these books have alternative IDs (found with gutenberg_works()) that do work, but not all of them do. Romeo and Juliet (1513), for instance, has no alternative, which makes it inaccessible:

# Hamlet, King Lear, and Macbeth have alternative versions that work:
# 2265 - Hamlet
# 2266 - King Lear
# 2264 - Macbeth

# This works!
some_tragedies <- gutenberg_download(
  c(2265, 2266, 2264),
  meta_fields = "title"
)

# Romeo and Juliet doesn't have an alternative version, so it doesn't work, regardless of the mirror
romeo_juliet <- gutenberg_download(
  1513,
  meta_fields = "title",
  mirror = "https://gutenberg.pglaf.org"
)
#> Warning messages:
#> 1: ! Could not download a book at https://gutenberg.pglaf.org/1/5/1/1513/1513.zip.
#> ℹ The book may have been archived.
#> ℹ Alternatively, You may need to select a different mirror.
#> → See https://www.gutenberg.org/MIRRORS.ALL for options. 
#> 2: Unknown or uninitialised column: `text`. 

andrewheiss commented 1 month ago

In the meantime, based on this comment https://github.com/ropensci/gutenbergr/issues/22#issuecomment-1807167043, the http://mirror.csclub.uwaterloo.ca/gutenberg mirror does have Romeo and Juliet and the other three books with the original IDs. It's the only mirror in the list at https://www.gutenberg.org/MIRRORS.ALL that does.

It's super bizarre that all the other mirrors have stopped working with these books 🤷‍♂️

library(gutenbergr)

tragedy_ids <- c(
  1524,  # Hamlet
  1532,  # King Lear
  1533,  # Macbeth
  1513   # Romeo and Juliet
)

# This works! But only with the uwaterloo.ca mirror!?!
tragedies_raw <- gutenberg_download(
  tragedy_ids,
  meta_fields = "title",
  mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg"
)

jmclawson commented 1 month ago

I'm seeing the same thing, and I wonder if Project Gutenberg has stopped including the .zip files by default. The Project Gutenberg robot rules that gutenbergr links to don't specify using .zip files (in fact, they explain how to use wget to request different file formats), so would it be appropriate to default to .txt or .html instead?

For what it's worth, Project Gutenberg's page detailing policies for robot access looks unchanged since 2016, when gutenbergr was first published. An earlier version used until about 2012 does mention zipped files, along with a rule of waiting 2 seconds between requests, but it still mentions using wget to request other file types.
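To make the idea concrete, here is a rough sketch of what a fallback could look like: build one candidate URL per file extension and take the first one that exists. Everything here is an assumption for illustration. `candidate_urls()` and `first_available()` are hypothetical names rather than gutenbergr functions, `url_exists` is a predicate the caller supplies (in practice it might send an HTTP HEAD request), and the .txt file name is simply guessed from the .zip URLs in the warnings.

```r
# Hypothetical helpers, not gutenbergr's API.

# Build one candidate URL per extension for a single book ID.
# Assumes the ID has at least two digits and that .txt files sit next to
# where the .zip file used to be (an unverified assumption).
candidate_urls <- function(id, mirror, extensions = c(".zip", ".txt")) {
  id_chr <- as.character(as.integer(id))
  leading <- substr(id_chr, 1, nchar(id_chr) - 1)
  dir_path <- paste(strsplit(leading, "")[[1]], collapse = "/")
  base <- paste0(mirror, "/", dir_path, "/", id_chr, "/", id_chr)
  paste0(base, extensions)
}

# Return the first URL for which url_exists() is TRUE, or NA if none is.
first_available <- function(urls, url_exists) {
  for (u in urls) {
    if (isTRUE(url_exists(u))) {
      return(u)
    }
  }
  NA_character_
}

# Stand-in predicate that pretends only .txt files exist on the mirror.
only_txt <- function(u) grepl("\\.txt$", u)

urls <- candidate_urls(1513, mirror = "https://gutenberg.pglaf.org")
first_available(urls, only_txt)
#> [1] "https://gutenberg.pglaf.org/1/5/1/1513/1513.txt"
```

Passing the existence check in as a function keeps the sketch free of network calls, so the ordering logic can be tried without touching any mirror.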

jrdnbradford commented 1 week ago

I ran into this issue too. It appears that the titles I was trying to download did not have .zip files, only .txt files, which fits with @jmclawson's comments. I made a patch that works for my own use case: https://github.com/jrdnbradford/gutenbergr/tree/download-txt-fix.

@andrewheiss and @jmclawson if you want to try it you can install using:

devtools::install_github(
  repo = "jrdnbradford/gutenbergr",
  ref = "e3b804e9be0cd7bc7be87f8d6e5ed9d6d222cb87"
)

I'd be happy to put in a PR if a maintainer can let me know if my current solution works for them. 🚀

jonthegeek commented 1 week ago

@jrdnbradford I haven't had a chance to look at this at all, but we'd love a PR! Do you have specific questions, or just a general "will a PR matter?"

jrdnbradford commented 1 week ago

@jonthegeek just wanted to make sure my solution wasn't horribly wrong before putting in a PR. 😄 I went ahead and put in PR https://github.com/ropensci/gutenbergr/pull/57. Happy to make changes if needed, especially if there's some Gutenberg scraping policy I missed.