hadley / r4ds

R for data science: a book
http://r4ds.hadley.nz

Need help with 23.4.3 please. #1540

Closed zdbmdc closed 10 months ago

zdbmdc commented 1 year ago

Hi, love this edition. I need help with section 23.4.3, "Rewriting the Seattle library data". Here is what I ran:

library(arrow)
library(dplyr)

# Create the data directory on H: (a 1 TB hard drive)
dir.create("H:/data", showWarnings = FALSE)
#-------------
curl::multi_download(
  "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
  "H:/data/seattle-library-checkouts.csv",
  resume = TRUE
)
#---------------
seattle_csv <- open_dataset(
  sources = "H:/data/seattle-library-checkouts.csv", 
  format = "csv"
)
#----------------
seattle_csv |>
  count(CheckoutYear, wt = Checkouts) |> 
  arrange(CheckoutYear) |> 
  collect()
#-----------------
pq_path <- "H:/data/seattle-library-checkouts"

seattle_csv |>
   group_by(CheckoutYear) |> 
   write_dataset(path = pq_path, format = "parquet")
#------------------
Error: Invalid: In CSV column #7: Row #83240: CSV conversion error to null: invalid value '9781504752848'
#-----------------

I have read the code of conduct, but I do not see how you can reply, so I am including my email. I am in the United States. zdbmdc@Yahoo.com

zdbmdc commented 1 year ago

I admit I am one of those irritating newbies. I found the GitHub thread and read rkb965's comment from Mar 22. I tried it and it appears to have written successfully. Just in case, FYI, I have arrow version 12.0.1. Thanks to all of you blazing the trail.

AsdaeSunspark commented 1 year ago

I've had the exact same problem.

In the R4DS2e book, glimpse() returns:

seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> $ UsageClass      <string> "Physical", "Physical", "Digital", "Physical", "Ph…
#> $ CheckoutType    <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#> $ MaterialType    <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#> $ CheckoutYear    <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#> $ CheckoutMonth   <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ Checkouts       <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#> $ Title           <string> "Super rich : a guide to having it all / Russell S…
#> $ ISBN            <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
#> $ Creator         <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#> $ Subjects        <string> "Self realization, Conduct of life, Attitude Psych…
#> $ Publisher       <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…

But on my computer arrow infers ISBN to be null, as in the cell just before that one in the R4DS book:

seattle_csv
#> FileSystemDataset with 1 csv file
#> UsageClass: string
#> CheckoutType: string
#> MaterialType: string
#> CheckoutYear: int64
#> CheckoutMonth: int64
#> Checkouts: int64
#> Title: string
#> ISBN: null
#> Creator: string
#> Subjects: string
#> Publisher: string
#> PublicationYear: string

That's because arrow infers column types from only the first few thousand rows, where ISBN is indeed empty, but string values do appear later in the file. Overriding the ISBN type right after opening seattle_csv, as in the following code, solved the problem. There's probably a cleaner way to do it, and I'd be grateful if anyone explained it to me!

seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv"
)

seattle_csv$schema$ISBN <- string()
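
For anyone who wants to see the inference behaviour without downloading the full file, here is a minimal sketch on a toy CSV. The file and its contents are made up for illustration (they are not from the Seattle data), and it assumes an arrow version where `col_types` accepts a partial schema. A column that is empty in every row it inspects is typed as null, while declaring just that column gives a string.

```r
library(arrow)

# Hypothetical toy file: ISBN is empty in every row, mimicking the
# first rows of the Seattle CSV.
toy_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Title,ISBN,Checkouts",
    "Super rich,,1",
    "Another title,,2"
  ),
  toy_csv
)

# Default behaviour: arrow infers the type from the values it sees.
inferred <- open_dataset(sources = toy_csv, format = "csv")
inferred_type <- inferred$schema$GetFieldByName("ISBN")$type$ToString()

# Declaring only the ISBN column; the other columns are still inferred.
declared <- open_dataset(
  sources = toy_csv,
  format = "csv",
  col_types = schema(ISBN = string())
)
declared_type <- declared$schema$GetFieldByName("ISBN")$type$ToString()

print(c(inferred = inferred_type, declared = declared_type))
```

Declaring the type when the dataset is opened avoids having to patch the schema afterwards.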

Cheers!

mtmorgan commented 1 year ago

I had this problem too. This StackOverflow response suggests defining the schema when the dataset is opened, which generally sounds like a better idea (do not create the bug) than fixing the schema after opening the dataset. So I constructed the schema, taking the opportunity to restrict the size of some integers:

schema <- schema(
    UsageClass = string(),
    CheckoutType = string(),
    MaterialType = string(),
    CheckoutYear = int32(),
    CheckoutMonth = int32(),
    Checkouts = int32(),
    Title = string(),
    ISBN = string(),
    Creator = string(),
    Subjects = string(),
    Publisher = string(),
    PublicationYear = string()
)

Initially I didn't read the StackOverflow response closely enough, and tried:

seattle_csv <- open_dataset(
    sources = "data/seattle-library-checkouts.csv", 
    format = "csv",
    schema = schema
)

This fails, because when the schema is provided, the first line is interpreted as a row of data, rather than a column header.

In lieu of actually reading the StackOverflow response carefully, I looked at the documentation for ?open_dataset and specifically the ... argument. Those arguments are passed to, e.g., read_csv_arrow(), and the help page ?read_csv_arrow clarifies the behavior of specifying schema=. The StackOverflow post works around the behavior of schema= by skipping the first row (skip = 1); I worked around it by specifying col_types=, which seems a bit more R-like:

seattle_csv <- open_dataset(
    sources = "data/seattle-library-checkouts.csv", 
    format = "csv",
    col_types = schema
)
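
For comparison, the skip = 1 workaround from the StackOverflow post can be sketched on a small made-up file. The toy CSV, `toy_schema`, and `toy_ds` are hypothetical names used only for illustration, and the sketch assumes `skip` is forwarded through ... to the CSV reader as the help pages describe.

```r
library(arrow)

# Hypothetical toy file with a header row and two data rows.
toy_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Title,ISBN,Checkouts",
    "Super rich,,1",
    "Another title,9781504752848,2"
  ),
  toy_csv
)

# A full schema supplies both the column names and the column types.
toy_schema <- schema(
  Title = string(),
  ISBN = string(),
  Checkouts = int32()
)

# With schema=, the header line would be read as data, so skip it.
toy_ds <- open_dataset(
  sources = toy_csv,
  format = "csv",
  schema = toy_schema,
  skip = 1
)

toy_ds$schema
toy_ds$num_rows
```

The trade-off is that schema= needs every column spelled out in file order, whereas col_types= takes the names from the header.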

Maybe it would be better to use open_csv_dataset(), which explicitly names the relevant arguments instead of relying on ....
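
A rough sketch of what that might look like, again on a made-up toy file rather than the real data. It assumes an arrow version that provides open_csv_dataset() (I believe 12.0.0 or later); `toy_csv` and `toy_ds` are hypothetical names.

```r
library(arrow)

# Hypothetical toy file: ISBN is empty in the first row only.
toy_csv <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Title,ISBN,Checkouts",
    "Super rich,,1",
    "Another title,9781504752848,2"
  ),
  toy_csv
)

# col_names and col_types are named arguments here, not part of `...`.
toy_ds <- open_csv_dataset(
  sources = toy_csv,
  col_names = TRUE,
  col_types = schema(ISBN = string())
)

toy_ds$schema
```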

Two things that came up, and that are not really issues for r4ds

mine-cetinkaya-rundel commented 10 months ago

I believe this is also related to #1374 and #1533 so closing this in favor of those.