Closed zdbmdc closed 10 months ago
I admit I am one of those irritating newbies. Found the GitHub thread. Read rkb965's comment from Mar 22. Tried it and it appears to have written successfully. Just in case, FYI, I have arrow version 12.0.1. Thanks to all of you blazing the trail.
I've had the exact same problem.
In the R4DS2e book, glimpse() returns:
seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> $ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Ph…
#> $ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#> $ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#> $ Title <string> "Super rich : a guide to having it all / Russell S…
#> $ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
#> $ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#> $ Subjects <string> "Self realization, Conduct of life, Attitude Psych…
#> $ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
But in fact, on my computer arrow infers ISBN to be null, as in the cell just before that one in the R4DS book:
seattle_csv
#> FileSystemDataset with 1 csv file
#> UsageClass: string
#> CheckoutType: string
#> MaterialType: string
#> CheckoutYear: int64
#> CheckoutMonth: int64
#> Checkouts: int64
#> Title: string
#> ISBN: null
#> Creator: string
#> Subjects: string
#> Publisher: string
#> PublicationYear: string
That's because arrow only checks the first few thousand rows, which are indeed null, but string cells do come in later. The following code, run right after loading seattle_csv, solved that problem. There's probably a cleaner way to do it, and I'd be grateful if anyone explained it to me!
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv"
)
seattle_csv$schema$ISBN <- string()
Cheers!
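A possibly cleaner route, sketched here on a toy file rather than the real Seattle data: as far as I can tell, arrow accepts a partial schema in col_types=, so only the misinferred column needs to be named and the rest is still inferred. The file contents and the names csv_path, inferred and fixed are made up for illustration.

```r
library(arrow)

# Hypothetical stand-in for the Seattle file: ISBN is empty in every row,
# so type inference has nothing but missing values to look at.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Title,ISBN", "Book A,", "Book B,"), csv_path)

# Default inference: inspect the schema to see what arrow guessed for ISBN.
inferred <- open_dataset(sources = csv_path, format = "csv")
inferred$schema

# Partial schema: override ISBN only, let arrow infer the other columns.
fixed <- open_dataset(
  sources = csv_path,
  format = "csv",
  col_types = schema(ISBN = string())
)
fixed$schema
```

This keeps the header row handling untouched, which is the main difference from passing a full schema= argument.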
I had this problem too. This StackOverflow response suggests defining the schema when the dataset is opened, which generally sounds like a better idea (do not create the bug) than fixing the schema after opening the dataset. So I constructed the schema, taking the opportunity to restrict the size of some integers:
schema <- schema(
  UsageClass = string(),
  CheckoutType = string(),
  MaterialType = string(),
  CheckoutYear = int32(),
  CheckoutMonth = int32(),
  Checkouts = int32(),
  Title = string(),
  ISBN = string(),
  Creator = string(),
  Subjects = string(),
  Publisher = string(),
  PublicationYear = string()
)
Initially I didn't read the StackOverflow response closely enough, and tried
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv",
  schema = schema
)
This fails, because when the schema is provided, the first line is interpreted as a row of data, rather than a column header.
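One way around this, as far as I understand the readr-style options that open_dataset() passes along, is to keep schema= and skip the header line with skip = 1. A minimal sketch on a toy file; the data and the names csv_path, toy_schema and toy_csv are invented for illustration, and the column set is cut down to three.

```r
library(arrow)

# Hypothetical three-column stand-in for the Seattle file, with a header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Title,ISBN,Checkouts", "Book A,,1", "Book B,,4"), csv_path)

# A full schema supplies both the column names and the column types.
toy_schema <- schema(
  Title = string(),
  ISBN = string(),
  Checkouts = int32()
)

# skip = 1 drops the header line, which would otherwise be read as data
# and fail to parse as int32 in the Checkouts column.
toy_csv <- open_dataset(
  sources = csv_path,
  format = "csv",
  schema = toy_schema,
  skip = 1
)
toy_csv$schema
```

The trade-off is that the names in the schema replace whatever the header says, so a typo in the schema goes unnoticed.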
In lieu of actually reading the StackOverflow response carefully, I looked at the documentation for ?open_dataset and specifically the ... argument. These are passed to, e.g., read_csv_arrow(), and the help page there, ?read_csv_arrow, clarifies the behavior of specifying schema=. The StackOverflow post works around the behavior of schema= by skipping the first row (skip = 1); I worked around it by specifying col_types=, which seems a bit more R-like:
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv",
  col_types = schema
)
Maybe it would be better to use open_csv_dataset(), which explicitly names the relevant arguments instead of relying on the ... argument.
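A rough sketch of what the open_csv_dataset() version might look like, assuming an arrow release that ships open_csv_dataset() and a toy file in place of the Seattle data. The names csv_path, toy_types and toy_csv are hypothetical; col_types is a named argument of the function, so nothing is routed through the dots.

```r
library(arrow)

# Hypothetical stand-in for the Seattle file, header row included.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Title,ISBN,Checkouts", "Book A,,1", "Book B,,4"), csv_path)

# Column types to enforce; the header row still provides the column names.
toy_types <- schema(
  Title = string(),
  ISBN = string(),
  Checkouts = int32()
)

# col_types is spelled out in the function signature here.
toy_csv <- open_csv_dataset(csv_path, col_types = toy_types)
toy_csv$schema
```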
Two things came up that are not really issues for r4ds:
?open_csv_dataset does not go into the subtle interactions of schema / col_names / col_types the way that ?read_csv_arrow does.
The error message reads Error: Invalid: In CSV column #7, but R counts from 1 and under this number system ISBN is the 8th column (an unambiguous error message might mention CSV column 'ISBN').
I believe this is also related to #1374 and #1533 so closing this in favor of those.
Hi, love this edition. 23.4.3 Rewriting the Seattle library data
I have read the code of conduct. I do not see how you can reply, so I am including my email. I am in the United States. zdbmdc@Yahoo.com