Closed mskyttner closed 1 year ago
Hi @mskyttner , thanks for the suggestions.
The na=""
argument seems sensible. However, we cannot do row.names=FALSE
because the row names are sometimes informative (even essential).
What do you think about about exporting row.names (if those exist) as a named column "rowname" (if the dataset is not a tibble, since with those row.names are not recommended and more rarely used)?
I fiddled with this snippet to try it out:
export_csv <- function(x, f, ...) {
if (is.data.frame(x) & !tibble::is_tibble(x))
x <- tibble::as_tibble(tibble::rownames_to_column(x, var = "rowname"))
readr::write_csv(x, file = f, na = "", ...)
}
# export some datasets (first one is a tibble, last one has rownames)
palmerpenguins::penguins |> export_csv("/tmp/pp.csv")
datasets::iris |> export_csv("/tmp/iris.csv")
datasets::mtcars |> export_csv("/tmp/mtcars.csv")
The result in that little attempt looked promising, I think, see below (mtcars got a rowname column and column types are not demoted to varchar):
# import using non-R tool
duckdb_read_csv <- function(f)
system(sprintf("duckdb -c 'from \"%s\" limit 4;'", f))
> duckdb_read_csv("/tmp/pp.csv")
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ species │ island │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex │ year │
│ varchar │ varchar │ double │ double │ int64 │ int64 │ varchar │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │
│ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │
│ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │
│ Adelie │ Torgersen │ │ │ │ │ │ 2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘
> duckdb_read_csv("/tmp/iris.csv")
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length │ Sepal.Width │ Petal.Length │ Petal.Width │ Species │
│ double │ double │ double │ double │ varchar │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
│ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
│ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
│ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
> duckdb_read_csv("/tmp/mtcars.csv")
┌────────────────┬────────┬───────┬────────┬───────┬────────┬────────┬────────┬───────┬───────┬───────┬───────┐
│ rowname │ mpg │ cyl │ disp │ hp │ drat │ wt │ qsec │ vs │ am │ gear │ carb │
│ varchar │ double │ int64 │ double │ int64 │ double │ double │ double │ int64 │ int64 │ int64 │ int64 │
├────────────────┼────────┼───────┼────────┼───────┼────────┼────────┼────────┼───────┼───────┼───────┼───────┤
│ Mazda RX4 │ 21.0 │ 6 │ 160.0 │ 110 │ 3.9 │ 2.62 │ 16.46 │ 0 │ 1 │ 4 │ 4 │
│ Mazda RX4 Wag │ 21.0 │ 6 │ 160.0 │ 110 │ 3.9 │ 2.875 │ 17.02 │ 0 │ 1 │ 4 │ 4 │
│ Datsun 710 │ 22.8 │ 4 │ 108.0 │ 93 │ 3.85 │ 2.32 │ 18.61 │ 1 │ 1 │ 4 │ 1 │
│ Hornet 4 Drive │ 21.4 │ 6 │ 258.0 │ 110 │ 3.08 │ 3.215 │ 19.44 │ 1 │ 0 │ 3 │ 1 │
└────────────────┴────────┴───────┴────────┴───────┴────────┴────────┴────────┴───────┴───────┴───────┴───────┘
Yep, I think it works. With the current website we get:
duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
# ┌──────────┬─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
# │ rownames │ species │ island │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex │ year │
# │ int64 │ varchar │ varchar │ double │ double │ int64 │ int64 │ varchar │ int64 │
# ├──────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
# │ 1 │ Adelie │ Torgersen │ 39.1 │ 18.7 │ 181 │ 3750 │ male │ 2007 │
# │ 2 │ Adelie │ Torgersen │ 39.5 │ 17.4 │ 186 │ 3800 │ female │ 2007 │
# │ 3 │ Adelie │ Torgersen │ 40.3 │ 18.0 │ 195 │ 3250 │ female │ 2007 │
# │ 4 │ Adelie │ Torgersen │ │ │ │ │ │ 2007 │
# │ 5 │ Adelie │ Torgersen │ 36.7 │ 19.3 │ 193 │ 3450 │ female │ 2007 │
# └──────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘
What a nice collection of datasets! Thanks!
I tried loading data using
duckdb
and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.For example:
Compare the data types for columns bill_length_mm bill_depth_mm flipper_length_mm body_mass_g with the R dataset:
Could exporting using
write.csv(x, na = "", row.names = FALSE)
alleviate this issue? For example here.