vincentarelbundock / Rdatasets

A collection of datasets originally distributed in R packages
https://vincentarelbundock.github.io/Rdatasets
Other
325 stars 436 forks source link

Exporting CSV with parameters row.names = FALSE and na = "" to avoid NA strings and get rid of row.names column? #31

Closed mskyttner closed 1 year ago

mskyttner commented 1 year ago

What a nice collection of datasets! Thanks!

I tried loading data using duckdb and noticed that some numerical columns contain "NA" strings in the exported CSV. Those load nicely in R but often not elsewhere. I also saw that the row.names column seems to be included.

For example:

$ duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
┌─────────┬─────────┬───────────┬────────────────┬───┬───────────────────┬─────────────┬─────────┬───────┐
│ column0 │ species │  island   │ bill_length_mm │ … │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│  int64  │ varchar │  varchar  │    varchar     │   │      varchar      │   varchar   │ varchar │ int64 │
├─────────┼─────────┼───────────┼────────────────┼───┼───────────────────┼─────────────┼─────────┼───────┤
│       1 │ Adelie  │ Torgersen │ 39.1           │ … │ 181               │ 3750        │ male    │  2007 │
│       2 │ Adelie  │ Torgersen │ 39.5           │ … │ 186               │ 3800        │ female  │  2007 │
│       3 │ Adelie  │ Torgersen │ 40.3           │ … │ 195               │ 3250        │ female  │  2007 │
│       4 │ Adelie  │ Torgersen │ NA             │ … │ NA                │ NA          │ NA      │  2007 │
│       5 │ Adelie  │ Torgersen │ 36.7           │ … │ 193               │ 3450        │ female  │  2007 │
├─────────┴─────────┴───────────┴────────────────┴───┴───────────────────┴─────────────┴─────────┴───────┤
│ 5 rows                                                                             9 columns (8 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Compare the data types for columns bill_length_mm bill_depth_mm flipper_length_mm body_mass_g with the R dataset:

> palmerpenguins::penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
 4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007

Could exporting using write.csv(x, na = "", row.names = FALSE) alleviate this issue? For example here.

vincentarelbundock commented 1 year ago

Hi @mskyttner , thanks for the suggestions.

The na="" argument seems sensible. However, we cannot do row.names=FALSE because the row names are sometimes informative (even essential).

mskyttner commented 1 year ago

What do you think about about exporting row.names (if those exist) as a named column "rowname" (if the dataset is not a tibble, since with those row.names are not recommended and more rarely used)?

I fiddled with this snippet to try it out:

export_csv <- function(x, f, ...) {

  if (is.data.frame(x) & !tibble::is_tibble(x)) 
    x <- tibble::as_tibble(tibble::rownames_to_column(x, var = "rowname"))

  readr::write_csv(x, file = f, na = "", ...)
}

# export some datasets (first one is a tibble, last one has rownames)
palmerpenguins::penguins |> export_csv("/tmp/pp.csv")
datasets::iris |> export_csv("/tmp/iris.csv")
datasets::mtcars |> export_csv("/tmp/mtcars.csv")

The result in that little attempt looked promising, I think, see below (mtcars got a rowname column and column types are not demoted to varchar):


# import using non-R tool
duckdb_read_csv <- function(f) 
  system(sprintf("duckdb -c 'from \"%s\" limit 4;'", f))

> duckdb_read_csv("/tmp/pp.csv")
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ species │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│ varchar │  varchar  │     double     │    double     │       int64       │    int64    │ varchar │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male    │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female  │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female  │  2007 │
│ Adelie  │ Torgersen │                │               │                   │             │         │  2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘
> duckdb_read_csv("/tmp/iris.csv")
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ Sepal.Length │ Sepal.Width │ Petal.Length │ Petal.Width │ Species │
│    double    │   double    │    double    │   double    │ varchar │
├──────────────┼─────────────┼──────────────┼─────────────┼─────────┤
│          5.1 │         3.5 │          1.4 │         0.2 │ setosa  │
│          4.9 │         3.0 │          1.4 │         0.2 │ setosa  │
│          4.7 │         3.2 │          1.3 │         0.2 │ setosa  │
│          4.6 │         3.1 │          1.5 │         0.2 │ setosa  │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
> duckdb_read_csv("/tmp/mtcars.csv")
┌────────────────┬────────┬───────┬────────┬───────┬────────┬────────┬────────┬───────┬───────┬───────┬───────┐
│    rowname     │  mpg   │  cyl  │  disp  │  hp   │  drat  │   wt   │  qsec  │  vs   │  am   │ gear  │ carb  │
│    varchar     │ double │ int64 │ double │ int64 │ double │ double │ double │ int64 │ int64 │ int64 │ int64 │
├────────────────┼────────┼───────┼────────┼───────┼────────┼────────┼────────┼───────┼───────┼───────┼───────┤
│ Mazda RX4      │   21.0 │     6 │  160.0 │   110 │    3.9 │   2.62 │  16.46 │     0 │     1 │     4 │     4 │
│ Mazda RX4 Wag  │   21.0 │     6 │  160.0 │   110 │    3.9 │  2.875 │  17.02 │     0 │     1 │     4 │     4 │
│ Datsun 710     │   22.8 │     4 │  108.0 │    93 │   3.85 │   2.32 │  18.61 │     1 │     1 │     4 │     1 │
│ Hornet 4 Drive │   21.4 │     6 │  258.0 │   110 │   3.08 │  3.215 │  19.44 │     1 │     0 │     3 │     1 │
└────────────────┴────────┴───────┴────────┴───────┴────────┴────────┴────────┴───────┴───────┴───────┴───────┘
vincentarelbundock commented 1 year ago

Yep, I think it works. With the current website we get:

duckdb -c "from 'https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv' limit 5"
# ┌──────────┬─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
# │ rownames │ species │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │ year  │
# │  int64   │ varchar │  varchar  │     double     │    double     │       int64       │    int64    │ varchar │ int64 │
# ├──────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
# │        1 │ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male    │  2007 │
# │        2 │ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female  │  2007 │
# │        3 │ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female  │  2007 │
# │        4 │ Adelie  │ Torgersen │                │               │                   │             │         │  2007 │
# │        5 │ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female  │  2007 │
# └──────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┘