tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org
Other
999 stars 285 forks source link

Strange behavior when saving/reading data containing dates #743

Closed PMassicotte closed 6 years ago

PMassicotte commented 6 years ago

I am facing a strange behaviour when saving data containing date with write_csv(). The following example shows that the date change after saving the data with write_csv.

library(tidyverse)                                                                                                                       
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.0 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0     
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

df <- read_rds("~/Desktop/d.fit.v.01.rds") %>%                                                                                           
janitor::clean_names() %>%                                                                                                               
select(profile_filename, posixct_date_utc) %>%                                                                                           
as_tibble()                                                                                                                              

attributes(df[[2]])                                                                                                                      
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] ""

Here the first date is set to 2015-04-03 21:26:01

df                                                                                                                                       
#> # A tibble: 420 x 2
#>                    profile_filename    posixct_date_utc
#>                               <chr>              <dttm>
#>  1 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  2 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  3 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  4 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  5 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  6 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  7 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  8 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  9 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#> 10 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#> # ... with 410 more rows

tempfile <- tempfile(fileext = ".csv")                                                                                                   

If I save the data using write_csv and re-read it, we see that the date time has changed. 2015-04-03 21:26:01 -> 2015-04-04 01:26:01

write_csv(df, tempfile)                                                                                                                  
read_csv(tempfile)                                                                                                                       
#> Parsed with column specification:
#> cols(
#>   profile_filename = col_character(),
#>   posixct_date_utc = col_datetime(format = "")
#> )
#> # A tibble: 420 x 2
#>                    profile_filename    posixct_date_utc
#>                               <chr>              <dttm>
#>  1 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  2 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  3 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  4 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  5 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  6 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  7 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  8 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#>  9 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#> 10 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-04 01:26:01
#> # ... with 410 more rows

The problem is avoided when using write.csv instead of write_csv. 2015-04-03 21:26:01 stays 2015-04-03 21:26:01.

write.csv(df, tempfile)                                                                                                                  
read_csv(tempfile)                                                                                                                       
#> Warning: Missing column names filled in: 'X1' [1]
#> Parsed with column specification:
#> cols(
#>   X1 = col_integer(),
#>   profile_filename = col_character(),
#>   posixct_date_utc = col_datetime(format = "")
#> )
#> # A tibble: 420 x 3
#>       X1                 profile_filename    posixct_date_utc
#>    <int>                            <chr>              <dttm>
#>  1     1 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  2     2 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  3     3 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  4     4 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  5     5 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  6     6 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  7     7 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  8     8 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#>  9     9 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#> 10    10 GE2015.ICMP_ICEP_150403_CAST_002 2015-04-03 21:26:01
#> # ... with 410 more rows
jimhester commented 6 years ago

This is likely due to the timezone attribute on the date object and your local timezone setting.

readr always writes datetimes in ISO8601 format with a UTC timezone, so if your datetime object is not actually in UTC it will be converted to that format before writing. Running attributes(df[[2]]) should tell you the timezone of the original data.

However your example is not reproducible, I do not have the original dataset.

PMassicotte commented 6 years ago

@jimhester is there a way to attach binary data to an issue? Meanwhile, I updated my question with your suggestion.

jimhester commented 6 years ago

The tzone attribute of "" means that it uses whatever timezone your R session set to use.

However as I said readr always writes files explicitly with UTC timezone, so your data is being converted to UTC before being written. It is then read as the converted UTC timezone.

See the difference between x and y here. x is using the system timezone (in my case eastern US timezone), y is explicitly set to UTC.

Sys.timezone()
#> [1] "America/New_York"
x <- as.POSIXct("2015-04-03 21:26:01 UTC")
format(x, tz = "UTC")
#> [1] "2015-04-04 01:26:01"
attr(x, "tzone")
#> [1] ""
y <- strptime("2015-04-03 21:26:01", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(y, tz = "UTC")
#> [1] "2015-04-03 21:26:01"
attr(y, "tzone")
#> [1] "UTC"
PMassicotte commented 6 years ago

Thank you @jimhester. Is it not a bit dangerous that this happens silently? Is this documented, I could not find information with a quick Google search.

jimhester commented 6 years ago

In ?write_csv it states

POSIXct's are formatted as ISO8601

Perhaps that should be extended to

POSIXct's are formatted as ISO8601 in UTC timezone

Also if you look at the data being written to the file you will see it looks like '2015-04-04T01:26:01Z', the Z indicates a zero timezone offset, e.g. UTC timezone.

The issue here really is however you got the data into R originally. While the column name claims these are UTC datetimes they are actually local times.

PMassicotte commented 6 years ago

Thank you very much @jimhester for your time. You can close this issue if you think no further action are needed. Regards, Phil

jimhester commented 6 years ago

Thank you for opening the issue, I have added additional text to clarify what happens if you supply write functions with non-UTC datetime objects.

foundinblank commented 6 years ago

this was a helpful discussion, i ran into the same problem just now. thanks!

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/