tidyverse / googlesheets4

Google Spreadsheets R API (reboot of the googlesheets package)
https://googlesheets4.tidyverse.org
Other
360 stars 53 forks source link

datetimes are different after gs4 write/read round trip #273

Closed allenbaron closed 1 year ago

allenbaron commented 2 years ago

Writing datetimes to a Google Sheet with googlesheets4 and then re-reading it with the column formatted as a datetime alters the value. The change is very small and is not always noticeable with R's default print options.

I've included only a single datetime in the reprex because it shows a difference after printing.

llibrary(googledrive)
library(googlesheets4)
library(readr)

# google drive auth here

ss <- googlesheets4::gs4_create("gs4_dt_roundtrip")
#> ✔ Creating new Sheet: "gs4_dt_roundtrip".

df <- data.frame(
  dt = structure(1660163842, tzone = "UTC", class = c("POSIXct","POSIXt"))
)
df
#>                    dt
#> 1 2022-08-10 20:37:22

googlesheets4::sheet_write(df, ss, sheet = 1)
#> ✔ Writing to "gs4_dt_roundtrip".
#> ✔ Writing to sheet 'Sheet1'.

# Re-read as datetime = DIFFERENT VALUE -----------------------------------
df2 <- googlesheets4::read_sheet(ss, sheet = 1, col_types = "T")
#> ✔ Reading from "gs4_dt_roundtrip".
#> ✔ Range ''Sheet1''.

df2
#> # A tibble: 1 × 1
#>   dt                 
#>   <dttm>             
#> 1 2022-08-10 20:37:21

df$dt - df2$dt
#> Time difference of 4.768372e-07 secs
print(as.numeric(df2$dt), digits = 20)
#> [1] 1660163841.9999995232

# Re-read as character = SAME VALUE ---------------------------------------
df_chr <- googlesheets4::read_sheet(ss, sheet = 1, col_types = "c")
#> ✔ Reading from "gs4_dt_roundtrip".
#> ✔ Range ''Sheet1''.

df_chr$dt <- readr::parse_guess(df_chr$dt)
df_chr
#> # A tibble: 1 × 1
#>   dt                 
#>   <dttm>             
#> 1 2022-08-10 20:37:22

df$dt - df_chr$dt
#> Time difference of 0 secs
print(as.numeric(df_chr$dt), digits = 20)
#> [1] 1660163842

googledrive::drive_trash(ss)
#> File trashed:
#> • 'gs4_dt_roundtrip' <id: 1bJX_GYI1eAx11tC1TgKlfOfpSaTFywPhQqcafKB8MG4>
Created on 2022-10-24 with [reprex v2.0.2](https://reprex.tidyverse.org/)
jennybc commented 1 year ago

I'm doing issue triage prior to a release that's under deadline.

But from a quick read, I think this may be a combination of R FAQ 7.31 (https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f) with a light sprinkling of (insert something about POSIXct formatting here). I feel like there's something I can point to about POSIXct formatting and hopefully I'll come back to add it.

This is likely a #wontfix.

jennybc commented 1 year ago

Coming back just to park a small example from @DavisVaughan about some of the joys of POSIXct. Again, I'm not really digging it to this just yet, but want to leave myself some notes.

> options(digits.secs = 20)
> Sys.time()
[1] "2023-03-23 15:47:08.864708 EDT"
> unclass(Sys.time())
[1] 1679600841
> options(digits = 20)
> unclass(Sys.time())
[1] 1679600848.7778339386
DavisVaughan commented 1 year ago

A little investigation of my own.

The date-time we used above is:

1660163842

converting that to google sheet form using the algorithm in as_CellData.POSIXct gives us:

# (1660163842 / 86400) + 25569
44783.859282407407591

now, deep in the stack we eventually call httr::POST() to send the data over the wire to google sheets. This ends up calling jsonlite::toJSON(body, auto_unbox = TRUE, digits = 22) which gives us:

44783.8592824074

So we lost some precision. And indeed if you go to the google sheet that was created and convert the cell to Number format you'll see 44783.8592824074 in there.

If you reverse the algorithm with this you get:

# (44783.8592824074 - 25569) * 86400
1660163841.9999992847

# displays as "2022-08-10 20:37:21 UTC"

It is also worth noting that googlesheets4 doesn't "reverse the algorithm" in exactly this way, but i think it should. Instead inside googlesheets4:::as_datetime() it ends up doing the multiplication first, then the subtraction, like:

# (44783.8592824074 * 86400) - (25569 * 86400) 
1660163841.9999995232

Note that doing the math that way gave a slightly different answer than above, and I think it is slightly less correct, but this does reproduce the time difference of 4.768372e-07 reported in the original comment above.

DavisVaughan commented 1 year ago

And that jsonlite failure seems to come from an internal helper called num_to_char() which is a C algorithm that seems to differ from format()

Browse[14]> jsonlite:::num_to_char(44783.859282407407591, digits = 22)
[1] "44783.8592824074"
Browse[14]> format(44783.859282407407591, digits = 22)
[1] "44783.85928240740759065"
DavisVaughan commented 1 year ago

It seems to limit digits to 17 at some point https://github.com/jeroen/jsonlite/blob/467923567ce2c2d535714cc3908a6912000a29ae/src/num_to_char.c#L65

@jeroen, any idea why that is?

DavisVaughan commented 1 year ago

Ha oh shoot if you just update to dev jsonlite then the jsonlite part of this is actually fixed:

> unclass(df$dt)/86400 + 25569
[1] 44783.85928240740759065
attr(,"tzone")
[1] "UTC"
> jsonlite::fromJSON(jsonlite::toJSON(unclass(df$dt)/86400 + 25569, digits = 22))
[1] 44783.85928240740759065

Fixed by https://github.com/jeroen/jsonlite/pull/413, which raised the cap from 15 to 17 digits (and 17 actually works)

It was argued for here https://github.com/jeroen/jsonlite/issues/350, since IEEE754 doubles converted to string with 17 significant digits must be able to roundtrip themselves

jennybc commented 1 year ago

I did a little study of this and, at least with this example, the jsonlite fix is the only thing that touches it. Changing the order of operations in googlesheets4 when converting from a spreadsheet date time to POSIXct didn't help (in fact the time difference got worse). So I think answer here is that users who need this level of precision right now should install dev jsonlite. And presumably a released version will appear in due course.

@jeroen Can you note this down as a reason in favor of a jsonlite release? TL;DR the 17 significant digits are necessary for roundtripping datetimes between R and Google Sheets.

jeroen commented 1 year ago

OK I'll do that today.

jeroen commented 1 year ago

Where exactly does jsonlite get invoked in this stack? We might have to backpaddle a little bit on the default in jsonlite because of https://github.com/jeroen/jsonlite/issues/420, but if you explicitly set digits = 17 then you should be fine.

jennybc commented 1 year ago

I can't re-trace this right this moment, but I have no reason to doubt @DavisVaughan's research described above in https://github.com/tidyverse/googlesheets4/issues/273#issuecomment-1481888454.

now, deep in the stack we eventually call httr::POST() to send the data over the wire to google sheets. This ends up calling jsonlite::toJSON(body, auto_unbox = TRUE, digits = 22)

Also based on my memory, I think the jsonlite usage, on the "write" side, is likely to all be going through httr.