tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
621 stars 60 forks source link

leading zeroes in writing years before 1000 #462

Open keesdeschepper opened 2 years ago

keesdeschepper commented 2 years ago

When outputting a date with a year before 1000, there may be a difference in OS in whether there are leading zeroes in the string representation.

library(tidyverse)
as.Date("0020-02-13")
format_csv(tibble(x = as.Date("0020-02-13")))

My Windows

# [1] "0020-02-13"
# [1] "x\n0020-02-13\n"

My Linux

# [1] "20-02-13"
# [1] "x\n20-02-13\n"

I understand that it is hard to account for every OS-specific quirk, but readr itself does not by default guess that "20-02-13" is a date.

Would something like this (probably a Base R variant instead) work?

output_column.Date <- function(x, name){
  str_pad(as.character(x), 10, pad = "0")
}
keesdeschepper commented 2 years ago

And perhaps some whitespace trimming:

"Everyone agrees that years from 1000 to 9999 should be printed with 4 digits, but the standards do not define what is to be done outside that range. For years 0 to 999 most OSes pad with zeros or spaces to 4 characters, and Linux outputs just the number." https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

jennybc commented 2 years ago

I'm going to transfer this to vroom, which is where it would need to be implemented, since readr 2e (readr "second edition") actually calls / is vroom and we're trying to regard readr 1e as practically frozen.

That being said, though, it may also get closed before long in vroom as something we have no near-term plans to address. I think this is rare enough and easy enough for a user to handle that we will probably leave it that way. The user can pre-transform such a column to character and do the left padding at that time.

To my mind, this is a special case of more general output formatters (e.g. formatting something as currency), which we could imagine doing, but we that we don't currently do.

jennybc commented 2 years ago

And just to make sure there's no room for confusion:

readr itself does not by default guess that "20-02-13" is a date

This particular issue has nothing to do with guessing. You're working with a column that is stored as a Date (or, more generally, it could be a datetime) and this about how vroom formats a known-to-be-datetime column at write time.

keesdeschepper commented 2 years ago

Thanks @jennybc! I see I created some confusion -- sorry about that. Let me try to clear it up.

For me the most problematic part about this issue is that you cannot do round-trips, so use vroom_write on a df creating a csv file and then use vroom on that csv file resulting in more or less the same df. When reading a file, vroom does not accept "20-02-13" as a Date, so a column containing this value will be seen as a character column. This is the desired behavior, which shifts the focus to preventing dates being output to a csv file as "20-02-12". I guess it is not strictly a bug that vroom_write outputs the date "0020-02-13" as "20-02-13" on Linux, but to me it is in the spirit of vroom that round-trips should be possible.

I totally agree that this is rare. In my case it was some user mistyping "2020" as "0020" and I would bet that in actual practice most years before 1000 are typing errors. Yet typing errors occur and some result in valid dates. The most important thing is that they do not break the code. For me it was half a day finding out what was wrong (since the behavior differed between OS'es -- but maybe I'm just a bad debugger).

In my view there is a difference between this formatting and currency formatting. I wouldn't know if the default representation of currency is "9.99" or "$9.99" and I don't expect default smart processing of "$9.99" (which vroom doesn't). I do expect default smart processing of date columns, so I'd like output formatting that anticipates this.

If you agree that this should be handled then I can take care of the commit etc. There is an output_column.POSIXt:

function(x) {
    format(x, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC",  justify = "none")
}

so an output_column.Date does not seem crazy.

DavisVaughan commented 2 years ago

This is documented in ?strptime under the section Printing years

For years 0 to 999 most OSes pad with zeros or spaces to 4 characters, and Linux outputs just the number.

This section may be practically useful

Some platforms support modifiers from POSIX 2008 (and others). On Linux the format "%04Y" assures a minimum of four characters and zero-padding. The internal code (as used on Windows and by default on macOS) uses zero-padding by default, and formats %_4Y and %_Y can be used for space padding and no padding.

On my Mac I can indeed use these alternative formats

format(as.Date("0020-01-05"), "%Y-%m-%d")
#> [1] "0020-01-05"

format(as.Date("0020-01-05"), "%_4Y-%m-%d")
#> [1] "  20-01-05"

format(as.Date("0020-01-05"), "%_Y-%m-%d")
#> [1] "20-01-05"

So maybe if you detect that you are on Linux you could use %04Y instead of %Y.

I do think that output_column.Date would need to be defined for this, and then output_column.POSIXt should probably be updated to do this as well. I'm imagining something like this:

output_column.Date <- function(x) {
  if (is_linux()) {
    # To guarantee a minimum of 4 digits in the year
    format <- "%04Y-%m-%d"
  } else {
    format <- "%Y-%m-%d"
  }

  format(x, format = format)
}

See also, a python issue about this https://stackoverflow.com/questions/43630259/parse-dates-from-before-the-year-1000

jennybc commented 2 years ago

%04Y seems to be harmless on macOS, with the n = 1 of "works on my machine":

format(as.Date("0020-01-05"), "%04Y-%m-%d")
#> [1] "0020-01-05"

So is there an argument for using it unconditionally in output_column.Date()?

I do find these notions very compelling in terms of doing something here:

DavisVaughan commented 2 years ago

I can't speak to the harmlessness of using %04Y everywhere. It feels to me like that should have errored on a Mac, even though it doesn't (it "works" for me too)

There are some python reports of it erroring on windows https://github.com/singer-io/singer-python/issues/86

jennybc commented 2 years ago

Maybe putting the format() call inside a tryCatch() is the way to go. I am on board with the basic idea of trying to force a 4-character year.

keesdeschepper commented 2 years ago

Great! Shall I take care of the pull request or do you guys prefer to do it?

jennybc commented 2 years ago

We are working towards a release, so this sort of modest change fits into our current work. We'll do it.

DavisVaughan commented 1 year ago

I'm going to use the "04Y on Linux" approach here https://github.com/DavisVaughan/almanac/pull/84/commits/90ce2a197cd5a18002c2b4c6835443118e98e228