Open keesdeschepper opened 2 years ago
And perhaps some whitespace trimming:
"Everyone agrees that years from 1000 to 9999 should be printed with 4 digits, but the standards do not define what is to be done outside that range. For years 0 to 999 most OSes pad with zeros or spaces to 4 characters, and Linux outputs just the number." https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html
I'm going to transfer this to vroom, which is where it would need to be implemented, since readr 2e (readr "second edition") actually calls / is vroom and we're trying to regard readr 1e as practically frozen.
That being said, though, it may also get closed before long in vroom as something we have no near-term plans to address. I think this is rare enough and easy enough for a user to handle that we will probably leave it that way. The user can pre-transform such a column to character and do the left padding at that time.
To my mind, this is a special case of more general output formatters (e.g. formatting something as currency), which we could imagine doing, but we that we don't currently do.
And just to make sure there's no room for confusion:
readr itself does not by default guess that "20-02-13" is a date
This particular issue has nothing to do with guessing. You're working with a column that is stored as a Date
(or, more generally, it could be a datetime) and this about how vroom formats a known-to-be-datetime column at write time.
Thanks @jennybc! I see I created some confusion -- sorry about that. Let me try to clear it up.
For me the most problematic part about this issue is that you cannot do round-trips, so use vroom_write
on a df creating a csv file and then use vroom
on that csv file resulting in more or less the same df. When reading a file, vroom
does not accept "20-02-13" as a Date, so a column containing this value will be seen as a character column. This is the desired behavior, which shifts the focus to preventing dates being output to a csv file as "20-02-12". I guess it is not strictly a bug that vroom_write
outputs the date "0020-02-13" as "20-02-13" on Linux, but to me it is in the spirit of vroom
that round-trips should be possible.
I totally agree that this is rare. In my case it was some user mistyping "2020" as "0020" and I would bet that in actual practice most years before 1000 are typing errors. Yet typing errors occur and some result in valid dates. The most important thing is that they do not break the code. For me it was half a day finding out what was wrong (since the behavior differed between OS'es -- but maybe I'm just a bad debugger).
In my view there is a difference between this formatting and currency formatting. I wouldn't know if the default representation of currency is "9.99" or "$9.99" and I don't expect default smart processing of "$9.99" (which vroom
doesn't). I do expect default smart processing of date columns, so I'd like output formatting that anticipates this.
If you agree that this should be handled then I can take care of the commit etc. There is an output_column.POSIXt
:
function(x) {
format(x, "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC", justify = "none")
}
so an output_column.Date
does not seem crazy.
This is documented in ?strptime
under the section Printing years
For years 0 to 999 most OSes pad with zeros or spaces to 4 characters, and Linux outputs just the number.
This section may be practically useful
Some platforms support modifiers from POSIX 2008 (and others). On Linux the format "%04Y" assures a minimum of four characters and zero-padding. The internal code (as used on Windows and by default on macOS) uses zero-padding by default, and formats %_4Y and %_Y can be used for space padding and no padding.
On my Mac I can indeed use these alternative formats
format(as.Date("0020-01-05"), "%Y-%m-%d")
#> [1] "0020-01-05"
format(as.Date("0020-01-05"), "%_4Y-%m-%d")
#> [1] " 20-01-05"
format(as.Date("0020-01-05"), "%_Y-%m-%d")
#> [1] "20-01-05"
So maybe if you detect that you are on Linux you could use %04Y
instead of %Y
.
I do think that output_column.Date
would need to be defined for this, and then output_column.POSIXt
should probably be updated to do this as well. I'm imagining something like this:
output_column.Date <- function(x) {
if (is_linux()) {
# To guarantee a minimum of 4 digits in the year
format <- "%04Y-%m-%d"
} else {
format <- "%Y-%m-%d"
}
format(x, format = format)
}
See also, a python issue about this https://stackoverflow.com/questions/43630259/parse-dates-from-before-the-year-1000
%04Y
seems to be harmless on macOS, with the n = 1
of "works on my machine":
format(as.Date("0020-01-05"), "%04Y-%m-%d")
#> [1] "0020-01-05"
So is there an argument for using it unconditionally in output_column.Date()
?
I do find these notions very compelling in terms of doing something here:
Date
column, it feels like you should be able to write it, then read it back inDate
would be independent of the OSI can't speak to the harmlessness of using %04Y
everywhere. It feels to me like that should have errored on a Mac, even though it doesn't (it "works" for me too)
There are some python reports of it erroring on windows https://github.com/singer-io/singer-python/issues/86
Maybe putting the format()
call inside a tryCatch()
is the way to go. I am on board with the basic idea of trying to force a 4-character year.
Great! Shall I take care of the pull request or do you guys prefer to do it?
We are working towards a release, so this sort of modest change fits into our current work. We'll do it.
I'm going to use the "04Y on Linux" approach here https://github.com/DavisVaughan/almanac/pull/84/commits/90ce2a197cd5a18002c2b4c6835443118e98e228
When outputting a date with a year before 1000, there may be a difference in OS in whether there are leading zeroes in the string representation.
My Windows
My Linux
I understand that it is hard to account for every OS-specific quirk, but
readr
itself does not by default guess that "20-02-13" is a date.Would something like this (probably a Base R variant instead) work?