tidyverse / lubridate

Make working with dates in R just that little bit easier
https://lubridate.tidyverse.org
GNU General Public License v3.0
728 stars 207 forks source link

Parsing issues when combining as_date and across with multiple date formats #1042

Closed ad1729 closed 1 year ago

ad1729 commented 2 years ago

Hi,

I have a data frame with multiple date columns (encoded as characters) where the date format can differ between columns (e.g. col1 has the following format YYYY-MM-DD and col2 is coded as DD/MM/YYYY. What I'd like to do is supply multiple formats while converting the char columns to date columns.

Using lubridate::as_date(..., format = c(...)), lubridate version is 1.8.0

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ lubridate::as_date(.x, format = c("%Y-%m-%d", "%d/%m/%Y"))
            )
        )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 NA        
#> 2 NA         2020-11-03

packageVersion("lubridate")
#> [1] '1.8.0'

packageVersion("dplyr")
#> [1] ‘1.0.8’

Not sure why this fails to convert the first element in date2 from 03/10/2020 to 2020-10-03.

On the other hand, using base R's as.Date(..., tryFormats = c(...)) works as expected / desired.

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ as.Date(.x, tryFormats = c("%Y-%m-%d", "%d/%m/%Y"))
            )
        )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 2020-10-03
#> 2 NA         2020-11-03

Additionally, using as_date(parse_date_time(..., orders = c(...))) gives the right result too on this example.

tibble(
    date1 = c("2020-10-02", NA_character_), 
    date2 = c("03/10/2020", "03/11/2020")
    ) %>% 
    mutate(
        across(
            .cols = contains("date"), 
            .fns = ~ lubridate::as_date(lubridate::parse_date_time(.x, orders = c("%Y-%m-%d", "%d/%m/%Y")))
        )
    )
#> # A tibble: 2 × 2
#>   date1      date2     
#>   <date>     <date>    
#> 1 2020-10-02 2020-10-03
#> 2 NA         2020-11-03

What is unexpected / surprising is that as_date(...) accepts multiple formats but generates NAs where there should be data. As far as I can tell, this only seems to be happening when combining as_date and across. If there are multiple formats in one single column then as_date works as expected

tibble(
    date1 = c("2020-10-02", "03/10/2020", NA_character_)
    ) %>% 
    mutate(
        date1_1 = lubridate::as_date(date1, format = "%Y-%m-%d"), 
        date1_2 = lubridate::as_date(date1, format = c("%Y-%m-%d", "%d/%m/%Y")))
#> # A tibble: 3 × 3
#>   date1      date1_1    date1_2   
#>   <chr>      <date>     <date>    
#> 1 2020-10-02 2020-10-02 2020-10-02
#> 2 03/10/2020 NA         2020-10-03
#> 3 <NA>       NA         NA
vspinu commented 1 year ago

I am also puzzled by this behavior. Internally in as_date we were using strptime which does seem to support multiple formats, though it's not documented, but then it breaks inside dplyr across. So I have moved away from strptime to parse_date_time(exact = TRUE) to have a more predictable behavior and explicitly allow multiple formats in format argument.

ad1729 commented 1 year ago

Thanks for looking into this and fixing it!

canadice commented 1 year ago

Did you return to strptime in version 1.9.1?

I had to change the following code from prior to 1.9.0 as_date("Jan 16 2023", format = "%b %d %Y") to as_date("Jan 16 2023", format = "bdY") to get the same output in 1.9.0 but now in 1.9.1 format = "bdY" produces NA and format = "%b %d %Y" produces the correct output once more.

vspinu commented 1 year ago

@canadice No, the code in 1.9.0 was a regression. We didn't return to strtime, but parse with exact = TRUE to comply with as.Date. Also as_datetime and as_date now behave similarly. For instance multiple formats can be passed.

Change introduced in 3aa948d59103a3d5cf882cc725601bc5a13de797

canadice commented 1 year ago

@canadice No, the code in 1.9.0 was a regression. We didn't return to strtime, but parse with exact = TRUE to comply with as.Date. Also as_datetime and as_date now behave similarly. For instance multiple formats can be passed.

Change introduced in 3aa948d

Right, and the changes I made when updating to 1.9.0 seemed to work but now with 1.9.1 they have reversed.

To clarify: <1.9.0 lubridate as_date("Jan 16 2023", format = "%b %d %Y") produces no errors

1.9.0 lubridate as_date("Jan 16 2023", format = "%b %d %Y") produces NA as_date("Jan 16 2023", format = "bdY") produces no errors so changed to this way of writing format.

1.9.1 lubridate as_date("Jan 16 2023", format = "%b %d %Y") produces no errors as_date("Jan 16 2023", format = "bdY") produces NA

When troubleshooting the NAs in 1.9.0 I detailed it to the parse_date_time(exact = TRUE) and made appropriate changes, but now those changes does not work and the "old" format can be used again. Is this an intended feature with the new version (using "%b %d %Y") or something that will be reverted again, i.e. any changes I do will have to be redone again to the 1.9.0 solutions I had?

vspinu commented 1 year ago

yes, the exact = TRUE means that you need to specify the exact format (like in strptime) and not lubridate lax format (aka orders). So this particular behavior now is as in <1.9.0.

When troubleshooting the NAs in 1.9.0 I detailed it to the parse_date_time(exact = TRUE) and made appropriate changes, but now those changes does not work

If you used parse_date_time(exact = TRUE) you should not be affected by any of this. This change affects only as_date.