tidyverse / lubridate

Make working with dates in R just that little bit easier
https://lubridate.tidyverse.org
GNU General Public License v3.0
724 stars 207 forks source link

Inconsistent behavior of `parse_date_time()` inside `dplyr::mutate()` #1149

Open matiasandina opened 7 months ago

matiasandina commented 7 months ago

Let's say a string is contaminated with a letter (e.g., q2/26/2023 18:35:46). This should return NA, and it does when calling parse_date_time() directly (see bottom of reprex), but it's not behaving in that way when being used from inside dplyr::mutate()

# lubridate is failing to parse here
library(tidyverse)
toy <- structure(list(`MM:DD:YYYY hh:mm:ss` = c("12/26/2023 18:35:28", 
                                                "q2/26/2023 18:35:46", "12/26/2023 18:35:59", "12/26/2023 18:36:13", 
                                                "12/26/2023 18:36:30")), row.names = c(NA, -5L), 
                 spec = structure(list(
                   cols = list(`MM:DD:YYYY hh:mm:ss` = structure(list(), 
                                                                 class = c("collector_character", "collector"))), 
                   default = structure(list(), class = c("collector_guess", "collector")), delim = ","), 
                   class = "col_spec"),  
                 class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
toy
#> # A tibble: 5 × 1
#>   `MM:DD:YYYY hh:mm:ss`
#>   <chr>                
#> 1 12/26/2023 18:35:28  
#> 2 q2/26/2023 18:35:46  ## THIS SHOULD FAIL TO PARSE
#> 3 12/26/2023 18:35:59  
#> 4 12/26/2023 18:36:13  
#> 5 12/26/2023 18:36:30

Now let's try to parse into datetime using lubridate and clock.

toy %>% 
  dplyr::mutate(lubridate= lubridate::parse_date_time(`MM:DD:YYYY hh:mm:ss`, 
                                                      "mdy HMS"), 
                clock = clock::date_time_parse(`MM:DD:YYYY hh:mm:ss`, 
                                               "UTC",
                                               format = "%m/%d/%Y %H:%M:%S"))
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `clock = clock::date_time_parse(`MM:DD:YYYY hh:mm:ss`, "UTC",
#>   format = "%m/%d/%Y %H:%M:%S")`.
#> Caused by warning:
#> ! Failed to parse 1 string at location 2. Returning `NA` at that location.
#> # A tibble: 5 × 3
#>   `MM:DD:YYYY hh:mm:ss` lubridate           clock              
#>   <chr>                 <dttm>              <dttm>             
#> 1 12/26/2023 18:35:28   2023-12-26 18:35:28 2023-12-26 18:35:28
#> 2 q2/26/2023 18:35:46   2023-02-26 18:35:46 NA                 
#> 3 12/26/2023 18:35:59   2023-12-26 18:35:59 2023-12-26 18:35:59
#> 4 12/26/2023 18:36:13   2023-12-26 18:36:13 2023-12-26 18:36:13
#> 5 12/26/2023 18:36:30   2023-12-26 18:36:30 2023-12-26 18:36:30

When calling it directly, lubridate returns NA as expected.

# But this fails to parse, proper behavior of lubridate
lubridate::parse_date_time("q2/26/2023 18:35:46", "mdy HMS")
#> Warning: All formats failed to parse. No formats found.
#> [1] NA

I think this might go beyond dplyr:mutate(). Using a vectorized approach has issues too.

parse_date_time(toy$`MM:DD:YYYY hh:mm:ss`, "mdY HMS")
[1] "2023-12-26 18:35:28 UTC" "2023-02-26 18:35:46 UTC" "2023-12-26 18:35:59 UTC" "2023-12-26 18:36:13 UTC"
[5] "2023-12-26 18:36:30 UTC"
> packageVersion("lubridate")
[1] ‘1.9.2’

Created on 2023-12-27 with reprex v2.0.2

matiasandina commented 4 months ago

@rion-saeon Sorry if I wasn't clear. I was trying to make this issue being as much on point as possible. I will give more context to help clarify and try to address your points:

I believe the core of the issue still holds: parse_date_time() should either return NA both when called directly or inside a mutate. The behavior that I saw with incorrect conversion to a valid but incorrect datetime might potentially affect other people.

Other people have reported weird behavior in similar issues. See here and here.

rion-saeon commented 4 months ago

@matiasandina OK cool and actually, thanks for pointing this potential bug, out. I now understand that this could be a bug and needs to be looked at because, I am also running parse_date_time inside of mutate but there shouldn't be corruption of my data. Instead, I specify different timestamp formats that come in which, similarly to you, I have little control over (data downloaded onto different computers cause annoying timestamp discrepancies among datasets).

I also agree that NA should be the best result as one may end up incorporating more errors or only resolve some, when wanting to rectify things using elaborate workarounds.

However, if one starts to explore your corrupted data there would be a pattern you should be able to get around to using if_else or case_when functions, regex and stringr package.

In any case, it seems like it will be wise to move over to clock to be safe and, the package seems to be more advanced. Pity it's not part of the tidyverse as I am a bit of a tidy' snob.

Finally, I am thus, suprised and somewhat disappointed that this has not been looked at since your OP!

PS I deleted my post as it's irrelevant at this point.