Inconsistent behavior of `parse_date_time()` inside `dplyr::mutate()`

matiasandina commented 7 months ago

Let's say a string is contaminated with a letter (e.g., q2/26/2023 18:35:46). This should return NA, and it does when calling parse_date_time() directly (see bottom of reprex), but it's not behaving in that way when being used from inside dplyr::mutate()

# lubridate is failing to parse here
library(tidyverse)
toy <- structure(list(`MM:DD:YYYY hh:mm:ss` = c("12/26/2023 18:35:28", 
                                                "q2/26/2023 18:35:46", "12/26/2023 18:35:59", "12/26/2023 18:36:13", 
                                                "12/26/2023 18:36:30")), row.names = c(NA, -5L), 
                 spec = structure(list(
                   cols = list(`MM:DD:YYYY hh:mm:ss` = structure(list(), 
                                                                 class = c("collector_character", "collector"))), 
                   default = structure(list(), class = c("collector_guess", "collector")), delim = ","), 
                   class = "col_spec"),  
                 class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
toy
#> # A tibble: 5 × 1
#>   `MM:DD:YYYY hh:mm:ss`
#>   <chr>                
#> 1 12/26/2023 18:35:28  
#> 2 q2/26/2023 18:35:46  ## THIS SHOULD FAIL TO PARSE
#> 3 12/26/2023 18:35:59  
#> 4 12/26/2023 18:36:13  
#> 5 12/26/2023 18:36:30

Now let's try to parse into datetime using lubridate and clock.

toy %>% 
  dplyr::mutate(lubridate= lubridate::parse_date_time(`MM:DD:YYYY hh:mm:ss`, 
                                                      "mdy HMS"), 
                clock = clock::date_time_parse(`MM:DD:YYYY hh:mm:ss`, 
                                               "UTC",
                                               format = "%m/%d/%Y %H:%M:%S"))
#> Warning: There was 1 warning in `dplyr::mutate()`.
#> ℹ In argument: `clock = clock::date_time_parse(`MM:DD:YYYY hh:mm:ss`, "UTC",
#>   format = "%m/%d/%Y %H:%M:%S")`.
#> Caused by warning:
#> ! Failed to parse 1 string at location 2. Returning `NA` at that location.
#> # A tibble: 5 × 3
#>   `MM:DD:YYYY hh:mm:ss` lubridate           clock              
#>   <chr>                 <dttm>              <dttm>             
#> 1 12/26/2023 18:35:28   2023-12-26 18:35:28 2023-12-26 18:35:28
#> 2 q2/26/2023 18:35:46   2023-02-26 18:35:46 NA                 
#> 3 12/26/2023 18:35:59   2023-12-26 18:35:59 2023-12-26 18:35:59
#> 4 12/26/2023 18:36:13   2023-12-26 18:36:13 2023-12-26 18:36:13
#> 5 12/26/2023 18:36:30   2023-12-26 18:36:30 2023-12-26 18:36:30

When calling it directly, lubridate returns NA as expected.

# But this fails to parse, proper behavior of lubridate
lubridate::parse_date_time("q2/26/2023 18:35:46", "mdy HMS")
#> Warning: All formats failed to parse. No formats found.
#> [1] NA

I think this might go beyond dplyr:mutate(). Using a vectorized approach has issues too.

parse_date_time(toy$`MM:DD:YYYY hh:mm:ss`, "mdY HMS")
[1] "2023-12-26 18:35:28 UTC" "2023-02-26 18:35:46 UTC" "2023-12-26 18:35:59 UTC" "2023-12-26 18:36:13 UTC"
[5] "2023-12-26 18:36:30 UTC"

> packageVersion("lubridate")
[1] ‘1.9.2’

^{Created on 2023-12-27 with reprex v2.0.2}

matiasandina commented 4 months ago

@rion-saeon Sorry if I wasn't clear. I was trying to make this issue being as much on point as possible. I will give more context to help clarify and try to address your points:

The data in toy is a sample of a column from a dataset collected from microcontroller-based hardware that saves it at CSV.
Data contamination occurs due to noisy electrical transmission lines or other buggy software bit I might have little control over. Therefore, it is potentially challenging to know how to "fix" the random contamination using str_replace and maybe even scientifically wrong to do so given the uncertainty of the actual timestamp (i.e., in this case the month needed to be fixed, which has lesser uncertainty but we don't get to choose how electrical noise corrupts the data). Instead, I would prefer to have it be NA as the coercion normally fails under parse_date_time(). We can discuss about filling NAs, but that's beyond the point of reporting the bug concisely here.
I have not written the library these devices use and have little to no control over the choices made. Hence, as much as I wanted to name the data however I pleased, I have to work with what the device gives me.
Since I am the developer of an R package that munges data from these devices, I do control what to do with corrupted data. I use parse_date_time to shift from the string I receive from CSV to properly formatted datetime. Having inconsistent behavior of the function broke my package (i.e., when using it inside mutate, you don't get NA but another form of more insidious corruption: a valid datetime which is incorrect!).
Finally, I am not asking how to fix the corruption on this string. I'm documenting what I believe is a bug. I used the comparison from the clock package to show what I believe should be the expected behavior in lubridate.

I believe the core of the issue still holds: parse_date_time() should either return NA both when called directly or inside a mutate. The behavior that I saw with incorrect conversion to a valid but incorrect datetime might potentially affect other people.

Other people have reported weird behavior in similar issues. See here and here.

rion-saeon commented 4 months ago

@matiasandina OK cool and actually, thanks for pointing this potential bug, out. I now understand that this could be a bug and needs to be looked at because, I am also running parse_date_time inside of mutate but there shouldn't be corruption of my data. Instead, I specify different timestamp formats that come in which, similarly to you, I have little control over (data downloaded onto different computers cause annoying timestamp discrepancies among datasets).

I also agree that NA should be the best result as one may end up incorporating more errors or only resolve some, when wanting to rectify things using elaborate workarounds.

However, if one starts to explore your corrupted data there would be a pattern you should be able to get around to using if_else or case_when functions, regex and stringr package.

In any case, it seems like it will be wise to move over to clock to be safe and, the package seems to be more advanced. Pity it's not part of the tidyverse as I am a bit of a tidy' snob.

Finally, I am thus, suprised and somewhat disappointed that this has not been looked at since your OP!

PS I deleted my post as it's irrelevant at this point.

tidyverse / lubridate

Inconsistent behavior of `parse_date_time()` inside `dplyr::mutate()` #1149