pharmaverse / sdtm.oak

An EDC and Data Standard agnostic SDTM data transformation engine that automates the transformation of raw clinical data in ODM format to SDTM based on standard mapping algorithms
https://pharmaverse.github.io/sdtm.oak/
Apache License 2.0
25 stars 7 forks source link

Develop a function to create ISO8601 date #10

Closed rammprasad closed 9 months ago

rammprasad commented 10 months ago

Feature Idea

Purpose Converts collected date and optional time values into the iso8601 format. https://en.wikipedia.org/wiki/ISO_8601

Functionality The feature should parse collected pieces of date and time data and reformat them into a valid iso8601 format. Partial dates are acceptable and common. Invalid dates can be encountered and should be handled as described in this specification.

EDC-collected dates generally arrive in multiple formats. Some examples are dd MON yyyy, yyyymmdd, yyyy-mm-dd

Times arrive in the hh:mm or h:mm (leading 0 implied) or hh:mm:ss or hhmm or hhmmss format

We also need to handle data arriving as a single datetime field for example dd MON yyy HH:MM:ss

Format Value ISO 8601 Result
ddMMMyyyy 12NOV2020 2020-11-12
dd MMM yyyy 12 NOV 2020 2020-11-12
dd MMM yyyyhh:mm 12 NOV 202015:15 2020-11-12T15:15
dd MMM yyyyhh:mm 03 NOV 20203:30 2020-11-03T03:30
dd MMM yyyyhh:mm UN DEC 201914:00 2019-12--T14:00
dd MMM yyyy 3 NOV 2020 2020-11-03
MMM yyyy NOV 2020 2020-11
MMM yyyy MAR 2019 2019-03
dd MMM yyyy UN APR 2018 2018-04
dd MMM yyyy UN UNK 2015 2015
MMM yyyy APR 2018 2018-04
dd MMM yyyy UK APR 2018 2018-04
dd MMM yyyy UN UNK 0 Null because 0 is not a valid year
dd MMM yyyy UN UN 2015 2015
yyyymmdd 20190212 2019-02-12
yyyymmddhh:mm 2019020202:20 2019-02-02T02:20
yyyymmddhhmm 2019-04-041045 2019-04-04T10:45
yyyymmddhhmm 20200505null 2020-05-05

Relevant Input

one or many vectors and format for each vector.

Relevant Output

A vector in ISO8601 date format.

Reproducible Example/Pseudo Code

calculate_iso_date_time(vector1,format1...)

example calculate_iso_date_time(ae_start_date,dd MON YYYY)

johnaponte commented 10 months ago

Please have a look at https://cran.r-project.org/web/packages/parsedate/index.html Maybe the functionality is already available

ramiromagno commented 10 months ago

Quick evaluation of {parsedate} on the examples provided:

library(tidyverse)
library(parsedate)
#> 
#> Attaching package: 'parsedate'
#> The following object is masked from 'package:readr':
#> 
#>     parse_date

value <-
  c(
    "12NOV2020",
    "12 NOV 2020",
    "12 NOV 202015:15",
    "03 NOV 20203:30",
    "UN DEC 201914:00",
    "3 NOV 2020",
    "NOV 2020",
    "MAR 2019",
    "UN APR 2018",
    "UN UNK 2015",
    "APR 2018",
    "UK APR 2018",
    "UN UNK 0",
    "UN UN 2015",
    "20190212",
    "2019020202:20",
    "2019-04-041045",
    "20200505null"
  )

result <-
  c(
    "2020-11-12",
    "2020-11-12",
    "2020-11-12T15:15",
    "2020-11-03T03:30",
    "2019-12--T14:00",
    "2020-11-03",
    "2020-11",
    "2019-03",
    "2018-04",
    "2015",
    "2018-04",
    "2018-04",
    NA_character_,
    "2015",
    "2019-02-12",
    "2019-02-02T02:20",
    "2019-04-04T10:45",
    "2020-05-05"
  )

spec <- tibble(value, result)

spec |>
  dplyr::mutate(
    parse_date = parsedate::parse_date(value),
    parsedate_chr = as.character(parse_date)
  )
#> # A tibble: 18 × 4
#>    value            result           parse_date          parsedate_chr      
#>    <chr>            <chr>            <dttm>              <chr>              
#>  1 12NOV2020        2020-11-12       2020-01-12 00:00:00 2020-01-12         
#>  2 12 NOV 2020      2020-11-12       2020-11-12 00:00:00 2020-11-12         
#>  3 12 NOV 202015:15 2020-11-12T15:15 2015-11-12 00:00:00 2015-11-12         
#>  4 03 NOV 20203:30  2020-11-03T03:30 2030-11-03 00:00:00 2030-11-03         
#>  5 UN DEC 201914:00 2019-12--T14:00  2023-12-01 00:00:00 2023-12-01         
#>  6 3 NOV 2020       2020-11-03       2020-11-03 00:00:00 2020-11-03         
#>  7 NOV 2020         2020-11          2020-11-01 00:00:00 2020-11-01         
#>  8 MAR 2019         2019-03          2019-03-01 00:00:00 2019-03-01         
#>  9 UN APR 2018      2018-04          2018-04-01 00:00:00 2018-04-01         
#> 10 UN UNK 2015      2015             2015-01-01 00:00:00 2015-01-01         
#> 11 APR 2018         2018-04          2018-04-01 00:00:00 2018-04-01         
#> 12 UK APR 2018      2018-04          2018-04-01 00:00:00 2018-04-01         
#> 13 UN UNK 0         <NA>             2023-10-10 01:45:05 2023-10-10 01:45:05
#> 14 UN UN 2015       2015             2015-01-01 00:00:00 2015-01-01         
#> 15 20190212         2019-02-12       2019-02-12 00:00:00 2019-02-12         
#> 16 2019020202:20    2019-02-02T02:20 2033-12-24 06:56:42 2033-12-24 06:56:42
#> 17 2019-04-041045   2019-04-04T10:45 2019-01-04 00:00:00 2019-01-04         
#> 18 20200505null     2020-05-05       2020-05-05 00:00:00 2020-05-05

Created on 2023-10-10 with reprex v2.0.2

rammprasad commented 10 months ago

Great stuff. Should we consider minimising dependencies? This is a simple feature that we can develop to suit our needs.

@edgar-manukyan @galachad - Please chime in.

rammprasad commented 10 months ago

In the meeting, It was decided to develop this functionality in the sdtm.oak package. Please self assign and start the development,

ramiromagno commented 10 months ago

Hi @rammprasad and others:

Can we agree on an interface for the function?

  1. Define a function name. I've read the general guidelines but at this early stage of development it would be helpful to have suggestions so that we get the gist of how you've been naming functions and objects in {oak}.
  2. Parameters and their types/classes?
  3. Return value type? Is it always just one type or can it vary? Is it character vector, a Date-classed vector, date-time classed vector (POSIXct and POSIXlt)? Or is it a heterogeneous object such as a list that might contain a mixture of those types?
  4. Is the time zone expected to play a role in the function functionality?
rammprasad commented 10 months ago
  1. function name - calculate_isodate(dateortime, format, ...) ... can be additional dateortime vectors and formats. For each dateortime vector, there should be a format. example function call calculate_isodate(birthdate, ddMMMyyyy, birthtime, hh:mm)
  2. dateortime - vector, format - string
  3. Return - Always one vector with iso8601 dates
  4. No timezone does not play a role.
edgar-manukyan commented 10 months ago

@rammprasad, please correct me if I am wrong:

ramiromagno commented 10 months ago

Let me see if I understood:

  1. Function name: calculate_isodate()
  2. Four inputs: 2.1 dateortime: perhaps a better name that is both applicable to dates and times, or date-times. This parameter would be a character vector. 2.2 format: also a character vector with the format. The number of elements of this vector could be one or as many as the number of elements in dateortime, i.e. use the tidyverse recycling rules: https://search.r-project.org/CRAN/refmans/vctrs/html/vec_recycle.html. 2.3 replace_na: A scalar logical value (as pointed out by @edgar-manukyan ) 2.3 Ellipsis (...) for extra arguments, to be passed on to lower-levels functions, e.g. format_ISO8601().
  3. Return value: A character vector with ISO8601 conformant dates or times.

@rammprasad: In your example: calculate_isodate(birthdate, ddMMMyyyy, birthtime, hh:mm) I am under the impression you want to use ellipsis to handle every pair of date or time and format arguments, but I think it is better to just make the function vectorized over dateortime and format.

Related to https://github.com/tidyverse/lubridate/issues/1142, https://github.com/r-lib/clock/issues/361.

ramiromagno commented 10 months ago

Packages that we might learn from:

SDTMIG v3.4 on dates: SDTMIG v3.4-DTC.pdf.

rammprasad commented 10 months ago

@ramiromagno - Sounds good to me. Shall we start putting this in the code and then make improvements? Feel free to take it as you have done quite a bit of research.

ramiromagno commented 10 months ago

Hi @rammprasad -- Sorry for answering only now. Yes, I can take this.

rammprasad commented 10 months ago
visit_date vector - c("01NOV2015", "02DEC2015", "5-DEC-2023", "25JAN2015")

calculate_iso_date_time(var_name = visit_date,format=ddmmmyyyy)

Function output - c("2015-11-01","2015-12-02","2023-12-05","2015-01-25")

Example 2

collection_date <- c("01NOV2015", "02DEC2015", "5-DEC-2023", "JAN2015")
collection_time<- c("12:15","10:11","01:15",NA)

calculate_iso_date_time(var_name1 = collection_date, format1 = ddmmmyyyy, var_name2 = collection_time, format 2 = hhmm)

Function output - c("2015-11-01T12:15","2015-12-02T10:11","2023-12-05T01:15","2015-01")

Example 3

collection_day <- c("01", "02", "5", NA)
collection_mon <- c("NOV", "DEC", "DEC", "JAN")
collection_year <- c("2015", "2015", "2023", "2015")

collection_hour <- c("12","10","01",NA)
collection_min<- c("15","11","15",NA)

calculate_iso_date_time(
        var_name1 = collection_day, format1 = dd, 
        var_name2 = collection_month, format2 = mon,
        var_name3 = collection_year, format3 = yyyy,
        var_name4 = collection_hour, format4 = hh,
        var_name5 = collection_min, format5 = mm)

Function output - c("2015-11-01T12:15","2015-12-02T10:11","2023-12-05T01:15","2015-01")
ramiromagno commented 10 months ago

Although I understand your code was meant as pseudo-code, I think it helps to get closer to actual R code to clarify the spec. Assuming this is the code you meant:

visit_date  <- c("01NOV2015", "02DEC2015", "5-DEC-2023", "25JAN2015")
calculate_iso_date_time(visit_date, format = "ddmmmyyyy")

Then the output would be:

[1] "2015-11-01" "2015-12-02" "2023-12-05" "2015-01-25"

What should happen if some of the details are missing?

visit_date  <- c("NOV2015", "02-2015", "5-DEC", "25JAN2015")
calculate_iso_date_time(visit_date, format = "ddmmmyyyy")

Possibilities:

# If some fields are missing the result is NA
[1] NA NA NA "2015-01-25"
# Parse what can be parsed and return iso8601 in reduced format
[1] "2015-11--" "2015----02" "--12-05" "2015-01-25"

# Or perhaps this? Note that "-----02" could happen if on "02-2015", 02 was interpreted as day, then "2015"
# would be attempted at being parsed as a month, failing, and the year would be considered missing resulting
# in "-----02"...
[1] "2015-11--" "-----02" "--12-05" "2015-01-25"

Should multiple formats be supported in one call?

visit_date  <- c("01NOV2015", "02DEC2015", "2019-02-03", "20180910")
calculate_iso_date_time(visit_date, format = c("ddmmmyyyy", "ddmmmyyyy", "yyyymmdd", "yyyymmdd"))

Returning:

[1] "2015-11-01" "2015-12-02" "2019-02-03" "2018-09-10"

When you say that the function accepts any number of vectors as an input, could you provide an example? I am asking this because I only see this function taking one input vector of dates/date-times, one vector of formats, and perhaps some other scalar arguments that modify the behavior of the function (e.g., replace_na as @edgar-manukyan suggested).

rammprasad commented 10 months ago

Hi @ramiromagno ,

This cannot happen as all dates are not in the format provided. The date can have missing components though.

visit_date  <- c("NOV2015", "02-2015", "5-DEC", "25JAN2015")
calculate_iso_date_time(visit_date, format = "ddmmmyyyy")

The expected output for this will be

c("2015-11",NA,NA,"2015-01-25")

The reason "02-2015" is NA because it is not in the expected format. The reason 5-DEC is NA is that if the date and month are known, then the year should be present.

I will add more examples.

rammprasad commented 10 months ago

Also, this is impossible as EDC collects data in a way one variable will have a defined format. One vector cannot have multiple formats.

visit_date  <- c("01NOV2015", "02DEC2015", "2019-02-03", "20180910")
calculate_iso_date_time(visit_date, format = c("ddmmmyyyy", "ddmmmyyyy", "yyyymmdd", "yyyymmdd")) 
rammprasad commented 10 months ago

@ramiromagno - I have added three examples in my earlier comment. Please review and let me know in case of any questions.

ramiromagno commented 10 months ago

Thanks, @rammprasad, that helps a lot. Just one more question: assume that the format indicates the month in numeric format and that the year can be in 2-digit format, so format = "dd-mm-yy", then if there are missing details, e.g., the day, let us say "05-03", then it is not obvious if this should be the 5th of March or May, 2003.

ramiromagno commented 10 months ago

If the format is "mm" does it mean numeric month or minutes?

rammprasad commented 10 months ago

Should we keep mm as minutes, and mon can handle numeric and character months?

rammprasad commented 10 months ago

assume that the format indicates the month in numeric format and that the year can be in 2-digit format, so format = "dd-mm-yy"

One of our rules is, if the year is unknown, the month and day cannot be known. So in this case, it will always be May, 2003

ramiromagno commented 10 months ago

Can we generalize a bit more, and say that if a date-time component is missing but its precision is higher than the others, then it is fine, otherwise, a less precise component always jeopardizes higher precision components:

And can the same reasoning be applied to times? If seconds are missing, hour and minutes can still be collected but if hour is missing then we drop the minutes and the seconds...?

rammprasad commented 10 months ago

I am good with the approach mentioned in the above comment

ramiromagno commented 10 months ago

Hi @rammprasad,

What qualifies as a valid year?

In the first post you had this example with 0 which was to be considered invalid. Is there a range of valid values?

ramiromagno commented 10 months ago

What values are admissible for representing missingness. In your first post you had:

Are there other alternative values that may be used to represent missing date-time components? What are the missing values' representations for year and time components?

rammprasad commented 10 months ago

To answer both the questions,

  1. when the format is specified as yyyy, we expect 4 digits. We do not have to validate anything beyond that.
  2. When the format is specified as yy, we expect 2 digits. We do not have to validate anything beyond that. We cannot specify a range for the year
  3. Any other character UN, UNK, U are not recognized as day, month, or year is considered missing.
ramiromagno commented 10 months ago

Hi!,

Here are few examples, although this is still work in progress.

NB: the interface is not yet the one intended but the function format_iso8601() can later on be wrapped in a new one with the desired interface. Also I am using strptime() formats (which are also used by {lubridate} and in other packages), meaning:

These format conventions can be easily changed in my implementation, if you prefer, though this way is more R idiomatic. I think I've covered many of the cases but most likely not all. As you can see that number of characters in the date-time components is irrelevant, i.e. mm does not mean numeric month and mmm abbreviation month name; format_iso8601() just does the right thing.

Currently, the function does mostly syntatic validation according to the specification indicated above, although there will be cases where we might need semantic validation, e.g. when the year is a two digit number, some interpretation will be involved in translating that to a four-digit year. Currently, I am using the same rule as lubridate which is to map years less or equal to 68 to 2000's and above that to 1900's. This will be made a parameter to control this behavior. Also, if a time component is above 60 should we do something about it? The same goes for month days and numeric months. Currently only syntax is validated.

I am not using any of these other date/time related packages. So I might be reinventing the wheel here. But all implementations I looked at always converted missing components to 0, and did not offer a way of handling funny missing values...

Please provide feedback, and if I'm in heading in the right direction then I will write the wrapping function according to the spec, write unit tests and documentation.

library(sdtm.oak)

format_iso8601("15:10", "HH:MM")
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T15:10 15:10 -     -     -     15    10    -
format_iso8601("2:10", "HH:MM")  
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:10 2:10  -     -     -     02    10    -
format_iso8601("2:1", "HH:MM")  
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("2:01", "HH:MM")  
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01 2:01  -     -     -     02    01    -
format_iso8601("02:01", "HH:MM")  
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01 02:01 -     -     -     02    01    -
format_iso8601("02:01:56", "HH:MM:SS")  
#> # A tibble: 1 × 8
#>   iso8601        dttm     year  mon   mday  hour  min   sec  
#>   <chr>          <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01:56 02:01:56 -     -     -     02    01    56
format_iso8601("02:01:56.5", "HH:MM:SS")  
#> # A tibble: 1 × 8
#>   iso8601          dttm       year  mon   mday  hour  min   sec  
#>   <chr>            <chr>      <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01:56.5 02:01:56.5 -     -     -     02    01    56.5

format_iso8601("1510", "HHMM")
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T15:10 1510  -     -     -     15    10    -
format_iso8601("210", "HHMM")  
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:10 210   -     -     -     02    10    -
format_iso8601("21", "HHMM")  
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("201", "HHMM")  
#> # A tibble: 1 × 8
#>   iso8601     dttm  year  mon   mday  hour  min   sec  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01 201   -     -     -     02    01    -
format_iso8601("020156.5", "HHMMSS")
#> # A tibble: 1 × 8
#>   iso8601          dttm     year  mon   mday  hour  min   sec  
#>   <chr>            <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -----T02:01:56.5 020156.5 -     -     -     02    01    56.5

format_iso8601("12 NOV 202015:15", "dd mmm yyyyHH:MM")
#> # A tibble: 1 × 8
#>   iso8601          dttm             year  mon   mday  hour  min   sec  
#>   <chr>            <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-11-12T15:15 12 NOV 202015:15 2020  11    12    15    15    -
format_iso8601("03 NOV 20203:30", "dd mmm yyyyHH:MM")
#> # A tibble: 1 × 8
#>   iso8601          dttm            year  mon   mday  hour  min   sec  
#>   <chr>            <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-11-03T03:30 03 NOV 20203:30 2020  11    03    03    30    -
format_iso8601("02 DEC 201914:00", "dd mmm yyyyHH:MM")
#> # A tibble: 1 × 8
#>   iso8601          dttm             year  mon   mday  hour  min   sec  
#>   <chr>            <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-02T14:00 02 DEC 201914:00 2019  12    02    14    00    -
format_iso8601("U DEC 201914:00", "dd mmm yyyyHH:MM")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("U DEC 201914:00", "dd mmm yyyyHH:MM", na = "U")
#> # A tibble: 1 × 8
#>   iso8601         dttm            year  mon   mday  hour  min   sec  
#>   <chr>           <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12--T14:00 U DEC 201914:00 2019  12    -     14    00    -
format_iso8601("UN DEC 201914:00", "dd mmm yyyyHH:MM", na = "UN")
#> # A tibble: 1 × 8
#>   iso8601         dttm             year  mon   mday  hour  min   sec  
#>   <chr>           <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12--T14:00 UN DEC 201914:00 2019  12    -     14    00    -
format_iso8601("UNK DEC 201914:00", "dd mmm yyyyHH:MM", na = "UNK")
#> # A tibble: 1 × 8
#>   iso8601         dttm              year  mon   mday  hour  min   sec  
#>   <chr>           <chr>             <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12--T14:00 UNK DEC 201914:00 2019  12    -     14    00    -
format_iso8601("3 NOV 2020", "dd mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601    dttm       year  mon   mday  hour  min   sec  
#>   <chr>      <chr>      <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-11-03 3 NOV 2020 2020  11    03    -     -     -
format_iso8601("NOV 2020", "mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm     year  mon   mday  hour  min   sec  
#>   <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-11 NOV 2020 2020  11    -     -     -     -
format_iso8601("MAR 2019", "mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm     year  mon   mday  hour  min   sec  
#>   <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-03 MAR 2019 2019  03    -     -     -     -
format_iso8601("MaR 2019", "mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm     year  mon   mday  hour  min   sec  
#>   <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-03 MaR 2019 2019  03    -     -     -     -
format_iso8601("mar 2019", "mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm     year  mon   mday  hour  min   sec  
#>   <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-03 mar 2019 2019  03    -     -     -     -

format_iso8601("UN APR 2018", "dd mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("UN APR 2018", "dd mmm yyyy", na = "UN")
#> # A tibble: 1 × 8
#>   iso8601 dttm        year  mon   mday  hour  min   sec  
#>   <chr>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2018-04 UN APR 2018 2018  04    -     -     -     -

format_iso8601("UN UNK 2015", "dd mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("UN UNK 2015", "dd mmm yyyy", na = "UN")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("UN UNK 2015", "dd mmm yyyy", na = "UNK")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("UN UNK 2015", "dd mmm yyyy", na = c("UN", "UNK"))
#> # A tibble: 1 × 8
#>   iso8601 dttm        year  mon   mday  hour  min   sec  
#>   <chr>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2015    UN UNK 2015 2015  -     -     -     -     -

format_iso8601("APR 2018", "mmm yyyy")
#> # A tibble: 1 × 8
#>   iso8601 dttm     year  mon   mday  hour  min   sec  
#>   <chr>   <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2018-04 APR 2018 2018  04    -     -     -     -
format_iso8601("UK APR 2018", "dd mmm yyyy", na = "UK")
#> # A tibble: 1 × 8
#>   iso8601 dttm        year  mon   mday  hour  min   sec  
#>   <chr>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2018-04 UK APR 2018 2018  04    -     -     -     -
format_iso8601("UN UNK 0", "dd mmm yyyy", na = c("UN", "UNK"))
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("UN UN 2015", "dd mmm yyyy", na = c("UN", "UNK"))
#> # A tibble: 1 × 8
#>   iso8601 dttm       year  mon   mday  hour  min   sec  
#>   <chr>   <chr>      <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2015    UN UN 2015 2015  -     -     -     -     -

format_iso8601("20190212", "yyyymmdd")
#> # A tibble: 1 × 8
#>   iso8601    dttm     year  mon   mday  hour  min   sec  
#>   <chr>      <chr>    <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-02-12 20190212 2019  02    12    -     -     -
format_iso8601("2019020202:20", "yyyymmddHH:MM")
#> # A tibble: 1 × 8
#>   iso8601          dttm          year  mon   mday  hour  min   sec  
#>   <chr>            <chr>         <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-02-02T02:20 2019020202:20 2019  02    02    02    20    -
format_iso8601("2019-04-041045", "yyyy-mm-ddHHMM")
#> # A tibble: 1 × 8
#>   iso8601          dttm           year  mon   mday  hour  min   sec  
#>   <chr>            <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-04-04T10:45 2019-04-041045 2019  04    04    10    45    -
format_iso8601("2019-04-041045-", "yyyy-mm-ddHHMM-")
#> # A tibble: 1 × 8
#>   iso8601          dttm            year  mon   mday  hour  min   sec  
#>   <chr>            <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-04-04T10:45 2019-04-041045- 2019  04    04    10    45    -
format_iso8601("2019-04-041045-", "y-m-dHM-")
#> # A tibble: 1 × 8
#>   iso8601          dttm            year  mon   mday  hour  min   sec  
#>   <chr>            <chr>           <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-04-04T10:45 2019-04-041045- 2019  04    04    10    45    -

format_iso8601("20200507null", "yyyymmddHHMM")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("20200507null", "yyyymmdd")
#> # A tibble: 1 × 8
#>   iso8601 dttm  year  mon   mday  hour  min   sec  
#>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>    <NA>  -     -     -     -     -     -
format_iso8601("20200507null", "yyyymmddnull")
#> # A tibble: 1 × 8
#>   iso8601    dttm         year  mon   mday  hour  min   sec  
#>   <chr>      <chr>        <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-05-07 20200507null 2020  05    07    -     -     -
format_iso8601(c("20200507null", "2019120602:20:13"), "ymd((HH:MM:SS)|null)")
#> # A tibble: 2 × 8
#>   iso8601             dttm             year  mon   mday  hour  min   sec  
#>   <chr>               <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2020-05-07          20200507null     2020  05    07    -     -     -    
#> 2 2019-12-06T02:20:13 2019120602:20:13 2019  12    06    02    20    13

format_iso8601("2019120602:20:13", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601             dttm             year  mon   mday  hour  min   sec  
#>   <chr>               <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13 2019120602:20:13 2019  12    06    02    20    13
format_iso8601("2019120602:20:13", "yyyymmddHH:MM:SS")
#> # A tibble: 1 × 8
#>   iso8601             dttm             year  mon   mday  hour  min   sec  
#>   <chr>               <chr>            <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13 2019120602:20:13 2019  12    06    02    20    13
format_iso8601("2019120602:20:13.", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601             dttm              year  mon   mday  hour  min   sec  
#>   <chr>               <chr>             <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13 2019120602:20:13. 2019  12    06    02    20    13
format_iso8601("2019120602:20:13.0", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601             dttm               year  mon   mday  hour  min   sec  
#>   <chr>               <chr>              <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13 2019120602:20:13.0 2019  12    06    02    20    13
format_iso8601("2019120602:20:13.1", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601               dttm               year  mon   mday  hour  min   sec  
#>   <chr>                 <chr>              <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.1 2019120602:20:13.1 2019  12    06    02    20    13.1
format_iso8601("2019120602:20:13.123", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                 dttm               year  mon   mday  hour  min   sec  
#>   <chr>                   <chr>              <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.123 2019120602:20:13.… 2019  12    06    02    20    13.1…
format_iso8601("2019120602:20:13.123000", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                 dttm               year  mon   mday  hour  min   sec  
#>   <chr>                   <chr>              <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.123 2019120602:20:13.… 2019  12    06    02    20    13.1…
format_iso8601("2019120602:20:13.1230001", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                     dttm           year  mon   mday  hour  min   sec  
#>   <chr>                       <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.1230001 2019120602:20… 2019  12    06    02    20    13.1…
format_iso8601("2019-120602:20:13.1230001", "y-mdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                     dttm           year  mon   mday  hour  min   sec  
#>   <chr>                       <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.1230001 2019-120602:2… 2019  12    06    02    20    13.1…
format_iso8601("19-120602:20:13.1230001", "y-mdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                     dttm           year  mon   mday  hour  min   sec  
#>   <chr>                       <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.1230001 19-120602:20:… 2019  12    06    02    20    13.1…
format_iso8601("19120602:20:13.1230001", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                     dttm           year  mon   mday  hour  min   sec  
#>   <chr>                       <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2019-12-06T02:20:13.1230001 19120602:20:1… 2019  12    06    02    20    13.1…
format_iso8601("80120602:20:13.1230001", "ymdH:M:S")
#> # A tibble: 1 × 8
#>   iso8601                     dttm           year  mon   mday  hour  min   sec  
#>   <chr>                       <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1980-12-06T02:20:13.1230001 80120602:20:1… 1980  12    06    02    20    13.1…

Created on 2023-10-30 with reprex v2.0.2

venkatamaguluri commented 10 months ago

Here is the input from Pfizer (Venkata Maguluri)

  1. Identify the format of date and time by R code itself
  2. user just give target element and input elements (date and time, time is optional)
ramiromagno commented 10 months ago

Hi Venkata,

Thank you for your feedback.

I am not sure I understand your suggestions.

  1. Identify the format of date and time by R code itself

Assuming you mean that the feature requested function should automagically detect the format and parse accordingly, then this request is changing the specification indicated by @rammprasad and is considerably more difficult to implement as it will probably require attempting different formats, and in a specific order... it really is a feature request with a more demanding implementation. Perhaps we can discuss this in Slack.

  1. user just give target element and input elements (date and time, time is optional)

Again, I'm not sure I understand your point/suggestion... but it seems to be hinting at a different spec. Again, Slack might be more appropriate to discuss this.

venkatamaguluri commented 10 months ago
  1. see below piece of code example, where it look input element and identify the date format and apply to ISO date appropriately.

install.packages("lubridate")

suppressMessages(library(lubridate))

Create a sample data frame with date columns in different formats

df <- data.frame( id = c(1) date1 = c("01012021", "02022022", "03032023"),

)

Function to identify and convert date format to ISO 8601

convert_to_iso8601 <- function(date_value) {

Define potential date formats

formats <- c("%d%m%Y", "%d%m%y", "%Y%m%d")

Try parsing the date value using different formats until successful

for (format in formats) { parsed_date <- try(parse_date_time(date_value, format), silent = TRUE) if (!inherits(parsed_date, "try-error")) { return(format(parsed_date, "%Y-%m-%d")) } }

Return the original value if no valid format is found

return(date_value) }

Convert date columns to ISO 8601 format

df[, -1] <- sapply(df[, -1], convert_to_iso8601)

Check the updated data frame

print(df)


2. if you see input should be just input domain and input elements to call by end user
edgar-manukyan commented 10 months ago
edgar-manukyan commented 10 months ago

Most of the time date and time come in two separate variables.

rammprasad commented 10 months ago

@ramiromagno - the examples in your above comment look good to me. Few questions to consider

  1. Can you clarify how you will be incorporating the below example?
collection_date <- c("01NOV2015", "02DEC2015", "5-DEC-2023", "JAN2015")
collection_time<- c("12:15","10:11","01:15",NA)

calculate_iso_date_time(var_name1 = collection_date, format1 = ddmmmyyyy, var_name2 = collection_time, format 2 = hhmm)

Function output - c("2015-11-01T12:15","2015-12-02T10:11","2023-12-05T01:15","2015-01")
  1. We need to discuss the warning messages when an incorrect format is provided for the value. For example, what kind of warning message do we provide if the format doesn't match with the input date vector.

    format_iso8601("21", "ddMMMyyyy")  
    #> # A tibble: 1 × 8
    #>   iso8601 dttm  year  mon   mday  hour  min   sec  
    #>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
    #> 1 <NA>    <NA>  -     -     -     -     -     -
  2. Specifying Unknown components has to be a vector. All of these values can represent a missing date or time component and will appear in the same input date or time vector. I think this is already handled, just confirming.

format_iso8601(c("10 DEC 201914:00", "U UNK 2020", "12 DEC 2020 UN:UK", "dd mmm yyyyHH:MM", na = c("U","UNK", "UK"))
  1. How do we handle if the date is present and the time is missing (not unknown, it is just missing). We expect the date component to be converted to iso8601 format. The user may not specify null
format_iso8601(c("10 DEC 201914:00", "12 DEC 2020", "dd mmm yyyyHH:MM")
ramiromagno commented 10 months ago

Hi @rammprasad:

Thank you for your feedback.

Before addressing your points, let me discuss a prior note about the user-facing function name: I think we can perhaps use a different name than "calculate". I think a better word might a verb that most closely matches what is actually being performed by such a function. We are essentially parsing and then assembling/creating/making the iso8601 formatted date-time string. So something as simple as create_iso8601() or create_iso8601_dttm() might convey its role/meaning more closely.

rammprasad commented 10 months ago

Ok, that sounds good. TY, @ramiromagno . Let's proceed with the implementation and creating test cases.

kamilsi commented 10 months ago
  1. No timezone does not play a role.

Hi @rammprasad,

In line with our adherence to ODM standards, as outlined here (https://wiki.cdisc.org/display/ODM2/Data+Formats), should we integrate a validation mechanism to confirm that the input data complies with these standards, including the handling of time zone information?

Given that ODM accommodates time zone data, I'm keen to understand our rationale for not addressing time zone considerations within our function. Insights into this decision would greatly assist in ensuring our oak.SDTM package's features are consistent with standard practices and that our documentation clearly communicates this aspect to users.

More broadly, I’m aiming to clarify how strictly our package should mirror ODM specifications, particularly features like time zone support. This understanding will guide us in dealing with similar considerations moving forward.

I look forward to your thoughts on this matter.

rammprasad commented 10 months ago

Hi @kamilsi, Our original plan is to use the ODM extracts from EDC systems. Later, we figured there was no easier way to extract the data in the ODM format from the EDC system due to the IT limitations. So, we decided to pivot from that option. Slides and other materials still need to be updated. EDC systems do not provide the timezone. So we are good, not considering the timezone component.

ramiromagno commented 9 months ago

Note to self: use triple colon ::: for examples in internal functions' documentation.