oobianom / quickcode

An R package made out of mine and Brice's scrapbook of much needed functions.
https://quickcode.obi.obianom.com

Extract One or More Dates Referenced in a Vector String #31

Open brichard1638 opened 5 months ago

brichard1638 commented 5 months ago

The lubridate package in R provides a set of functions designed to work with date-times and time-spans - essentially, temporal data. These functions facilitate extracting, converting, and configuring temporal data. In addition, lubridate is one of the core packages included in R's tidyverse.

BACKGROUND

The lubridate package offers a collection of parsing functions that lend themselves to Natural Language Processing (NLP) tasks. The challenge these functions address is isolating and extracting temporal language, in the form of dates and times, from a vector string without applying a regular expression. Specific to dates, these functions include the following:

The key to ensuring output accuracy when applying lubridate’s temporal functions is to apply the correct function consistent with its naming convention. For example, the classic function mdy isolates and captures a specifically formatted date that first references a month, then a day, and concludes with a year. Using these function types, no time reference is considered.

The date structure is critical to a successful implementation of these functions. While there are several date configurations from which to choose, the most commonly used date format referenced in a vector string is encoded in a month-day-year pattern.
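As a brief illustration of the naming convention, the sketch below applies three of lubridate's order-named parsers to strings that contain only a date. The input strings and variable names are made up for this example; only mdy, dmy, and ymd come from lubridate.

```r
library(lubridate)

# The function name spells out the order in which month, day, and year appear.
d_mdy <- mdy("07/19/2023")   # month, day, year
d_dmy <- dmy("19/07/2023")   # day, month, year
d_ymd <- ymd("2023-07-19")   # year, month, day

# All three describe the same calendar day.
print(c(d_mdy, d_dmy, d_ymd))

# The same digits parse to different dates under different orderings.
ambiguous <- "01/02/2023"
print(mdy(ambiguous))   # read as January 2, 2023
print(dmy(ambiguous))   # read as February 1, 2023
```

Choosing the parser whose name matches the layout of the data is what keeps the second case from silently swapping month and day.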

LIMITATIONS OF LUBRIDATE'S DATE FUNCTIONS

The date functions in the lubridate package are highly sensitive to how a vector string is worded and structured. As a result, a string can be composed in a way that causes the functions to return inaccurate results. A correct application of a lubridate function is provided in Example 1:

Example 1:
library(lubridate)
str1 = "The video was recorded on July 19, 2023."
mdy(str1)
[1] "2023-07-19"

While Example 1 returns an accurate result, a slight modification in the structure of the str1 vector, called str2, produces a very different result:

Example 2:
str2 = "The video was recorded over a 4 hour period starting on July 19, 2023."
mdy(str2)
[1] "2023-04-19"

The output returned in Example 2 is incorrect. The mdy function returns an inaccurate date due to the confusion created by adding a non-temporal numeric value to the vector.

In Example 3, two dates occur within a vector string. Applying the same lubridate function of mdy, and complying with the required month-day-year format, the following results are returned:

Example 3:
str3 = "The first batch of reports are due on July 12, 2024; the second batch on 7/19/24."
mdy(str3)
[1] NA
Warning message:
All formats failed to parse. No formats found.

In the final example, a vector of length two with interspersed dates also yields the same kind of error generated in Example 3:

Example 4:
str4 = c("On 3.12.25, Jerry is taking one month of leave and is not scheduled to return until around 4-9-2025.",
         "The staff will be out on training on 10/11/24, Oct 12, 2024, and 10-13-24.")
mdy(str4)
[1] NA NA
Warning message:
All formats failed to parse. No formats found.

These results showcase a serious problem and are consistent with every date-based function provided in the lubridate package. (For a listing of impacted functions, see the lubridate R package documentation ver. 1.9.3 pp. 66-72).

PROPOSED SOLUTION

A preliminary search did not turn up an R function that solves the problems highlighted in the previous section. However, a combination of functions in R can be used in a script to provide a solution. In lieu of this piecemeal approach, a comprehensive method in the form of a single function should be made available to address this challenge.
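One way such a piecemeal script might look is sketched below for the string from Example 2: a regular expression first isolates the date text, and base R's as.Date then parses only that text. The pattern and the variable names are illustrative assumptions, the format string covers a single date layout, and month-name parsing assumes an English locale.

```r
str2 <- "The video was recorded over a 4 hour period starting on July 19, 2023."

# Step 1: isolate the "Month day, year" text so stray numbers are ignored.
month_day_year <- "[A-Za-z]+ \\d{1,2}, \\d{4}"
date_text <- regmatches(str2, regexpr(month_day_year, str2))

# Step 2: parse only the isolated text (%B is the month name).
parsed <- as.Date(date_text, format = "%B %d, %Y")
print(parsed)
```

Each additional date layout would need its own pattern and format string, which is the maintenance burden a single function is meant to remove.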

PROPOSED FUNCTION NAME

To mitigate the limitations of lubridate’s temporal functions when extracting dates, an alternative, customized function can be used. This alternative function, called getDate, extends the functionality of lubridate’s date functions, returning accurate results.

TOTAL NUMBER OF FUNCTIONAL ARGUMENTS

1 - The argument name is vec

FUNCTION STRUCTURE

The following structure comprises the getDate function herein proposed:

getDate <- function(vec) {

  # Match various date patterns using regex
  dt_pattern <- "(\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}|\\d{1,2}-[A-Za-z]{3}-\\d{2,4}|[A-Za-z]+ \\d{1,2},? \\d{4}|\\d{8}|\\d{6}|\\d{1,2}\\.\\d{1,2}\\.\\d{2,4})"

  # Extract all occurrences of a date
  extracted_dt <- regmatches(vec, gregexpr(dt_pattern, vec))

  # Convert dates to an object of class Date
  extracted_dt <- lapply(extracted_dt, easyr::todate)

  # Return the extracted dates as a list object (or character(0) if not found)
  return(extracted_dt)
}
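For readers who prefer to avoid the easyr dependency, a base R variant could look like the sketch below. It is not the proposed getDate function: get_dates_base and parse_mdy_token are hypothetical names, the pattern is a reduced version of dt_pattern (it drops the 6-digit, 8-digit, and day-month-abbreviation forms), every date is assumed to be in month-day-year order, and month names assume an English locale.

```r
# Hypothetical helper: parse one extracted token, assuming month-day-year order.
parse_mdy_token <- function(tok) {
  if (grepl("^[A-Za-z]", tok)) {
    # Month-name form such as "July 19, 2023" or "Oct 12, 2024".
    # On input, %B accepts full and abbreviated month names (see ?strptime).
    return(as.Date(sub(",", "", tok), format = "%B %d %Y"))
  }
  # Numeric form: unify the separators, then pick %Y or %y by year width.
  parts <- strsplit(tok, "[/.-]")[[1]]
  fmt <- if (nchar(parts[3]) == 4) "%m-%d-%Y" else "%m-%d-%y"
  as.Date(paste(parts, collapse = "-"), format = fmt)
}

# Hypothetical base R counterpart to getDate: one Date vector per input string.
get_dates_base <- function(vec) {
  dt_pattern <- paste0(
    "(\\d{1,2}[/.-]\\d{1,2}[/.-](\\d{4}|\\d{2})",
    "|[A-Za-z]+ \\d{1,2},? \\d{4})"
  )
  tokens <- regmatches(vec, gregexpr(dt_pattern, vec))
  lapply(tokens, function(tok) {
    # Parse each token, then rebuild a Date vector (length 0 if nothing matched).
    iso <- vapply(tok, function(x) format(parse_mdy_token(x)),
                  character(1), USE.NAMES = FALSE)
    as.Date(iso)
  })
}

str3 <- "The first batch of reports are due on July 12, 2024; the second batch on 7/19/24."
str4 <- c("On 3.12.25, Jerry is taking one month of leave and is not scheduled to return until around 4-9-2025.",
          "The staff will be out on training on 10/11/24, Oct 12, 2024, and 10-13-24.")

print(get_dates_base(str3))
print(get_dates_base(str4))
```

Choosing the format from the shape of each token matters because as.Date does not reject a two-digit year under %Y; it would read "24" as the year 24 rather than 2024.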

TEST STATUS

The getDate function, as defined under the FUNCTION STRUCTURE section of this issue, has been tested and yielded accurate results. However, more testing should be conducted to verify both its accuracy and functional utility.

FUNCTION EXAMPLES

Applying the getDate function to the same vectors used in the previous examples returns the following results:

str1 = "The video was recorded on July 19, 2023."
str2 = "The video was recorded over a 4 hour period starting on July 19, 2023."
str3 = "The first batch of reports are due on July 12, 2024; the second batch on 7/19/24."
str4 = c("On 3.12.25, Jerry is taking one month of leave and is not scheduled to return until around 4-9-2025.",
         "The staff will be out on training on 10/11/24, Oct 12, 2024, and 10-13-24.")

getDate(str1)
[[1]]
[1] "2023-07-19"

getDate(str2)
[[1]]
[1] "2023-07-19"

getDate(str3)
[[1]]
[1] "2024-07-12" "2024-07-19"

getDate(str4)
[[1]]
[1] "2025-03-12" "2025-04-09"

[[2]]
[1] "2024-10-11" "2024-10-12" "2024-10-13"

CONCLUSION

The getDate function should be used when vector strings containing dates possess the following structural characteristics:

The getDate function can successfully extract multiple dates from different vector structures because:

It certainly could be argued that lubridate's temporal functions were not designed to be used with vector strings, but instead, only with date-based variables. This argument, if true, seriously minimizes the utility of these functions.

Finally, the dates returned by getDate are objects of class Date, stored in a list with one element per input string. Both the proposed function and its process approach solve the problems that inherently exist in lubridate's date functions.
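Because each list element is a Date vector, the result can be passed straight to base R's date tools. The sketch below hard-codes a list with the same shape and values as the getDate(str4) output shown earlier, so it runs without getDate; res, gap_days, and flat are illustrative names.

```r
# Stand-in for the getDate(str4) result: one Date vector per input string.
res <- list(
  as.Date(c("2025-03-12", "2025-04-09")),
  as.Date(c("2024-10-11", "2024-10-12", "2024-10-13"))
)

# Number of dates found in each input string.
print(lengths(res))

# Date arithmetic works directly: days between the two dates in element 1.
gap_days <- as.numeric(diff(res[[1]]))
print(gap_days)

# Flatten into a data frame that records which input string each date came from.
flat <- data.frame(
  element = rep(seq_along(res), lengths(res)),
  date = do.call(c, res)
)
print(flat)
```

Keeping the list structure until this last step preserves the link between each date and the string it was extracted from.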

NOTE1: It is by design that not every possible date configuration has been captured in the getDate function, including incomplete date formats. Examples where the getDate function is known to fail include, but are not limited to, date formats like 12.10, Dec 21.12, or 3/5.1976. It is certainly possible to create a function specifically designed to address misaligned date structures, which could be a consideration for development in the future.
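The failures listed in NOTE1 can be traced to the pattern itself: none of the three strings matches any alternative in dt_pattern, so nothing is extracted for conversion. The check below copies dt_pattern from the proposed function and uses only base R; the variable names and the three supported samples (taken from the example strings in this issue) are chosen for illustration.

```r
dt_pattern <- "(\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}|\\d{1,2}-[A-Za-z]{3}-\\d{2,4}|[A-Za-z]+ \\d{1,2},? \\d{4}|\\d{8}|\\d{6}|\\d{1,2}\\.\\d{1,2}\\.\\d{2,4})"

unsupported <- c("12.10", "Dec 21.12", "3/5.1976")
supported <- c("7/19/24", "Oct 12, 2024", "3.12.25")

# TRUE where the pattern finds a date-like token in the string.
print(grepl(dt_pattern, unsupported))
print(grepl(dt_pattern, supported))
```

A quick grepl check like this is a cheap way to see whether a new date layout is covered before any conversion is attempted.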

NOTE2: This extended information is designed to capture the functional structure, examples, and explanations that can be used in developing functional documentation.