Develop a function to split text exceeding 200 characters

parikp06 commented 11 months ago

Feature Idea

Purpose

Split text exceeding 200 characters into multiple variables. SDTMIG v3.3 | CDISC

Functionality

Based on the current requirement for the SAS v5 Transport file format, it is not possible to store the long text strings exceeding 200 characters using only one variable. Therefore, the SDTMIG has defined conventions for storing long text string using multiple variables. Follow the steps mentioned below to create a function to split long text into multiple variables.

1) Split the first 200 characters of text into split_var1 and each additional 200 characters of text into split_varN

2) When splitting a text string into several variables, the text should be split between words to improve readability, so it does not break a word.

3) Create a variable split_count to hold the number of new variables created by the split.

4) Default length to split on would be 200. However, please allow this value can be passed via split_length parameter to any other length.

Relevant Input

A character variable to be split into multiple variables.

An integer parameter split_length which would be set to 200 by default.

Relevant Output

One or more variables after splitting: split_var1 – split_varN.

Variable to hold the number of splits: split_count

Reproducible Example/Pseudo Code

split_vars(var_to_split,split_length=200)

kamilsi commented 11 months ago

Acknowledging the historical context of SAS V5 transport file restrictions, I'd like to point out that these limitations don't apply in R, offering us a chance to opt-out. However, my experience with submissions is limited, and I'm seeking clarity: are we considering this feature to meet current regulatory expectations, or is it more about maintaining legacy compatibility?

parikp06 commented 11 months ago

Agree that SAS V5 limitations do not apply to R dataframes. However, we still want to have this feature to be compliant with SDTMIG rules. I am also lacking submission experience with R packages, so looking for more feedback.

pendingintent commented 11 months ago

The submission SAS files still have the V5 limitations, which include no string can exceed 200 characters in length. Even if the xpt files are generated using R, the SASV5 limitations still apply unfortunately.

pendingintent commented 11 months ago

I had a look at this in SQL and if we consider a text field mapping to DSTERM for example where the field can be a maximum of 600 characters and we cannot split words, the formula for calculating the max number of required output variables is: (# required target variables) = CEILING(length source string / 200) + 1

The first (up to) 200 characters maps to DSTERM, with the remainder of the source string mapping to the QNAM variables DSTERM1, DSTERM2, DSTERM3.

I have used the SQL code: dsterm := upper(substr(input_string, 1, instr(substr(input_string, 1, 200), ' ', -1, 1 )));

This will select up to 200 characters without splitting words and insert into the DSTERM DS variable. Similar statements can then be used to select the other strings for each of the subsequent supplemental QVAL values.

kamilsi commented 11 months ago

Hi @parikp06 and @pendingintent,

I appreciate the clarification on the compliance with SDTMIG rules and the current submission requirements. Given this context, I'm ready to take on the development of this feature. I'll await any additional feedback or a decision on moving forward with this approach.

MaguluriVenkata commented 11 months ago

it's not clear where SPLITVARN will be stored in the specs? what's the plan?

parikp06 commented 11 months ago

This function will be called within a dataframe with input variable to be split. Upon successful execution, it should generate SplitVarN variables along with SplitCount to the output dataframe.

MaguluriVenkata commented 11 months ago

So it does not push SPLITVARn into SUPP domains?

parikp06 commented 11 months ago

no, it'll just create required list of variables. SplitCount will help in determining how many SUPP variables to be populated.

rammprasad commented 11 months ago

Keep the 200 character limit as parameterized and keep the default value as 200.

galachad commented 10 months ago

Hi @madhan0923 and @GomathiVallinayagam, are you working on this issue, or can I take it over?

GomathiVallinayagam commented 10 months ago

Yes @galachad , we are working on the issue.

edgar-manukyan commented 10 months ago

Adam recently did a very nice refactor of a similar function in Roche roak package. I would encourage collaborating with him, at least to get his ideas of function design 🙏

madhan0923 commented 10 months ago

sure @edgar-manukyan, we are about to complete the draft. with Adam's help we can release the 1st version for QC

ramiromagno commented 7 months ago

Hi Madhan (@madhan0923):

I just had a look at your code in PR https://github.com/pharmaverse/sdtm.oak/pull/32. I have a few suggestions/remarks.

I think your implementation of split_var() can be significantly simplified if you use either R base strwrap() or stringr::str_wrap().

In addition, it might be a good idea to also implement a reverse function, e.g. str_unwrap() that combines again the strings, something akin to paste(x, collapse = " "). The reason for this suggestion is that by having this reverse function we can set an expectation on what other downstream software should be getting when re-assembling the text.

From reading SDTMIG v3.4. (page 55), I think three questions remain:

What qualifies as an inter-word char, i.e. where can a split happen?
What happens if the text is a 200 char string (no inter-word chars). Functions such as strwrap() won't split if there are no inter-word chars, which is not acceptable according to the IG. I understand this is an unlikely corner case but still...
Should the inter-word chars be preserved in the output? I guess that if spaces are the inter-word chars, then probably not, as you do not want to leave a trailing space at the end of each string... But if the inter-word is another char, e.g. ";", I see that in your test code for split_var() you are indeed preserving the ";":
```
> split_var(x)
[[1]]
[1] "ASCORBIC ACID;BETACAROTENE;BIOTIN;CALCIUM;CHLORIDE;CHROMIUM;COPPER;FOLIC ACID;IODINE;LYCOPENE;MAGNESIUM;MANGANESE;MOLYBDENUM;NICKEL;NICOTINIC ACID;PANTOTHENIC ACID;PHOSPHORUS;POTASSIUM;"

[[2]]
[1] "PYRIDOXINE HYDROCHLORIDE;RIBOFLAVIN;SELENIUM;SILICON;THIAMINE;VANADIUM;VITAMIN B12 NOS;VITAMIN D NOS;VITAMIN E NOS;"
```
If you do not preserve the splitting chars, then how should downstream software be able to reconstruct the text...?

It might help taking a look at some code I've written in the past that uses R matrices to represent protein alignments, where I developed a similar function to reflow the alignment to a different width: https://github.com/maialab/agvgd/blob/master/R/split_alignment_by_lines.R

Longfei2 commented 6 months ago

Hello everyone, I have developed an R function for my company and be used in work successfully. Hope it helpful for oak

#' @param data A data frame or tibble.
#' @param target_var A character variable in input data set that need to be split, like \code{product_term}.
#' if the name of \code{target_var} is same as \code{outvar_name}, then the name of \code{target_var} is appended with a prefix, \code{(.)}.
#' @param outvar_name A string which is the prefix of similar variable names, like \code{"CMDECOD"}.
#' @param sep A character vector that is separator delimiting collapsed values.
#' 
#' @return A tibble including a series of variable calling \code{outvar_name}, which is invisible object.

dso_str_split <- function(data, target_var, outvar_name, sep = ";") {
  # Remove a series of variable named outvar_name with numeric suffix
  .to_keep <- names(data)[!grepl(paste0(outvar_name, "([0-9]+)$"), names(data))]
  .new_data <- data %>%
    dplyr::mutate(dso_unique_id = if (nrow(.) == 0L) {
      numeric(0L)
    } else {
      dplyr::row_number()
    }) %>%
    dplyr::select(dplyr::all_of(.to_keep), dso_unique_id)

  .need_split_var <- rlang::enquo(target_var)
  if (is.na(sep)) sep <- "\\;"

  # dso_check_strs(outvar_name, sep)

  # separate string
  .split_data <- .new_data %>%
    dplyr::mutate(dso_split_var = !!.need_split_var) %>%
    tidyr::separate_rows(dso_split_var, sep = sep) %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep)) %>%
    dplyr::mutate(
      dso_var1 = if (is.null(dso_cumsum(nchar(dso_split_var), x1 > 200L))) {
        numeric(0L)
      } else {
        dso_cumsum(nchar(dso_split_var), x1 > 200L)
      },
      dso_var2 = ifelse(nchar(dso_split_var) == dso_var1, TRUE, FALSE),
      dso_subid = dplyr::coalesce(cumsum(dso_var2), 1L),
      dso_text = dplyr::coalesce(dso_split_var, "")
    ) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep), dso_subid) %>%
    dplyr::summarise(dso_text = paste(dso_text, collapse = sep)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate_at("dso_text", ~ dplyr::if_else(stringr::str_squish(.) == "", NA_character_, .)) %>%
    # Avoid intermediate separator causing redundant symbols at the end
    dplyr::mutate(dso_text = gsub(paste0("(", sep, "+)$"), "", dso_text)) %>%
    tidyr::pivot_wider(names_from = dso_subid, values_from = dso_text, names_prefix = "dso_text") %>%
    dplyr::select(-dso_unique_id)

  # if the name of target_var and .out_var is same, then rename target_var to paste(., target_var)

  if (identical(rlang::as_label(.need_split_var), outvar_name)) {
    names(.split_data)[names(.split_data) == outvar_name] <- paste0(".", outvar_name)
  }

  # modify the name of series variable calling outvar_name
  .text_name <- names(dplyr::select(.data = .split_data, dplyr::starts_with("dso_text")))

  if (length(.text_name) == 1L) {
    names(.split_data)[names(.split_data) %in% .text_name] <- outvar_name
  } else {
    names(.split_data)[names(.split_data) %in% .text_name] <- c(outvar_name, paste0(outvar_name, 1L:(length(.text_name) - 1)))
  }

  invisible(.split_data)
}

dso_cumsum <- function(num, cond) {
  .cond <- rlang::enexpr(cond)
  Reduce(function(x1, x2) {
    x1 <- sum(x1, (x2 + 1L), na.rm = TRUE)
    if (eval(.cond)) {
      x1 <- 0L
      x1 <- sum(x1, x2, na.rm = TRUE)
    }
    return(x1)
  }, num, accumulate = TRUE)
}

library(dplyr)

test_df <- data.frame(
  id = 1:2,
  text = c(
    paste(strrep("a", 150), ";", strrep("a", 50), strrep("a", 130), sep = ";"),
    paste(strrep("a", 150), strrep("a", 30), strrep("a", 150), strrep("a", 80), sep = ";")
  )
)

new_test_df <- test_df %>% dso_str_split(text, "text", "\\;")

pharmaverse / sdtm.oak