colsplit should allow use of Perl regular expressions

zackw commented 12 years ago

Motivating example: consider this cut-down data frame

X <- structure(list(
    origin = structure(1L, .Label = c("c"), class = "factor"),
    cluster = structure(1L, .Label = c("3"), class = "factor"),
    n = structure(1L, .Label = c("1"), class = "factor"),
    distance = 0.0781457901901654, t0 = 0, t1 = 0, t2 = 0, t3 = 0.1125,
    t4 = 0.09, t5 = 0.241666666666667, t6 = 0.35, t7 = 0.43125,
    t8 = 0.494444444444444, t9 = 0.545, t10 = 0.586363636363636,
    t11 = 0.620833333333333, t12 = 0.65, t13 = 0.675,
    t14 = 0.696666666666667, t15 = 0.715625, t16 = 0.732352941176471,
    t17 = 0.747222222222222, t18 = 0.760526315789474, t19 = 0.7725,
    l0 = 132L, l1 = 198L, l2 = 309L, l3 = 1353L, l4 = 74L, l5 = 586L,
    l6 = 586L, l7 = 586L, l8 = 586L, l9 = 586L, l10 = 586L, l11 = 586L,
    l12 = 586L, l13 = 586L, l14 = 1172L, l15 = 586L, l16 = 586L,
    l17 = 586L, l18 = 586L, l19 = 586L), .Names = c("origin",
    "cluster", "n", "distance", "t0", "t1", "t2", "t3", "t4", "t5",
    "t6", "t7", "t8", "t9", "t10", "t11", "t12", "t13", "t14", "t15",
    "t16", "t17", "t18", "t19", "l0", "l1", "l2", "l3", "l4", "l5",
    "l6", "l7", "l8", "l9", "l10", "l11", "l12", "l13", "l14", "l15",
    "l16", "l17", "l18", "l19"), row.names = 5L, class = "data.frame")

The "measure variables" are tXX and lXX. I want to give the t's and the l's each their own column. This would be as simple as

Y <- melt(X, id.vars=c('origin','cluster','n','distance'))
Y <- cbind(Y, colsplit(Y$variable, split="(?<=[lt])", perl=TRUE,
                       names=c("var", "i")))
cast(Y, origin + cluster + distance + n + i ~ var)

if colsplit supported perl=TRUE. It doesn't, and without that I see no way to do this short of reimplementing colsplit myself. The underlying strsplit does support perl=TRUE, so the fix is as simple as accepting the argument and passing it down. (Note: in the present codebase, strsplit seems to have been replaced with a function, str_split_fixed, whose definition I cannot find.)

hadley commented 12 years ago

You can already do this: colsplit(Y$variable, perl("(?<=[lt])"), names=c("var", "i")). Make sure you have a recent version of stringr.
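
For reference, the split that the lookbehind pattern is meant to produce (letter prefix in one column, numeric index in the other) can be sketched in base R, without depending on a particular stringr version. This is only an illustration: the sample vector `vars` and the helper name `split_var` are assumptions made up for this example and are not part of reshape or stringr.

```r
# Hypothetical sample of the melted `variable` column from the example above.
vars <- c("t0", "t19", "l3", "l14")

# Illustrative helper (not part of reshape or stringr): split a name such as
# "t12" into its letter prefix and its numeric index using base R only.
split_var <- function(x) {
  pattern <- "^([lt])([0-9]+)$"
  # Fail early if any name does not have the expected prefix-plus-digits shape.
  stopifnot(all(grepl(pattern, x)))
  data.frame(
    var = sub(pattern, "\\1", x),              # the letter: "t" or "l"
    i   = as.integer(sub(pattern, "\\2", x)),  # the index as an integer
    stringsAsFactors = FALSE
  )
}

parts <- split_var(vars)
print(parts)
```

Applied to the thread's data, `cbind(Y, split_var(Y$variable))` would add the same `var` and `i` columns that the colsplit call is expected to return, although here `i` is an integer rather than a string.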

zackw commented 12 years ago

OK, I guess that's better than adding perl= boolean arguments all over the place.

hadley / reshape, issue #12