csv_to_hier not working as expected when only one element is spitted out by str_split

homer3018 commented 4 years ago

Let's suppose we have this list

list(c("a","b","c") %>% paste0(collapse = "-"),
c("a2") %>% paste0(collapse = "-"),
c("a3","b3","c3","d3") %>% paste0(collapse = "-"))

There is no separator found for a2, and csv_to_hier currently reports it in the column t.rw due to the lapply call:

dplyr::bind_rows(
  lapply(
    strsplit(as.character(list(c("a","b","c") %>% paste0(collapse = "-"),
                               c("a2") %>% paste0(collapse = "-"),
                               c("a3","b3","c3","d3") %>% paste0(collapse = "-"))), delim),
    function(rw) data.frame(t(rw), stringsAsFactors = FALSE)
  )
)

gives

    X1   X2   X3 t.rw.   X4
1    a    b    c  <NA> <NA>
2 <NA> <NA> <NA>    a2 <NA>
3   a3   b3   c3  <NA>   d3

that's obviously not right. Using plyr::ldply with rbind would do just fine in such case :

plyr::ldply(strsplit(as.character(list(c("a","b","c") %>% paste0(collapse = "-"),
                                       c("a2") %>% paste0(collapse = "-"),
                                       c("a3","b3","c3","d3") %>% paste0(collapse = "-"))), delim), rbind)

gives

   1    2    3    4
1  a    b    c <NA>
2 a2 <NA> <NA> <NA>
3 a3   b3   c3   d3

Due to this, the following code any(is.na(df[,1])) kicks in and make a mess (specifically with idx_notna <- which(!is.na(df[,1])) then df[idx_notna,2] <- df[idx_notna,1] it deletes parts of the hierarchy and does not fix the original issue since we cannot predict the position of the column t.rw. If csv hasn't got any NA, then we shouldn't need such checks.

Because of that sund2b works but is not correct.

cjyetman commented 4 years ago

If I understand correctly, a minimal reproducible example would be...

# works as expected
data <- read.table(header = FALSE, text = "
a-b-c   124
a2-b2  42")
sunburstR::sunburst(data)

# does not work as expected
data <- read.table(header = FALSE, text = "
a-b-c   124
a2  42")
sunburstR::sunburst(data)

ultimately, I believe this comes down to the difference between these...

data.frame(t(c("a")), stringsAsFactors = FALSE)
#>   t.c..a...
#> 1         a

data.frame(t(c("a", "b")), stringsAsFactors = FALSE)
#>   X1 X2
#> 1  a  b

which probably means the csv_to_hier function would need to use setNames() on each iteration of its apply function so that dplyr::bind_rows doesn't move things around in an attempt to match column names.

Or, that bit could be re-written in a way that avoids the rbind/dplyr::bind_rows process, for instance...

# instead of this
dplyr::bind_rows(
  lapply(
    strsplit(as.character(csv[[1]]), delim),
    function(rw) data.frame(t(rw), stringsAsFactors = FALSE)
  )
)
#>     X1   X2   X3 t.rw.
#> 1    a    b    c  <NA>
#> 2 <NA> <NA> <NA>    a2

# this?
read.delim(header = FALSE, sep = delim, text = paste(csv[[1]], collapse = "\n"))
#>   V1 V2 V3
#> 1  a  b  c
#> 2 a2

homer3018 commented 4 years ago

That is 100% correct. Your minimal example are completely illustrating the issue. Not only a2 is not being drawn but a-b-c is also altered to a-c.

Using plyr::ldply or read.delim would definitely solve the issue, which ever suits you best.

One thing though, I haven't been able to locate the function csv_to_hier. I have found its code online. Where would this be ?

cjyetman commented 4 years ago

csv_to_hier is an internal function, i.e. it is not intended for direct use, therefore it is not "exported". You can however access it using the sunburstR:::csv_to_hier() syntax.

homer3018 commented 4 years ago

Right. It seems that it does not work on my machine :

> sunburstR::csv_to_hier
Error: 'csv_to_hier' is not an exported object from 'namespace:sunburstR'
> sunburstR::csv_to_hier()
Error: 'csv_to_hier' is not an exported object from 'namespace:sunburstR'

cjyetman commented 4 years ago

You have to use 3 :'s, not 2 to access a non-exported function... ::: instead of ::

homer3018 commented 4 years ago

My bad. I should have known better! Thanks !

homer3018 commented 4 years ago

Coming back to the original issue, I've just tested what you suggested ie

read.delim(header = FALSE, sep = delim, text = paste(csv[[1]], collapse = "\n"))

And it does not quite work as expected since read.delim determines The number of data columns [...] by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

In other words this

read.delim(header = FALSE, sep = delim, text = paste(as.character(list(c("a","b","c") %>% paste0(collapse = "-"),
                                                                       c("a2") %>% paste0(collapse = "-"),
                                                                       c("a","b","c") %>% paste0(collapse = "-"),
                                                                       c("a2") %>% paste0(collapse = "-"),
                                                                       c("a","b","c") %>% paste0(collapse = "-"),
                                                                       c("a3","b3","c3","d3") %>% paste0(collapse = "-"))), collapse = "\n"))

produces this

  V1 V2 V3
1  a  b  c
2 a2      
3  a  b  c
4 a2      
5  a  b  c
6 a3 b3 c3
7 d3

Using plyr:ldply is producing the expected result

plyr::ldply(strsplit(as.character(list(c("a","b","c") %>% paste0(collapse = "-"),
                                       c("a2") %>% paste0(collapse = "-"),
                                       c("a","b","c") %>% paste0(collapse = "-"),
                                       c("a2") %>% paste0(collapse = "-"),
                                       c("a","b","c") %>% paste0(collapse = "-"),
                                       c("a3","b3","c3","d3") %>% paste0(collapse = "-"))), delim), rbind)

gives

   1    2    3    4
1  a    b    c <NA>
2 a2 <NA> <NA> <NA>
3  a    b    c <NA>
4 a2 <NA> <NA> <NA>
5  a    b    c <NA>
6 a3   b3   c3   d3

cjyetman commented 4 years ago

good point! Just to be clear, I am not the maintainer of this project, just a former contributor, so I'm merely making suggestions.

I doubt that @timelyportfolio will want to add plyr as a dependency to this package since it is officially "retired", but that is not my decision to make.

May I make one suggestion to you @homer3018... in your examples you repeatedly use this very complex process of creating a list object...

list(c("a","b","c") %>% paste0(collapse = "-"),
      c("a2") %>% paste0(collapse = "-"),
      c("a","b","c") %>% paste0(collapse = "-"),
      c("a2") %>% paste0(collapse = "-"),
      c("a","b","c") %>% paste0(collapse = "-"),
      c("a3","b3","c3","d3") %>% paste0(collapse = "-"))

That structure is very difficult to quickly interpret, and it also requires a 3rd party package to be loaded for the %>% to work. You could replace that with...

list("a-b-c",
     "a2",
     "a-b-c",
     "a2",
     "a-b-c",
     "a3-b3-c3-d3")

which would then allow the reader to focus on where the actual bug exists rather than trying to error check an unnecessarily complex object creation.

homer3018 commented 4 years ago

Apologies for that, I copied/pasted things from my actual code in which I have to resort to the pipe and paste commands, I should have simplified it.

I agree that plyr might not be the right tool to fix this issue, however this is very annoying to have this not working as I currently have to deal with a lot of root objects.

That's mainly why I've not opened a pull request. Surely there is a better way of fixing this.

Thanks for your support so far. I guess it's up to @timelyportfolio now.

cjyetman commented 4 years ago

Not sure what exactly you're trying to achieve, but if the goal is to be able to pass data to sunburstR that might have this structure that doesn't process properly, you could write a function to preprocess the incoming data (using plyr or whatever your preferred method is) before passing to sunburstR...

fix_data <- function(data, delim = "-") {
  node_data <- data[[1]]
  node_data <- plyr::ldply(strsplit(as.character(node_data), delim), rbind)
  node_data$size = data[[2]]
  d3r::d3_nest(node_data, value_cols = "size")
}

data <- read.table(header = FALSE, text = "
a-b-c        1
x            1
d-e-f        1
y            1
g-h-i        1
a3-b3-c3-d3  1")

data
#>            V1 V2
#> 1       a-b-c  1
#> 2           x  1
#> 3       d-e-f  1
#> 4           y  1
#> 5       g-h-i  1
#> 6 a3-b3-c3-d3  1

# does not work as expected
sunburstR::sunburst(data)

# works as expected
sunburstR::sunburst(fix_data(data))

homer3018 commented 4 years ago

Well my point exactly. I'm doing precisely what you described where I shouldn't, since csv_to_hier should take care of that task. I'll do that no worries while this is not fixed, but that is a band aid that shouldn't exist is all I'm saying, hence this issue to get this function fixed..

cjyetman commented 4 years ago

Ok... I thought because you're "very annoyed" that you hadn't found a suitable workaround. @timelyportfolio will get to this when they get to it, and I'm confident they'll come up with a good solution.

timelyportfolio commented 4 years ago

@homer3018 thanks for filing the issue and @cjyetman thanks so much for stepping in with a quick reply. You are correct in that I do not want to add any more dependencies (actually given the time I would like to start removing dependencies). I plan to work through the potential solutions discussed and will report back today.

As a note, the csv_to_hier was originally provided in JavaScript form and proved unreliable so I converted to R. I still believe that the flat table structure is better and favor it instead of the csv form. Here is a little discussion of the other structure https://github.com/timelyportfolio/flattree.

timelyportfolio commented 4 years ago

@homer3018 @cjyetman I pushed some changes to csv_to_hier that I believe fixes the issue without adding dependencies. Here is the reproducible code using @cjyetman data. Please test and if all good I will submit to CRAN. Thanks!

csv <- read.table(header = FALSE, text = "
a-b-c        1
x            1
d-e-f        1
y            1
g-h-i        1
a3-b3-c3-d3  1")

sunburst(csvdata = csv)

cjyetman commented 4 years ago

Works for me! (using the dev version obviously)

homer3018 commented 4 years ago

All good as well from my perspective, good work 👍 Thanks !

timelyportfolio commented 4 years ago

Thanks @cjyetman @homer3018 I will submit to CRAN today.

timelyportfolio commented 4 years ago

Thanks again. Please reopen if this does not fully solve.

timelyportfolio / sunburstR

csv_to_hier not working as expected when only one element is spitted out by str_split #107