plotly / plotly.R

An interactive graphing library for R
https://plotly-r.com

Sankey node positions overridden for some uneven flows, and rules for node.x and node.y manual positions are not clear #2102

Open even-of-the-hour opened 2 years ago

even-of-the-hour commented 2 years ago

Main problem: Nodes appear in data frame order under some conditions (for example, symmetric flows), but with some asymmetric flows (not all of them) they are reordered according to rules I cannot identify. The rules for manual positioning with node.x and node.y are also unclear. I'm trying to work around the lack of a sorting feature but keep hitting snags.

Forgive me, I'm rather new to plotly and don't understand how plotly.R interacts with the Python or JavaScript versions of plotly. While trying to solve this problem, I found Issue #4373 for plotly.js, which describes the lack of a sort feature, and Issue #3002 for plotly.py, which states that node.x and node.y cannot be 0.

My use case is that I want to produce a large set of sankey graphs for flows between 5 specific nodes at Time1 and 5 specific nodes at Time2. For this reason, I would like my nodes to be drawn in the same order every time, no matter the size of the nodes or flows. I wrote a script to dynamically compute the node.y positions from each node's order and size. Even this workaround runs into problems, as noted in the code below.

Minimally, I guess I'm looking for more detailed documentation about node.x and node.y compared to what is currently in the reference page.

More broadly, why is the data frame order of the nodes being overridden, such as in the varying flows example (fig1) below?

library(plotly)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout
library(tidyverse)

my_labels <-
  c(
    "Node 0",
    "Node 1",
    "Node 2",
    "Node 3",
    "Node 4",
    "Node 5",
    "Node 6",
    "Node 7",
    "Node 8",
    "Node 9"
  )

# Uses original data, which includes some flows much larger than others
source_ids <-
  c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
target_ids <-
  c(5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9)
varying_flows <-
  c(60, 23, 1, 0, 9, 15, 33, 13, 4, 3, 0, 9, 8, 2, 1, 0, 4, 12, 127, 9, 4, 4, 1, 11, 1)

my_varying_flows <- data.frame(source_ids, target_ids, varying_flows)

fig1 <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = my_labels), 
  link = list(
    source = my_varying_flows$source_ids,
    target = my_varying_flows$target_ids,
    value = my_varying_flows$varying_flows))

fig1 <- fig1 %>%
  layout(
    title = list(
      text = "fig1 - varying flows out of order"
    )
  )

# Nodes do not appear in intended order. Node 3, the largest node, appears below
# Node 4, and the right hand nodes are also out of order.

fig1


# Build a new set of test data with even, identical flows
even_flows <- rep(10, times = 25)
my_even_flows <- data.frame(source_ids, target_ids, even_flows)

fig2 <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = my_labels),
  link = list(
    source = my_even_flows$source_ids,
    target = my_even_flows$target_ids,
    value = my_even_flows$even_flows))

fig2 <- fig2 %>%
  layout(
    title = list(
      text = "fig2 - even flows in order"
    )
  )

# Displays nodes in intended order, apparently because something behind the
# scenes likes the even flows and keeps the default arrangement.
fig2


# Workaround to dynamically determine node.y positions relative to size of nodes
# and sorting order in original data. But even this behaves in unexpected ways,
# and in the node.y argument we need to take the complement of them (i.e., 1 -
# the value generated here).

label_pos_dfs <-
  list(
    # Label positions of source node labels
    my_varying_flows %>%
      group_by(source_ids) %>%
      summarize(n = sum(varying_flows)) %>%
      rename(node.name = source_ids) %>%
      mutate(label.pos = 1 - (cumsum(n) - n/2) / sum(n)),

    # Label positions of target node labels
    my_varying_flows %>%
      group_by(target_ids) %>%
      summarize(n = sum(varying_flows)) %>%
      rename(node.name = target_ids) %>%
      mutate(label.pos = 1 - (cumsum(n) - n/2) / sum(n))
  )

my_node_label_y_positions <- 
  lapply(label_pos_dfs, "[", "label.pos") %>% 
  bind_rows() %>% 
  pull(label.pos) 

fig3 <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = my_labels,

    # Avoiding 0 values seemed to help
    x = c(1e-03, 1e-03, 1e-03, 1e-03, 1e-03, 1, 1, 1, 1, 1),

    # Not clear to me why these didn't work and we instead need their
    # complements (e.g., 1 - original value) for correct placement, as if the
    # node.y positions were the distance from the top, not the bottom?
    y = 1 - my_node_label_y_positions),

  link = list(
    source = my_varying_flows$source_ids,
    target = my_varying_flows$target_ids,
    value = my_varying_flows$varying_flows))

fig3 <- fig3 %>%
  layout(
    title = list(
      text = "fig3 - varying flows in intended order with odd workaround!"
    )
  )

# Nodes appear in intended order. 

# fig3

fig3

Created on 2022-01-29 by the reprex package (v2.0.1)
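
For reference, the fig3 workaround can be condensed into a small base R helper. This is only a sketch: `node_mid_from_top` and its `eps` argument are hypothetical names, not part of plotly, and it assumes (as fig3 suggests) that node.y behaves like a distance from the top of the plot and that values of exactly 0 should be avoided. It uses the same flows as the reprex above.

```r
# Hypothetical helper (not a plotly function): midpoint of each node as a
# fraction of the column height, measured from the top, in the order given.
node_mid_from_top <- function(totals, eps = 1e-3) {
  totals <- as.numeric(totals)
  stopifnot(all(totals >= 0), sum(totals) > 0)
  mid <- (cumsum(totals) - totals / 2) / sum(totals)
  # Assumption: exact zeros are ignored by the layout, so nudge them up.
  pmax(mid, eps)
}

source_ids <- rep(0:4, each = 5)
target_ids <- rep(5:9, times = 5)
varying_flows <- c(60, 23, 1, 0, 9, 15, 33, 13, 4, 3, 0, 9, 8, 2, 1,
                   0, 4, 12, 127, 9, 4, 4, 1, 11, 1)

# tapply() returns totals sorted by node id, which matches the label order here.
source_totals <- tapply(varying_flows, source_ids, sum)
target_totals <- tapply(varying_flows, target_ids, sum)

node_y <- c(node_mid_from_top(source_totals),
            node_mid_from_top(target_totals))
node_x <- c(rep(1e-3, 5), rep(1, 5))
```

The resulting vectors can be passed as `x = node_x, y = node_y` inside `node = list(...)`, in place of the hard-coded values in fig3.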

kforthman commented 1 year ago

I would like to thank @even-of-the-hour for sharing this. This is a great solution for sorting the nodes exactly how you want them. I extended the code to handle 3 levels and wanted to share it in case it helps anyone else.

# Creating dummy data; 3 levels, each with 4 nodes.
d1 <- 
  c(0:3) %>% rep(4) %>% rep(4) %>% sort
d2 <-
  c(4:7) %>% rep(4) %>% sort %>% rep(4)
d3 <-
  c(8:11) %>% sort %>% rep(4) %>% rep(4)
varying_flows <- rpois(64,0.25)

my_labels <- paste0("Node ", 1:12)

my_varying_flows <- data.frame(d1, d2, d3, varying_flows)

# Convert the data to the format required by sankey function
for (i in 1:2) {
  group1 <- c("d1", "d2")[i]
  group2 <- c("d2", "d3")[i]
  my_varying_flows_thick <- my_varying_flows %>%
    group_by(!!as.name(group1), !!as.name(group2)) %>%
    summarise(sum(varying_flows))
  colnames(my_varying_flows_thick) <- c("source", "target", "thickness")

  source.label.pos <- my_varying_flows_thick %>%
    group_by(source) %>%
    summarize(n = sum(thickness)) %>%
    mutate(source.label.pos = 1 - (cumsum(n) - n / 2) / sum(n))

  target.label.pos <- my_varying_flows_thick %>%
    group_by(target) %>%
    summarize(n = sum(thickness)) %>%
    mutate(target.label.pos = 1 - (cumsum(n) - n / 2) / sum(n))

  my_varying_flows_thick$source.label.pos <-
    source.label.pos$source.label.pos[match(my_varying_flows_thick$source, source.label.pos$source)]
  my_varying_flows_thick$target.label.pos <-
    target.label.pos$target.label.pos[match(my_varying_flows_thick$target, target.label.pos$target)]

  if (i == 1) {
    my_varying_flows_data <- my_varying_flows_thick
    next
  }
  my_varying_flows_data <- rbind(my_varying_flows_data, my_varying_flows_thick)
}

# Calculate x,y position
node_x <- sort(rep(c(0:2),4))/2 + c(rep(0.001, 4), rep(0,8))

node_y <- my_varying_flows_data[,c("source","source.label.pos")] %>% group_by() %>% unique %>% 
  select(source.label.pos) %>% 
  unlist %>% as.numeric
node_y <- c(node_y,my_varying_flows_data[,c("target","target.label.pos")] %>% group_by() %>% 
              filter(!target %in% my_varying_flows_data$source) %>% unique %>% 
              select(target.label.pos) %>% 
              unlist %>% as.numeric)
node_y <- node_y * -1 + max(node_y)
node_y <- node_y %>% round(3)
node_y[node_y == 0] <- 0.001
node_y

# Plot
fig4 <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = my_labels,

    # Avoiding 0 values seemed to help
    x = node_x,

    # Not clear to me why these didn't work and we instead need their
    # complements (e.g., 1 - original value) for correct placement, as if the
    # node.y positions were the distance from the top, not the bottom?
    y = node_y
  ),
  link = list(
    source = my_varying_flows_data$source,
    target = my_varying_flows_data$target,
    value = my_varying_flows_data$thickness
    )
  )

fig4 <- fig4 %>%
  layout(
    title = list(
      text = "fig4 - varying flows in intended order with odd workaround;3 levels"
    )
  )

# Nodes appear in intended order. 

fig4
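
The same idea can be written for any number of columns. The sketch below is a hypothetical base R helper (`sankey_positions`, `node_col`, and `eps` are names made up for illustration, not plotly API). It assumes node ids are 0-based and in label order, that a node's height follows the larger of its total inflow and outflow, and that positions of exactly 0 should be nudged upward, as in the code above.

```r
# Hypothetical helper, not a plotly function.
# links:    data frame with 0-based columns `source` and `target`, plus `value`.
# node_col: column index (1 = leftmost) of each node, in label order.
sankey_positions <- function(links, node_col, eps = 1e-3) {
  stopifnot(max(node_col) > 1)
  ids <- seq_along(node_col) - 1
  # Total flow leaving and entering each node (0 when a node has none).
  out_sum <- vapply(ids, function(i) sum(links$value[links$source == i]), numeric(1))
  in_sum  <- vapply(ids, function(i) sum(links$value[links$target == i]), numeric(1))
  size <- pmax(out_sum, in_sum)  # assumption: height follows the larger side

  y <- numeric(length(node_col))
  for (k in unique(node_col)) {
    idx <- which(node_col == k)  # nodes of this column, in label order
    s <- size[idx]
    stopifnot(sum(s) > 0)
    y[idx] <- (cumsum(s) - s / 2) / sum(s)  # midpoint, measured from the top
  }
  x <- (node_col - 1) / (max(node_col) - 1)
  list(x = pmax(x, eps), y = pmax(y, eps))
}

# Small made-up example: 6 nodes in 3 columns.
links <- data.frame(
  source = c(0, 0, 1, 2, 2, 3),
  target = c(2, 3, 2, 4, 5, 5),
  value  = c(3, 1, 2, 4, 1, 1)
)
pos <- sankey_positions(links, node_col = c(1, 1, 2, 2, 3, 3))
```

`pos$x` and `pos$y` would then go into `node = list(x = pos$x, y = pos$y, ...)`. Whether plotly honors them for every dataset is exactly the open question in this issue, so treat the output as a starting point.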
