statnet / network

Classes for Relational Data

Merge the implementations of as.data.frame.network() and as_tibble.network(). #49

Closed krivit closed 1 year ago

krivit commented 3 years ago

I can't believe I hadn't noticed this in the original PR: the two functions duplicate code and functionality, even though a tibble is, technically, a data.frame. @knapply, what do you think? Was there any particular reason for reimplementing rather than adapting the existing implementation?

knapply commented 3 years ago

@krivit, thanks for your patience.

This was driven by the desire for a reverse of as.network.data.frame(), which I believe first came up here: https://github.com/statnet/network/pull/20#issuecomment-567620332. I don't recall considering modifying as_tibble.network(): by then as.network.data.frame() was largely sorted out, and the change would have required gutting as_tibble.network() (which I believe was already on CRAN at the time).

While they do both technically return a data.frame, their behaviors differ because as.data.frame.network() attempts to reconstruct the original data frames that could've been passed to as.network.data.frame() in a way that facilitates reverse verification -- and the test battery surrounding that is pretty rigorous at this point.

At a glance, modifying as_tibble.network() to get the same behavior as as.data.frame.network() would require:

  1. Changing list.vertex.attributes() and list.edge.attributes() to not sort attribute names before returning them.
  2. Returning "vertex.names" (when present) instead of their indices.
  3. Defaulting to not returning the "na" edge/vertex attributes.
  4. Giving as_tibble.network() some help so that "sticky" attributes (of the R variety, such as the class of a Date column) survive the conversion.

Here's an example that hits all of these incompatibilities (again, this is just at a quick glance -- I'd be shocked if others aren't lurking):

suppressPackageStartupMessages(library(network))

edge_df <- data.frame(
  .tail = c("b", "c", "c", "d", "d", "e"),
  .head = c("a", "b", "a", "a", "b", "a"),
  lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  int_attr = as.integer(seq_len(6)),
  dbl_attr = as.double(seq_len(6)),
  chr_attr = LETTERS[1:6],
  date_attr = seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"),
                       by = 1),
  dttm_attr = as.POSIXct(
    seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1)
  ),
  stringsAsFactors = FALSE
)

g <- as.network(edge_df)

tibble::as_tibble(g, attrnames = TRUE)
#> # A tibble: 6 x 9
#>   .tail .head chr_attr date_attr dbl_attr  dttm_attr int_attr lgl_attr na   
#>   <int> <int> <chr>        <dbl>    <dbl>      <dbl>    <int> <lgl>    <lgl>
#> 1     1     5 A            18252        1 1576972800        1 TRUE     FALSE
#> 2     2     1 B            18253        2 1577059200        2 FALSE    FALSE
#> 3     2     5 C            18254        3 1577145600        3 TRUE     FALSE
#> 4     3     5 D            18255        4 1577232000        4 FALSE    FALSE
#> 5     3     1 E            18256        5 1577318400        5 TRUE     FALSE
#> 6     4     5 F            18257        6 1577404800        6 FALSE    FALSE
tibble::as_tibble(as.data.frame(g, attrnames = TRUE))
#> # A tibble: 6 x 8
#>   .tail .head lgl_attr int_attr dbl_attr chr_attr date_attr  dttm_attr          
#>   <chr> <chr> <lgl>       <int>    <dbl> <chr>    <date>     <dttm>             
#> 1 b     a     TRUE            1        1 A        2019-12-22 2019-12-21 19:00:00
#> 2 c     b     FALSE           2        2 B        2019-12-23 2019-12-22 19:00:00
#> 3 c     a     TRUE            3        3 C        2019-12-24 2019-12-23 19:00:00
#> 4 d     a     FALSE           4        4 D        2019-12-25 2019-12-24 19:00:00
#> 5 d     b     TRUE            5        5 E        2019-12-26 2019-12-25 19:00:00
#> 6 e     a     FALSE           6        6 F        2019-12-27 2019-12-26 19:00:00

identical(
  edge_df,
  as.data.frame(g)
)
#> [1] TRUE

  1. as.data.frame.network() maintains the original column order, but as_tibble.network() sorts the columns by name.
  2. as.data.frame.network() returns the original .tail and .head columns (using vertex names), while as_tibble.network() returns the indices.
  3. as.data.frame.network() excludes the "na" attribute by default (as it's not likely to exist in the original data frame), but it still provides a way of getting it (attrs_to_ignore = NULL).
  4. as.data.frame.network() reconstructs the date_attr and dttm_attr columns correctly by retrieving and distributing the original attributes, while as_tibble.network() strips them.
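For anyone unfamiliar with the "sticky" attribute problem in the last point, here is a minimal base-R sketch of how it can arise. It does not reproduce what either method does internally; it only assumes that per-edge values are held as list elements and then flattened, which is enough to lose the Date class. The variable names are made up for illustration.

```r
# Illustration only: flattening a list of Date values drops the class,
# leaving the underlying day counts (compare the <dbl> date_attr column above).
dates <- as.Date(c("2019-12-22", "2019-12-23"))

# One value per "edge", each stored as its own list element.
per_edge <- lapply(seq_along(dates), function(i) dates[i])

# unlist() keeps the numbers but not the "Date" class.
flat <- unlist(per_edge)
class(flat)

# Re-applying the original class restores the column.
restored <- structure(flat, class = class(dates))
class(restored)
```

The same idea applies to POSIXct columns, where the tzone attribute would need to be carried along with the class.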

At the end of the day, we want to know that round-trips are safe so...

suppressPackageStartupMessages(library(network))

edge_df <- data.frame(
  .tail = c("b", "c", "c", "d", "d", "e"),
  .head = c("a", "b", "a", "a", "b", "a"),
  lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  int_attr = as.integer(seq_len(6)),
  dbl_attr = as.double(seq_len(6)),
  chr_attr = LETTERS[1:6],
  date_attr = seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"),
                       by = 1),
  dttm_attr = as.POSIXct(
    seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1)
  ),
  stringsAsFactors = FALSE
)

vertex_df <- data.frame(vertex.names = letters[5:1],
                        lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                        int_attr = as.integer(seq_len(5)),
                        dbl_attr = as.double(seq_len(5)),
                        chr_attr = LETTERS[1:5],
                        date_attr = seq.Date(as.Date("2019-12-22"),
                                             as.Date("2019-12-26"),
                                             by = 1),
                        dttm_attr = as.POSIXct(
                          seq.Date(as.Date("2019-12-22"), as.Date("2019-12-26"), by = 1)
                        ),
                        stringsAsFactors = FALSE)
attr(vertex_df$date_attr, "tzone") <- "PST"
attr(vertex_df$dttm_attr, "tzone") <- "EST"
vertex_df$list_attr <- replicate(5, LETTERS, simplify = FALSE)
vertex_df$mat_list_attr <- replicate(5, as.matrix(mtcars), simplify = FALSE)
vertex_df$df_list_attr <- replicate(5, mtcars, simplify = FALSE)

... that this...

g2 <- as.network(edge_df, vertices = vertex_df)

identical(
  g2,
  as.network(as.data.frame(g2), vertices = as.data.frame(g2, unit = "vertices"))
)
#> [1] TRUE

... works.

krivit commented 3 years ago

Thanks! I think that it might be possible to deprecate as_tibble.network(), provided no functionality is lost. (Removing dependence on tibble might also be a good thing on principle.)

Starting with as.data.frame.network(), we can get almost all functionality of as_tibble.network() as follows:

  1. Add an option to sort columns.
  2. Add an option to return vertex indices or names.
  3. Add an option to store edge IDs.
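
As a rough sketch of what the first two options could look like from the outside, the wrapper below post-processes the result of as.data.frame() on a network. The helper edges_like_tibble() and its arguments sort_columns and use_indices are hypothetical names chosen for illustration, not part of the package, and the sketch assumes the endpoint columns hold vertex names that can be matched against network.vertex.names(). Edge IDs (the third option) are not covered.

```r
library(network)

# Hypothetical helper (not in the network package): mimics two
# as_tibble.network() behaviours on top of as.data.frame.network().
edges_like_tibble <- function(x, sort_columns = FALSE, use_indices = FALSE) {
  out <- as.data.frame(x)                      # edge data frame by default
  endpoints <- c(".tail", ".head")
  attrs <- setdiff(names(out), endpoints)
  if (sort_columns) attrs <- sort(attrs)       # attribute columns by name
  out <- out[, c(endpoints, attrs), drop = FALSE]
  if (use_indices) {
    # Map vertex names back to vertex indices (assumes names are unique).
    vnames <- network.vertex.names(x)
    out$.tail <- match(out$.tail, vnames)
    out$.head <- match(out$.head, vnames)
  }
  out
}

edge_df <- data.frame(
  .tail = c("b", "c", "c"),
  .head = c("a", "b", "a"),
  weight = c(1.5, 2.5, 3.5),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
g <- as.network(edge_df)
res <- edges_like_tibble(g, sort_columns = TRUE, use_indices = TRUE)
```

Keeping both switches off by default would leave the round-trip behaviour of as.data.frame.network() untouched.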

What's left is extracting specific edge/vertex attributes, but that's trivial for the end-user to do.
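
For instance, picking out particular edge attributes could be done with ordinary column subsetting on the returned data frame. This is a minimal sketch; the attribute names weight and label and the variable names are made up for the example.

```r
library(network)

edge_df <- data.frame(
  .tail = c("b", "c", "c"),
  .head = c("a", "b", "a"),
  weight = c(1.5, 2.5, 3.5),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
g <- as.network(edge_df)

# Convert once, then keep only the columns of interest.
all_edges <- as.data.frame(g)
wanted <- c(".tail", ".head", "weight")
selected <- all_edges[, intersect(wanted, names(all_edges)), drop = FALSE]
```

The same pattern should work for vertex attributes via as.data.frame(g, unit = "vertices").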

mbojan commented 1 year ago

Method is implemented.