Closed krivit closed 1 year ago
@krivit, thanks for your patience.
This was driven by the desire for a reverse behavior for as.network.data.frame() that I believe originally came up here: https://github.com/statnet/network/pull/20#issuecomment-567620332, but I don't recall considering modifying as_tibble.network(), as this was after as.network.data.frame() was largely sorted out and because it would have required gutting as_tibble.network() (which I believe was already on CRAN at the time).
While they do both technically return a data.frame, their behaviors differ because as.data.frame.network() attempts to reconstruct the original data frames that could've been passed to as.network.data.frame() in a way that facilitates reverse verification -- and the test battery surrounding that is pretty rigorous at this point.
At a glance, modifying as_tibble.network() to get the same behavior as as.data.frame.network() would require:
1) changing list.vertex.attributes() and list.edge.attributes() to not sort attribute names before returning them
2) returning "vertex.names" (when present) instead of their indices
3) defaulting to not returning the "na" edge/vertex attributes
4) giving as_tibble.network() some help to preserve "sticky" attributes (of the R variety)
Here's an example that hits all of these incompatibilities (again, this is just at a quick glance -- I'd be shocked if others aren't lurking):
suppressPackageStartupMessages(library(network))
edge_df <- data.frame(
  .tail = c("b", "c", "c", "d", "d", "e"),
  .head = c("a", "b", "a", "a", "b", "a"),
  lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  int_attr = as.integer(seq_len(6)),
  dbl_attr = as.double(seq_len(6)),
  chr_attr = LETTERS[1:6],
  date_attr = seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1),
  dttm_attr = as.POSIXct(
    seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1)
  ),
  stringsAsFactors = FALSE
)
g <- as.network(edge_df)
tibble::as_tibble(g, attrnames = TRUE)
#> # A tibble: 6 x 9
#> .tail .head chr_attr date_attr dbl_attr dttm_attr int_attr lgl_attr na
#> <int> <int> <chr> <dbl> <dbl> <dbl> <int> <lgl> <lgl>
#> 1 1 5 A 18252 1 1576972800 1 TRUE FALSE
#> 2 2 1 B 18253 2 1577059200 2 FALSE FALSE
#> 3 2 5 C 18254 3 1577145600 3 TRUE FALSE
#> 4 3 5 D 18255 4 1577232000 4 FALSE FALSE
#> 5 3 1 E 18256 5 1577318400 5 TRUE FALSE
#> 6 4 5 F 18257 6 1577404800 6 FALSE FALSE
tibble::as_tibble(as.data.frame(g, attrnames = TRUE))
#> # A tibble: 6 x 8
#> .tail .head lgl_attr int_attr dbl_attr chr_attr date_attr dttm_attr
#> <chr> <chr> <lgl> <int> <dbl> <chr> <date> <dttm>
#> 1 b a TRUE 1 1 A 2019-12-22 2019-12-21 19:00:00
#> 2 c b FALSE 2 2 B 2019-12-23 2019-12-22 19:00:00
#> 3 c a TRUE 3 3 C 2019-12-24 2019-12-23 19:00:00
#> 4 d a FALSE 4 4 D 2019-12-25 2019-12-24 19:00:00
#> 5 d b TRUE 5 5 E 2019-12-26 2019-12-25 19:00:00
#> 6 e a FALSE 6 6 F 2019-12-27 2019-12-26 19:00:00
identical(
  edge_df,
  as.data.frame(g)
)
#> [1] TRUE
1) as.data.frame.network() maintains the original column order, while as_tibble.network() sorts the columns by name
2) as.data.frame.network() returns the original .tail and .head columns (using vertex names), while as_tibble.network() returns the indices
3) as.data.frame.network() excludes the "na" attribute by default, as it's not likely to exist in the original data frame, but it still provides a way of getting it (attrs_to_ignore = NULL)
4) as.data.frame.network() reconstructs the date_attr and dttm_attr columns correctly by retrieving and distributing the original attributes, while as_tibble.network() strips them.
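Point 3 is easy to check in isolation. The following is a minimal sketch, assuming an installed version of network whose as.data.frame.network() accepts the attrs_to_ignore argument mentioned above; the three-edge data frame `el` is made up purely for illustration.

```r
library(network)

# Hypothetical three-edge list, used only for illustration.
el <- data.frame(
  .tail = c("b", "c", "c"),
  .head = c("a", "b", "a"),
  weight = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)
g <- as.network(el)

# Default call: the "na" edge attribute is left out.
default_cols <- names(as.data.frame(g))

# attrs_to_ignore = NULL asks for every edge attribute, "na" included.
all_cols <- names(as.data.frame(g, attrs_to_ignore = NULL))

# Columns that only show up when nothing is ignored.
setdiff(all_cols, default_cols)
```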
At the end of the day, we want to know that round-trips are safe so...
suppressPackageStartupMessages(library(network))
edge_df <- data.frame(
  .tail = c("b", "c", "c", "d", "d", "e"),
  .head = c("a", "b", "a", "a", "b", "a"),
  lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  int_attr = as.integer(seq_len(6)),
  dbl_attr = as.double(seq_len(6)),
  chr_attr = LETTERS[1:6],
  date_attr = seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1),
  dttm_attr = as.POSIXct(
    seq.Date(as.Date("2019-12-22"), as.Date("2019-12-27"), by = 1)
  ),
  stringsAsFactors = FALSE
)
vertex_df <- data.frame(
  vertex.names = letters[5:1],
  lgl_attr = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  int_attr = as.integer(seq_len(5)),
  dbl_attr = as.double(seq_len(5)),
  chr_attr = LETTERS[1:5],
  date_attr = seq.Date(as.Date("2019-12-22"), as.Date("2019-12-26"), by = 1),
  dttm_attr = as.POSIXct(
    seq.Date(as.Date("2019-12-22"), as.Date("2019-12-26"), by = 1)
  ),
  stringsAsFactors = FALSE
)
attr(vertex_df$date_attr, "tzone") <- "PST"
attr(vertex_df$dttm_attr, "tzone") <- "EST"
vertex_df$list_attr <- replicate(5, LETTERS, simplify = FALSE)
vertex_df$mat_list_attr <- replicate(5, as.matrix(mtcars), simplify = FALSE)
vertex_df$df_list_attr <- replicate(5, mtcars, simplify = FALSE)
... that this...
g2 <- as.network(edge_df, vertices = vertex_df)
identical(
  g2,
  as.network(as.data.frame(g2), vertices = as.data.frame(g2, unit = "vertices"))
)
#> [1] TRUE
... works.
Thanks! I think that it might be possible to deprecate as_tibble.network(), provided no functionality is lost. (Removing dependence on tibble might also be a good thing on principle.)
Starting with as.data.frame.network(), we can get almost all functionality of as_tibble.network() as follows:
What's left is extracting specific edge/vertex attributes, but that's trivial for the end-user to do.
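For what it's worth, here is a rough sketch of what that end-user step could look like with plain base R subsetting. It is not code from the package or from this thread; the data frame `el` and the `wanted` vector are hypothetical, and it assumes as.data.frame.network() returns the .tail/.head and vertex.names columns shown in the examples above.

```r
library(network)

# Hypothetical edge list with two edge attributes, for illustration only.
el <- data.frame(
  .tail = c("b", "c", "c"),
  .head = c("a", "b", "a"),
  weight = c(1.5, 2.5, 3.5),
  kind = c("x", "y", "x"),
  stringsAsFactors = FALSE
)
g <- as.network(el)

# Pull everything out once, then keep only the columns of interest.
edges <- as.data.frame(g)
wanted <- c(".tail", ".head", "weight")
edge_subset <- edges[, wanted, drop = FALSE]

# The same idea works for vertex attributes.
vertices <- as.data.frame(g, unit = "vertices")
vertex_names <- vertices[["vertex.names"]]
```

Since the result is an ordinary data.frame, any other subsetting idiom (subset(), dplyr::select(), and so on) would do just as well.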
Method is implemented.
I can't believe I hadn't noticed this in the original PR: the two functions duplicate code and functionality, even though a tibble is, technically, a data.frame. @knapply, what do you think? Was there any particular reason for reimplementing rather than adapting the existing implementation?