kaijagahm / vultureUtils

Utility functions for working with vulture data

Isolated nodes not included in edge lists #42

Open kaijagahm opened 1 year ago

kaijagahm commented 1 year ago

When calculating graph densities for my permuted networks, I suddenly realized, with a sinking feeling, that the edge lists produced by get*Edges() don't include isolated nodes. This is a pretty big problem and has the potential to significantly affect density measures and other network-level metrics.

Luckily, isolated nodes can be added at the makeGraph stage by specifying the vertices argument in igraph::graph_from_data_frame(). For example, if nodes A, B, and C are included in the edgelist passed to makeGraph, but somewhere in makeGraph a vector of vertices A, B, C, and D is passed as vertices, then the resulting graph will include the specified edges as well as D as an isolated node.
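A minimal sketch of that behavior, using toy node names rather than real vulture data. The edge list mentions only A, B, and C; supplying a vertex data frame that also lists D yields a graph where D is present with degree 0, and the density changes accordingly.

```r
library(igraph)

# Toy edge list: D never appears because it has no edges
edges <- data.frame(from = c("A", "B"), to = c("B", "C"))

# Without `vertices`, the graph only knows about nodes that appear in an edge
g_edges_only <- graph_from_data_frame(edges, directed = FALSE)

# With `vertices`, every listed node is included, even if isolated.
# The first column of the vertex data frame is used as the vertex name.
all_vertices <- data.frame(name = c("A", "B", "C", "D"))
g_all <- graph_from_data_frame(edges, directed = FALSE, vertices = all_vertices)

vcount(g_edges_only)        # 3
vcount(g_all)               # 4
degree(g_all)[["D"]]        # 0
edge_density(g_edges_only)  # 2 edges / 3 possible pairs
edge_density(g_all)         # 2 edges / 6 possible pairs
```

Dropping the isolated node doubles the apparent density in this toy case, which is the kind of distortion described above.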

One way to do this would be to surface the igraph::graph_from_data_frame() vertices argument in makeGraph, to allow the user to pass a list of vertices to makeGraph. That means makeGraph would need to take two inputs: an edgelist and a vertex vector/data frame.

This approach makes me nervous, because I know how many errors tend to get introduced into the code when you require the user to 1) correctly create two different objects and then 2) correctly pair the objects together to pass them into a function call. What if you pass the right edge list but the wrong vertex list? Or vice versa? I just know I'm going to mess this up. It wouldn't be a big deal if the list of vertices to be included was always the same. But it won't be! Before creating edge lists, I do all kinds of filtering (by geography, by time interval, etc.). I want to include an individual in the network if and only if it was present in the pre-edgelist dataset, whether or not it participated in whatever type of edge is being specified.

The problem is, there's no obvious way to include isolated nodes in an edge list.

So, I came up with a sort of hacky solution. I created an option to attach the vector of vertices as a "rider" to the output of get*Edges(). Now, if you specify the argument includeAllVertices = T in get*Edges() (the default is F), instead of just returning an edgelist (aka a data frame), the function will return a list containing two items. $edges is a data frame/edgelist, and $allVertices is a vector of all vertices. If you specify includeAllVertices = F (the default), then the function will still just return a simple data frame/edgelist.
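To make the "rider" idea concrete, here is a rough sketch of the return-shape logic. The function getEdgesSketch(), the column names id and event, and the co-occurrence rule for defining an edge are all hypothetical stand-ins, not the package's actual get*Edges() code; only the includeAllVertices argument and the $edges / $allVertices names come from the description above.

```r
# Hypothetical stand-in for a get*Edges() function (not the package's code).
# Assumes `dataset` has character column `id` and a column `event`, and that
# two individuals share an edge if they were seen at the same event.
getEdgesSketch <- function(dataset, includeAllVertices = FALSE) {
  # Self-join on event to pair up individuals seen together
  pairs <- merge(dataset, dataset, by = "event", suffixes = c("1", "2"))
  # Keep each unordered pair once and drop self-pairs
  edges <- unique(pairs[pairs$id1 < pairs$id2, c("id1", "id2")])
  rownames(edges) <- NULL

  if (!includeAllVertices) {
    return(edges)  # default: a plain edge list
  }
  # Optional: attach every individual in the filtered dataset as a "rider"
  list(edges = edges, allVertices = sort(unique(dataset$id)))
}

# D is present in the (already filtered) dataset but never co-occurs with anyone
dataset <- data.frame(
  id = c("A", "B", "B", "C", "D"),
  event = c(1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

plainEdges <- getEdgesSketch(dataset)
withRider <- getEdgesSketch(dataset, includeAllVertices = TRUE)
withRider$allVertices  # includes "D" even though D has no edges
```

Because the vertex vector is derived from the same filtered dataset as the edges, the two pieces cannot drift apart the way two separately constructed objects could.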

Next, I modified the makeGraph function so that it does take a vertices argument. This is a partial concession to my idea above--technically, you can pass in any vector you like to that argument. But it provides an easy path to do a call like this:

g <- vultureUtils::makeGraph(edgelistObject$edges, weighted = T, vertices = edgelistObject$allVertices) # calling different elements of the same list to pass as the different arguments to makeGraph.
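As an illustration only, a wrapper along these lines could also guard against the mismatch worry raised earlier (right edge list, wrong vertex list). makeGraphSketch() is a hypothetical simplification, not the real vultureUtils::makeGraph(): it ignores weights, and the explicit check is an assumption about what might be useful rather than something the package is known to do.

```r
# Hypothetical, simplified stand-in for makeGraph() (unweighted, undirected).
# Assumes the first two columns of `edges` are the endpoints and `vertices`
# is a character vector of node names (or NULL).
makeGraphSketch <- function(edges, vertices = NULL) {
  if (!is.null(vertices)) {
    # Fail early, with a readable message, if the two inputs don't match
    endpoints <- unique(c(as.character(edges[[1]]), as.character(edges[[2]])))
    missingNodes <- setdiff(endpoints, vertices)
    if (length(missingNodes) > 0) {
      stop("Edge list contains nodes not in `vertices`: ",
           paste(missingNodes, collapse = ", "))
    }
    vertices <- data.frame(name = vertices)
  }
  igraph::graph_from_data_frame(edges, directed = FALSE, vertices = vertices)
}

edgelistObject <- list(
  edges = data.frame(id1 = c("A", "B"), id2 = c("B", "C")),
  allVertices = c("A", "B", "C", "D")
)

g <- makeGraphSketch(edgelistObject$edges, vertices = edgelistObject$allVertices)
igraph::vcount(g)  # 4: D is kept as an isolated node
```

igraph::graph_from_data_frame() already raises an error when an edge refers to a vertex missing from the vertex data frame; the check here just surfaces the problem with a message phrased in terms of the wrapper's own arguments.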

I was a little hasty implementing this, so PRs 38, 39, 40, and 41 all pertain to the above-described fix.

kaijagahm commented 1 year ago

^ (Everything above is from yesterday.)

Today I did some more thinking and came up with two more potential solutions for including isolated nodes:

1) Create a full list of self edges and append it to the returned edgelist. Then, during or after graph creation, simplify the graph to remove the self edges (but leave the isolated nodes). I tried a minimal example of this and it seemed to work, but when I implemented it in the package, it didn't. I didn't try for very long, because it seemed like a waste of time when I already had a semi-working solution.

2) Instead of self edges, use edges with NA as one of the endpoints, or at least use that as a "carrier" of nodes not included elsewhere, to be split apart and modified inside the makeGraph function. Some demonstration of that here. I actually think this would be a marginally more elegant solution than what I have, since it wouldn't require dealing with as many list objects. But it would take a while to implement, and it would also make the inclusion of those NAs a bit cryptic (since they'd just be in the data frame masquerading as normal edges). By default, igraph converts NA to character and actually names the node "NA", which is super annoying and unhelpful. So I foresee friction down the road if I or someone else forgets to do the extra step of removing those nodes.

It seems preferable to make the default case the simpler one, with the optional includeAllVertices = T argument to slightly simplify the complicated case.
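For reference, a minimal sketch of the self-edge idea on toy data. This only reproduces the kind of minimal example mentioned above; it says nothing about why the approach failed inside the package, and it assumes an unweighted edge list (a weighted one would also need placeholder weights on the self edges).

```r
library(igraph)

edges <- data.frame(from = c("A", "B"), to = c("B", "C"))
allVertices <- c("A", "B", "C", "D")

# One self edge per vertex, so every vertex appears in the edge list
selfEdges <- data.frame(from = allVertices, to = allVertices)
withSelf <- rbind(edges, selfEdges)

g <- graph_from_data_frame(withSelf, directed = FALSE)

# Remove the self edges again; the vertices themselves stay in the graph
g <- simplify(g, remove.multiple = FALSE, remove.loops = TRUE)

vcount(g)  # 4
ecount(g)  # 2
```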

kaijagahm commented 1 year ago

This will be partially addressed by the SRI calculation, which produces an SRI data frame including all individuals in the dataset, giving SRI values of 0 where appropriate.