tilltnet / egor

R Package for importing and analysing ego-centered-network data.
http://egor.tillt.net
GNU Affero General Public License v3.0

Changing the egor object format & collecting use cases #27

Closed tilltnet closed 4 years ago

tilltnet commented 6 years ago

We (@martinamorris, @raffaelevacca, @mbojan, @krivit, @tilltnet) talked about the problems with the current egor object format at the Sunbelt in Utrecht, and @mbojan suggested that we would gain speed and a more accessible egor object if we switched from the current format, where alters (alts) and alter-alter ties (aaties) are stored in nested list columns of tibbles, to a format where the alts and aaties are stored as flat/ global dataframes with an egoID identifying the alts and aaties belonging to one ego. In egonetR those global data frames were present as well and I preferred them for many operations. After doing some speed testing with the current egor implementation, I agree with @mbojan's suggestion. With this change, egos with missing alters or missing aaties would no longer show up in per-ego results automatically, so more care would be needed to make result vectors appendable to the egos data frame. Still, the gains in speed for big datasets make this approach a sensible choice. At the same time manipulations of the alts and aaties data would be easier to execute, since no *apply/map operations would be needed for those.
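To make the two layouts concrete, here is a minimal base R sketch with toy data. The column names (egoID, altID, age) are assumptions for illustration, not the actual egor column names, and a plain data frame stands in for the tibble.

```r
# Toy data; all names (egoID, altID, age) are illustrative assumptions.
egos <- data.frame(egoID = c("e1", "e2", "e3"), stringsAsFactors = FALSE)

# Nested format: one data frame of alters per ego, stored in a list column.
nested <- egos
nested$alts <- list(
  data.frame(altID = 1:2, age = c(30, 41)),
  data.frame(altID = 1:3, age = c(25, 52, 38)),
  data.frame(altID = integer(0), age = numeric(0))  # e3 named no alters
)

# Flat/global format: a single alters table keyed by egoID.
n_alts <- vapply(nested$alts, nrow, integer(1))
alts_flat <- data.frame(
  egoID = rep(nested$egoID, times = n_alts),
  do.call(rbind, nested$alts),
  stringsAsFactors = FALSE
)
```

Note that e3 no longer appears anywhere in alts_flat. This is the "missing alters" situation that needs extra care in the flat format.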

We agreed to collect use cases of the egor object and talk about the respective formats that should be the first choice for those operations. I will start by listing those that I am aware of here. I will also propose some properties of the new egor object. In the end I will propose an additional object format that is organized in two data frames and which could be useful in some cases.

Use Cases

  1. Rename/ recode alt/ aatie variables (e.g. to make them correspond with ego variables). This is easiest and fastest with a flat/ global data frame. Lists of data frames/ tibbles would make lapply/map-like operations necessary, which are less easy for beginners to understand and more cumbersome for everyone.

  2. Conduct calculations on alts/ aaties per ego. Here lists of tibbles and flat/ global dataframes are on par. There should only be slight differences in speed and the handling is almost the same for both formats. Flat/ global dataframes could be processed by split()ing and then apply()ing, by aggregate()ing, or by going with dplyr's group_by() and do(). Lists of tibbles would simply be processed with apply()/map() commands, which would have slight speed benefits. All of these would need the users to understand the concept of the aggregate/lapply logic and the definition of (anonymous) functions.

  3. Combining alter-attributes and alter-alter ties for structural measures. I think the best format here is a flat/ global aaties data frame where the corresponding source and target alter attributes are added as additional columns. Another option for those operations is an igraph object, which has some operators for such procedures.

  4. Comparing/ combining ego and alter attributes. Here I would suggest using a flat/ global alts data frame with the ego attributes added as columns to and repeated for each alter of an ego. Alternatively mapply()/map2() could be employed here, but I feel that this would make the handling much harder. The current implementation of the comp_ply() command supports this.
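The four use cases above can be sketched on flat tables in base R as follows. This is only an illustration under assumed column names (egoID, altID, srcID, tgtID, sex); it is not how egor implements any of these operations.

```r
# Toy flat tables; column names are illustrative assumptions.
egos <- data.frame(egoID = c("e1", "e2"), sex = c("f", "m"),
                   stringsAsFactors = FALSE)
alts <- data.frame(egoID = c("e1", "e1", "e2", "e2", "e2"),
                   altID = c(1, 2, 1, 2, 3),
                   alter_sex = c("f", "m", "m", "m", "f"),
                   stringsAsFactors = FALSE)
aaties <- data.frame(egoID = c("e1", "e2", "e2"),
                     srcID = c(1, 1, 2),
                     tgtID = c(2, 2, 3),
                     stringsAsFactors = FALSE)

# Use case 1: rename an alter variable so it matches the ego variable.
names(alts)[names(alts) == "alter_sex"] <- "sex"

# Use case 2: a per-ego calculation (network size) via split() + sapply().
net_size <- sapply(split(alts$altID, alts$egoID), length)

# Use case 3: attach source and target alter attributes to the aaties.
src <- setNames(alts[c("egoID", "altID", "sex")], c("egoID", "srcID", "src_sex"))
tgt <- setNames(alts[c("egoID", "altID", "sex")], c("egoID", "tgtID", "tgt_sex"))
aaties_attr <- merge(merge(aaties, src), tgt)

# Use case 4: repeat ego attributes for each alter, then compare.
alts_ego <- merge(alts, egos, by = "egoID", suffixes = c("", "_ego"))
alts_ego$same_sex <- alts_ego$sex == alts_ego$sex_ego
homophily <- tapply(alts_ego$same_sex, alts_ego$egoID, mean)
```

Only use case 2 needs an anonymous-function style of thinking; the other three are plain renames and joins on the flat tables.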

Which use cases did I not cover? Or do you have any other opinions on which format is suited best for any of the operations listed?

How to deal with missing alts/aaties

The built-in functions could simply join their results with the egoID from the egos dataframe and return a vector that could then be appended to the egos dataframe without the need of the user doing a join/merge operation. The current comp_ply() function could be rewritten to allow for custom operations on the alts data that would also return easily appendable vectors.

If users do calculations with the split/*apply or group_by/do logic, they would need to make sure that they keep the egoID in order to be able to join/merge the results to the ego data frame. With dplyr's group_by/do combination this would automatically be the case.
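A minimal sketch of that idea in base R, assuming toy tables with an egoID column: the per-ego result is looked up by egoID in the row order of the egos table, so egos without alters get NA and the vector can be appended directly.

```r
egos <- data.frame(egoID = c("e3", "e1", "e2"), stringsAsFactors = FALSE)
alts <- data.frame(egoID = c("e1", "e1", "e2"), age = c(30, 40, 25),
                   stringsAsFactors = FALSE)  # e3 has no alters

# Per-ego summary on the flat alters table; e3 is absent from the result.
mean_age <- tapply(alts$age, alts$egoID, mean)

# Look the results up by egoID, in the row order of `egos`.
# Egos without alters get NA, so the vector can be appended directly.
egos$mean_alter_age <- as.vector(mean_age)[match(egos$egoID, names(mean_age))]
```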

Internals of the new egor object

An activate() function, as in tidygraph, should be introduced that activates the alts, egos or aaties. Currently the only object necessary to build an egor object is the alters; egos and aaties are optional. I think this makes sense, and the new egor object should be designed similarly in that respect. Consequently the alters data frame should be activated by default. Or should it be the egos data frame, if present? Executing the activate function on an egor object without further arguments would activate the default data frame, be it egos or alters. dplyr's functions and []-indexing would be applied to the activated data frame. []-indexing would maintain the current unit argument, defaulting to the activated level.

I can imagine two options for how the general mechanic of working with the egor object could be implemented:

  1. Only the activated data level is returned. Inactive data-levels would be stored as an attribute of the activated data frame.

  2. The egor object is a list of the three data frames representing the different data levels. The activation would merely determine how the object is printed and onto which data frame indexing operations are executed.

I think there is an advantage to the second option, since each object would have an 'address' that would always be the same. On the other hand, with the first option there wouldn't be a need to create as many S3 methods as with the second, since the regular dataframe/tibble methods would just work as expected on the activated data frame.
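For illustration, option 2 could look roughly like the following base R sketch. All names here (new_egor_sketch, activate_sketch, the "active" attribute, the egor_sketch class) are hypothetical and chosen for this example only; the sketch also ignores the cascading of deletions between levels.

```r
# Hypothetical constructor: a list of three data levels plus an "active" attribute.
new_egor_sketch <- function(egos, alts, aaties = NULL) {
  x <- list(egos = egos, alts = alts, aaties = aaties)
  attr(x, "active") <- "alts"   # alters are the default active level here
  class(x) <- "egor_sketch"
  x
}

# Hypothetical activate(): switch the level that indexing operates on.
# Called without `what`, it falls back to the default level ("alts").
activate_sketch <- function(x, what = c("alts", "egos", "aaties")) {
  attr(x, "active") <- match.arg(what)
  x
}

# Row indexing is delegated to the active data frame only.
`[.egor_sketch` <- function(x, i) {
  lvl <- attr(x, "active")
  y <- unclass(x)
  y[[lvl]] <- y[[lvl]][i, , drop = FALSE]
  class(y) <- class(x)
  y
}

e <- new_egor_sketch(
  egos = data.frame(egoID = c("e1", "e2")),
  alts = data.frame(egoID = c("e1", "e1", "e2"), altID = c(1, 2, 1))
)
e_alts <- e[1:2]                         # subsets the alters
e_egos <- activate_sketch(e, "egos")[1]  # subsets the egos
```

Whatever level is active, e$egos, e$alts and e$aaties stay at the same 'address', which is the advantage of this option mentioned above.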

Protected/ essential columns of data frames

Other open questions

Other representations of the data that should be easy to draw from an egor object

Additional, two dataframes format

This could be useful as a general way of storing ego-centered network data where alters could be members of several egos' networks and egos could be alters of other egos at the same time.

There should be only 2 tables/ dataframes necessary for this, which would make this more similar to the tidygraph object.

  1. A single actors data frame containing egos and alters and their attributes with a globally unique ID for each actor.

  2. A membership/ tie data frame that consists of at least three columns. This would contain ego-alter relations as well as alter-alter relations. The third column would mark whether a relation is between ego and alter or between alters.

In addition to multi-network-memberships this format would also allow for the visualization and analysis of ego-centered networks where not only direct alters are included, but also the ties of the direct alters and so on. While this format would not be necessary for classical ego-centered network studies, it would be the format most agnostic of any field-specific prerequisites. The established procedures used to manipulate and analyze ego-centered network data would be mostly useless with this format. Stitching together ego networks into a socio-centric/ whole network would be easy with this, though. It could basically be a preparatory step for creating a single igraph/ network object from an ego-centered network dataset.
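A toy version of the two-table format in base R might look like this. The column names (id, from, to, type) and the type labels are assumptions made for the example.

```r
# One actors table with globally unique IDs; names are illustrative assumptions.
actors <- data.frame(id = c("a", "b", "c", "d"),
                     age = c(34, 29, 51, 40),
                     stringsAsFactors = FALSE)

# One tie table: source, target, and the kind of relation.
ties <- data.frame(
  from = c("a", "a", "b", "d", "d"),
  to   = c("b", "c", "c", "b", "a"),
  type = c("ego-alter", "ego-alter", "alter-alter", "ego-alter", "ego-alter"),
  stringsAsFactors = FALSE
)

# Egos are the actors that send ego-alter ties; "a" is also an alter of "d".
is_ea <- ties$type == "ego-alter"
ego_ids <- unique(ties$from[is_ea])
alters_of <- split(ties$to[is_ea], ties$from[is_ea])

# A whole-network edge list is the tie table without the type column.
edge_list <- unique(ties[c("from", "to")])
```

Here "b" belongs to the networks of both "a" and "d", and "a" is an ego and an alter at once. The edge_list and actors tables could then be handed to a graph constructor such as igraph::graph_from_data_frame().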

I think these changes will take some time and we should also take time to discuss and find good solutions to all relevant questions. I am looking forward to hearing your opinions and suggestions! :)

mbojan commented 6 years ago

Thanks a lot @tilltnet for collecting all of this!

Let's add more usecases and make them as concrete as possible. Why don't we open separate issues for every usecase all labelled with a common label (say, "usecase"). Ideally we would like to have toy data examples, pseudo-code etc.?

I'll try to dig out the code I wrote some time ago in which I was musing about tidy handling of network data. Perhaps it will be an inspiration for how things can (or should not) be done.

krivit commented 6 years ago

@tilltnet , thanks for the writeup. My comments so far:

tilltnet commented 6 years ago

I added a wiki page for the collection of the use cases. That way we keep the issues page free for actual issues.

krivit commented 5 years ago

@martinamorris, @raffaelevacca, @mbojan, @krivit, @tilltnet, @mbojan, trying to pick this thread up again, it sounds like we have some idea about the data structure we want: a list of 3 tibbles, for egos, alters, and alter-alters, with egos being a srvyr object, with its API modelled loosely after tidygraph's activation.

One concern I do have is that the group_by and similar semantics don't apply to egocentric data directly. In particular to get a list of alters for each ego, it's not enough to split the alter table on ego IDs: we must also include empty tables for those egos that lack alters.
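The problem, and one possible base R workaround, can be shown with toy data (the column names are assumptions): splitting on a factor whose levels are taken from the ego table keeps an empty table for every ego without alters.

```r
egos <- data.frame(egoID = c("e1", "e2", "e3"), stringsAsFactors = FALSE)
alts <- data.frame(egoID = c("e1", "e1", "e3"), altID = c(1, 2, 1),
                   stringsAsFactors = FALSE)  # e2 has no alters

# Plain split() silently drops e2.
by_ego_naive <- split(alts, alts$egoID)

# Splitting on a factor whose levels are all ego IDs keeps an empty table for e2.
by_ego <- split(alts, factor(alts$egoID, levels = egos$egoID))
```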

krivit commented 5 years ago

@martinamorris, @raffaelevacca, @mbojan, @krivit, @tilltnet, @mbojan, proposed updated specification at https://github.com/tilltnet/egor/wiki/egor2-format-specification .

tilltnet commented 5 years ago

Thank you @krivit for laying all of this out.

One concern I do have is that the group_by and similar semantics don't apply to egocentric data directly. In particular to get a list of alters for each ego, it's not enough to split the alter table on ego IDs: we must also include empty tables for those egos that lack alters.

I think instead of fixing this by introducing a group_by method the results of the calculation/ summarizing should just be joined to the ego dataframe by the egoID. For egor's exported functions this should be done internally and they should usually return a vector that is sorted by the egoID in the ego dataframe and contains NAs where no alters are present. I agree that this is a disadvantage to the current format, but the gains in speed and fewer list operations should outweigh this from the user's perspective.

Concerning your questions in the wiki:

  1. Should alter design be an egor list element or an attribute?

I think it would be more coherent with the srvyr object to have it stored as an attribute of the alts object.

  1. To implement tidygraph-style semantics, should the currently activated attribute be a list element or an attribute?

As a list element it would be slightly more accessible - I'd prefer this, unless there are advantages/disadvantages I haven't thought of.

  1. Are the invariants too strict?

I think we should simply warn the user when any operation duplicates egoIDs in egos or altIDs per egoID in alts. Other functions of egor should then reject those objects.

Additionally the following rules need to be included:

  1. Deleting egos deletes their alts and aaties.
  2. Deleting alts deletes the corresponding aaties.

In addition to subset methods there should be methods for dplyr::filter that follow these rules.
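The two deletion rules could be sketched in base R like this. The helpers drop_egos() and drop_alts() and the column names (egoID, altID, srcID, tgtID) are hypothetical, made up for this example.

```r
# Hypothetical helper: drop egos and cascade to alters and alter-alter ties.
drop_egos <- function(egos, alts, aaties, keep) {
  egos <- egos[keep, , drop = FALSE]
  alts <- alts[alts$egoID %in% egos$egoID, , drop = FALSE]
  aaties <- aaties[aaties$egoID %in% egos$egoID, , drop = FALSE]
  list(egos = egos, alts = alts, aaties = aaties)
}

# Hypothetical helper: drop alters and cascade to their alter-alter ties.
drop_alts <- function(alts, aaties, keep) {
  alts <- alts[keep, , drop = FALSE]
  key <- paste(alts$egoID, alts$altID)
  ok <- paste(aaties$egoID, aaties$srcID) %in% key &
    paste(aaties$egoID, aaties$tgtID) %in% key
  list(alts = alts, aaties = aaties[ok, , drop = FALSE])
}

egos <- data.frame(egoID = c("e1", "e2"), stringsAsFactors = FALSE)
alts <- data.frame(egoID = c("e1", "e1", "e1", "e2", "e2"),
                   altID = c(1, 2, 3, 1, 2), stringsAsFactors = FALSE)
aaties <- data.frame(egoID = c("e1", "e1", "e2"),
                     srcID = c(1, 2, 1), tgtID = c(2, 3, 2),
                     stringsAsFactors = FALSE)

# Rule 1: deleting ego e2 removes its alters and aaties.
without_e2 <- drop_egos(egos, alts, aaties, keep = egos$egoID != "e2")

# Rule 2: deleting alter 3 of e1 removes the tie 2-3 in e1's network.
without_alt <- drop_alts(alts, aaties,
                         keep = !(alts$egoID == "e1" & alts$altID == 3))
```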

  1. Should the user be able to manually specify the .egoID, .altID, etc., and should they be allowed to be characters as well?

What do you mean by manually specify? Choosing the column names themselves or the actual values? I think the names should be fixed but the values should be kept from the user provided data. Characters should be allowed, yes! That way the user can see the relation to their 'raw data' and it's easier to inspect and compare the original data to the egor object.

Ego table (egos) - This table is a srvyr object...

What if no survey design is specified? Should the egos dataframe still be a srvyr object or just a plain tibble? From what I understand a srvyr object is prevented from reordering with dplyr::arrange, which I think is a good thing. Are there any other limitations when using the srvyr object instead of a tibble, that we should think of?

martinamorris commented 5 years ago

I think instead of fixing this by introducing a group_by method the results of the calculation/ summarizing should just be joined to the ego dataframe by the egoID. For egor's exported functions this should be done internally and they should usually return a vector that is sorted by the egoID in the ego dataframe and contains NAs where no alters are present.

to make sure i understand: you're proposing that summaries based on the alters of an ego (for example: number (ego degree), % male, min(age)) will return a vector sorted by egoID? a couple of thoughts on that:

I'm not sure I know enough about methods to comment on group_by vs. a dedicated exported function, but I tend to think it would be preferable to use adapted tidyverse methods whenever possible.

I think it would be more coherent with the srvyr object to have it stored as an attribute of the alts object.

agreed -- and are there use cases for transforming the alts object into a srvyr object?

  1. Should the user be able to manually specify the .egoID, .altID, etc., and should they be allowed to be characters as well?

What do you mean by manually specify? Choosing the column names themselves or the actual values? I think the names should be fixed but the values should be kept from the user provided data. Characters should be allowed, yes! That way the user can see the relation to their 'raw data' and it's easier to inspect and compare the original data to the egor object.

It may be worth distinguishing between an internal, automatically set numerical id (that the user can see/query, but can not change), and the node identifier(s) that come with the dataset. Esp. in the case of character "names", there may be duplicates.

What if no survey design is specified? Should the egos dataframe still be a srvyr object or just a plain tibble? From what I understand a srvyr object is prevented from reordering with dplyr::arrange, which I think is a good thing. Are there any other limitations when using the srvyr object instead of a tibble, that we should think of?

  1. I'm not sure why you think it's a good thing not to reorder a srvyr object. Why wouldn't you want to arrange (i.e., sort) your egodata object by attributes? Assuming we'd want to do this, is there a simple conversion like as.tibble (which i don't think currently transforms srvyr objects)?

  2. assuming we want to impose the srvyr object format, if no design is specified, are there some natural defaults (e.g., weight=1) that can be attributed?

tilltnet commented 5 years ago

to make sure i understand: you're proposing that summaries based on the alters of an ego (for example: number (ego degree), % male, min(age)) will return a vector sorted by egoID? a couple of thoughts on that:

yes, that's what i meant. To be more precise, they would be ordered by the order of egoIDs in the egos object, NOT by the alphanumeric order of the egoIDs.

it might be worth giving an option to return a 2 col matrix that includes egoID (this would facilitate error free joining in any context)

I do agree with this to a certain extent; at the same time I really want to keep the available arguments of all commands to the necessary minimum. As a compromise we could make the package's functions return named vectors, with the names being the egoIDs. That way the association between the result values and the respective ego would be obvious, and joining the results to incomplete/reordered egos dataframes could still be facilitated.
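A small sketch of the named-vector idea, with made-up values and an assumed egoID column:

```r
egos <- data.frame(egoID = c("e1", "e2", "e3"), stringsAsFactors = FALSE)

# Result of some analysis function, returned as a vector named by egoID.
degree <- c(e1 = 2, e2 = 0, e3 = 5)

# The names make joining robust to a reordered or incomplete egos table.
egos_sub <- egos[c(3, 1), , drop = FALSE]
egos_sub$degree <- unname(degree[egos_sub$egoID])
```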

for sparse nets with lots of nodes with degree=0, it might be worth an option that returns just the degree >0 result, with the egoID.

Filtering the results should be done with established base R/tidygraph methods (e.g. na.exclude/na.omit), instead of adding new, package-specific ways for this. But I do agree that this is a common thing that researchers would need and we should include a corresponding example in the vignette. I could imagine something like this: get results - bind them to egos - filter egos (e.g. degree > 0) - cross tabulate with ego attribute.
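As a rough idea of what such a vignette example might contain, here are the four steps in base R on toy data. The column names are assumptions, and degree is computed by hand rather than with an egor function.

```r
egos <- data.frame(egoID = c("e1", "e2", "e3", "e4"),
                   sex = c("f", "m", "f", "m"),
                   stringsAsFactors = FALSE)
alts <- data.frame(egoID = c("e1", "e1", "e3", "e4"),
                   stringsAsFactors = FALSE)  # e2 has no alters

# 1. Get results: degree per ego, in the row order of `egos`.
degree <- as.vector(table(factor(alts$egoID, levels = egos$egoID)))

# 2. Bind them to the egos table.
egos$degree <- degree

# 3. Filter egos with standard subsetting.
egos_connected <- egos[egos$degree > 0, ]

# 4. Cross tabulate with an ego attribute.
tab <- table(sex = egos_connected$sex, degree = egos_connected$degree)
```

For results that contain NA instead of 0 for egos without alters, na.omit() on the combined table would play the role of step 3.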

I'm not sure I know enough about methods to comment on group_by vs. a dedicated exported function, but I tend to think it would be preferable to use adapted tidyverse methods whenever possible.

I do agree that methods for the tidyverse functions are the way to go and will be needed for certain things, but for now I feel that for this specific problem a method is not needed, and it might even be hard to write a sensible one, because (I think!) it's hard to predict what the user wants to do with the groupings, and the variance in form of the user-created results is quite vast. The functions that usually follow the group_by() command (i.e. summarise() and do()) return the grouping variable alongside the results, just as you described in your post:

[...] return a 2 col matrix that includes egoID (this would facilitate error free joining in any context)

It may be worth distinguishing between an internal, automatically set numerical id (that the user can see/query, but can not change), and the node identifier(s) that come with the dataset. Esp. in the case of character "names", there may be duplicates.

I would prefer not to have redundant ID systems. The data import functions should check for duplicates in egoIDs in egos and alterIDs per ego and throw an error when there are duplicates. Those duplicates could be an indication for problems in the raw data, that the user would have to take care of before the import. The error message could list the problematic cases, so that the user is enabled to investigate and fix the problems more easily. Fixing duplicates in the background would give a false sense of correct data.
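Such an import check could look roughly like this. The function check_ids() and the column names are hypothetical; the point is only that the error message lists the problematic cases.

```r
# Hypothetical import check: stop on duplicated IDs and list the offending cases.
check_ids <- function(egos, alts) {
  dup_egos <- unique(egos$egoID[duplicated(egos$egoID)])
  if (length(dup_egos) > 0) {
    stop("Duplicated egoIDs in egos: ", paste(dup_egos, collapse = ", "))
  }
  alt_key <- paste(alts$egoID, alts$altID, sep = "/")
  dup_alts <- unique(alt_key[duplicated(alt_key)])
  if (length(dup_alts) > 0) {
    stop("Duplicated altIDs within egos (egoID/altID): ",
         paste(dup_alts, collapse = ", "))
  }
  invisible(TRUE)
}

egos <- data.frame(egoID = c("e1", "e2"), stringsAsFactors = FALSE)
alts_bad <- data.frame(egoID = c("e1", "e1", "e2"), altID = c(1, 1, 1),
                       stringsAsFactors = FALSE)

# Capture the error message instead of aborting, to show what the user would see.
msg <- tryCatch(check_ids(egos, alts_bad),
                error = function(e) conditionMessage(e))
```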

I'm not sure why you think it's a good thing not to reorder a srvyr object. Why wouldn't you want to arrange (i.e., sort) your egodata object by attributes? Assuming we'd want to do this, is there a simple conversion like as.tibble (which i don't think currently transforms srvyr objects)?

From the srvyr vignette:

Note that arrange() is not available, because the srvyr object expects to stay in the same order. Nor are two-table verbs such as full_join(), bind_rows(), etc. available to srvyr objects either because they may have implications on the survey design. If you need to use these functions, you should use them earlier in your analysis pipeline, when the objects are still stored as data.frames.

So, other than my personal preference of not reordering the original data, this is a restriction put in place by the authors of the srvyr package (I have yet to investigate why this is the case!).

But actually the 2nd and 3rd sentences of that quote make me think about how we should approach this. If we can't use the *_join() functions, since they could change the number of rows of the egos object, maybe it is better to have the egos object as a regular tibble and let the user convert it to a srvyr object after the ego-network-specific operations have been completed? Or we could have methods for the *_join() operations and for merge() that throw an error on duplicated egoIDs in the y argument.
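The second idea could be sketched as a guarded join in base R. join_egos() is a hypothetical helper built on merge(), not an existing egor or dplyr method, and the egoID column name is an assumption.

```r
# Hypothetical wrapper: a left join on egoID that refuses a `y` with duplicated egoIDs.
join_egos <- function(egos, y) {
  if (anyDuplicated(y$egoID) > 0) {
    stop("`y` contains duplicated egoIDs; the join would add rows to egos.")
  }
  out <- merge(egos, y, by = "egoID", all.x = TRUE, sort = FALSE)
  out[match(egos$egoID, out$egoID), , drop = FALSE]  # restore original ego order
}

egos <- data.frame(egoID = c("e3", "e1", "e2"), stringsAsFactors = FALSE)
y <- data.frame(egoID = c("e1", "e2"), x = c(1, 2), stringsAsFactors = FALSE)
joined <- join_egos(egos, y)
```

With this guard, the number and order of rows in the egos table stay fixed, which is the property a survey design attached to the egos would rely on.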

assuming we want to impose the srvyr object format, if no design is specified, are there some natural defaults (e.g., weight=1) that can be attributed?

Yes, this is possible and also how it's done in the current version.

krivit commented 5 years ago

So, other than my personal preference of not reordering the original data, this is a restriction put in place by the authors of the srvyr package (I have yet to investigate why this is the case!).

That is unfortunate. As far as I can tell, survey package is actually pretty smart about indexing: repeating rows creates a cluster sample:

> example(svydesign)
[...]
> dstrat
Stratified Independent Sampling design
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
> nrow(dstrat)
[1] 200
> dstrat[rep(1:200,each=2),]
Stratified 1 - level Cluster Sampling design
With (200) clusters.
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
> dstrat[rep(200:1,each=2),]
Stratified 1 - level Cluster Sampling design
With (200) clusters.
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

krivit commented 5 years ago

@tilltnet @mbojan @raffaelevacca @martinamorris

It seems that we are stalled again. We seem to have an outline of a format that works (and anyone should feel free to edit the wiki), with only minor details still up in the air, so perhaps we should fork a branch and get started. In addition, we should probably sort out who should update which component.

tilltnet commented 5 years ago

I am currently not able to do very much on this, but that should change in the middle of February.

I'd like to rewrite the egor() function as well as the checks needed for the egor objects. Also I would take care of the dplyr methods and the analysis functions. @krivit if you could take care of the survey design implementation and the square bracket indexing of the egor object, would that work? I'll try to create a branch and an initial, very basic egor() function this weekend, so we can start with the other parts.

btw I submitted an update to CRAN yesterday, because some tests failed due to the new dplyr version coming up soon. The update was accepted today.

krivit commented 5 years ago

@tilltnet , happy to. I should probably handle the activation mechanism as well.

krivit commented 5 years ago

@tilltnet , were you going to create a development branch for the restructuring? Should I?

tilltnet commented 5 years ago

I created it already, it is here: https://github.com/tilltnet/egor/tree/speedy-g I'm sorry I should have mentioned that here.

tilltnet commented 4 years ago

All functions are now rewritten to work as per the new specifications. Since the current CRAN version of egor is not working anymore with the most recent version of tidyr, I will merge branches and submit to CRAN sometime this week.

@krivit I was not able to get the subsetting (subset.egor and []-subsetting) to work and butchered your functions until I got it to work in a basic manner. It now works, as specified in the examples. I took out a big chunk of code though, that I assume was meant to make the evaluation of the condition context aware. If you want to restore the code here is the link to the version before my commit. Sorry for this!

Currently the specification of a survey design is not supported. As this is an important feature, I am aiming to add that functionality in an update in ~2 months.

Other significant changes are:

krivit commented 4 years ago

Thank you for undertaking the Herculean effort! I will try to fix the indexing functions tomorrow.

krivit commented 4 years ago

As an update, I am working to re-add support for survey sampling by making the ego table a srvyr object so that the sampling design is kept in sync with the data.

krivit commented 4 years ago

It turns out that srvyr objects aren't particularly friendly to manipulation of the sort that is needed for subsetting and such, so it's taking longer to get it to work than I would have liked, but I'm making progress.

tilltnet commented 4 years ago

Sounds good @krivit !!

I have merged the speedy-g branch into master now and submitted the current version to CRAN, in order to have a working version on CRAN.

tilltnet commented 4 years ago

I am about to put together the next CRAN version, with a bunch of fixes and improvements to the plotting and import functions and a huge speed improvement to the trim_aaties() function, which deletes aaties whenever egos/alters are deleted from the dataset. I also added a vignette for working with the ego-centered network data from the German Allbus survey. Will do some changes to the vis app and then push to CRAN sometime this week.

@krivit I was reading a bit about the srvyr object and have some thoughts about how to integrate it best into egor. Since the srvyr object comes along with these restrictions I think we should:

  1. Make it optional:

    • Detect if the dots (...) argument of egor() includes any of the "main" arguments of as_survey_design (ids, probs, strata, fpc, weights should do the job).
    • If any of those are found use them to convert the ego table to a srvyr object.
  2. Add a function that converts the ego table to a srvyr object. This could just be a method for as_survey_design. Since the Allbus includes weights, I could include examples in the new vignette that show how to use the egor() function with weights and, alternatively, how to apply them later on. The instructions to the user could then say that they should mostly manipulate the ego table before making it a srvyr object.
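The detection step from point 1 could be sketched as follows. egor_sketch() is a hypothetical stand-in for egor() that only reports which design arguments were passed through the dots; it does not call srvyr, and the extra argument in the usage example is made up.

```r
# Names of the "main" design arguments mentioned above.
design_args <- c("ids", "probs", "strata", "fpc", "weights")

# Hypothetical stand-in for egor(): reports whether a design was requested via `...`.
egor_sketch <- function(alts, egos = NULL, ...) {
  # Capture the dots unevaluated, since design arguments are often bare column names.
  dots <- as.list(substitute(list(...)))[-1L]
  design <- dots[names(dots) %in% design_args]
  list(alts = alts, egos = egos,
       has_design = length(design) > 0,
       design = design)
}

alts <- data.frame(egoID = c("e1", "e1"), altID = c(1, 2))
egos <- data.frame(egoID = "e1", wt = 1.5)

with_design <- egor_sketch(alts, egos, weights = wt, some_other_arg = "x")
no_design <- egor_sketch(alts, egos)
```

In a real implementation the captured design arguments would then be passed on to as_survey_design() together with the ego table.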

krivit commented 4 years ago

@tilltnet , sorry, didn't see this note until just now. I just pushed a branch ego_srvyr that changes the ego tibble to a tbl_svy object, and I agree that the result is very clunky.

Making design optional makes sense. Maybe we can have two subclasses of egor, say, egor_ego_df and egor_ego_svy for those functions that need to handle the two data types differently.

krivit commented 4 years ago

@tilltnet, OK, the current ego_srvyr branch handles design pretty seamlessly. Basically, if you have an egor object e, ego_design(e) <- list(~clusterID) will give it a corresponding design and turn e$ego into a tbl_svy object, whereas ego_design(e) <- NULL will clear the design and turn e$ego back into a tbl_df.

It passes the check at least as well as master. Let me know if you're happy for me to merge it in.

tilltnet commented 4 years ago

Looks good to me - please go ahead with the merge. Locally I had some issues, I can work on them and submit to CRAN hopefully this weekend. This would line up perfectly with a few changes needed due to the upcoming dplyr 1.0.0 version (#37).

krivit commented 4 years ago

Done. Anything else we need to do on this ticket?

tilltnet commented 4 years ago

I think we can close this. I made some tiny fixes and will submit to CRAN shortly.

tilltnet commented 4 years ago

All is done and egor 0.20.06 is on CRAN as of today.