nstrayer / sbmr_old

An R library for fitting Stochastic Block Models on network data

Paths forward / Goals #1

Open nstrayer opened 5 years ago

nstrayer commented 5 years ago

This repo is based on this tweet by @alexpghayes.

For the past few months I have been working off and on on an SBM package in R internally for my lab (specifically for bipartite SBMs). My main approach is to use Docker to run a predefined graph-tool Python script. This works but is bad for a few reasons.

  1. Docker adds overhead to something that's already computationally intensive.
  2. If we want to change something or look at some other aspect, it's a pain to set up.
  3. Managing file permissions in a secured Unix environment and Docker is a giant pain.

These make me think that building a native R library is a good idea. Unfortunately for the reticulate path, graph-tool is kind of a pain to install: it's mostly C++ code and thus isn't a simple pip install away. It also does a lot of things other than just SBMs.

Since SBMs are rather simple, I feel like building stuff from scratch wouldn't be too hard. I have experimented a bit with this in JavaScript (see this messy Observable notebook).

I have also started to build out some Rcpp code for a bare-bones network class that has just what is needed to fit SBMs.
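
To make "just what is needed to fit SBMs" concrete, here is a minimal sketch of such a class. It is written in Python for brevity and is not the Rcpp code mentioned above; the class name, method names, and the rule that a block may only hold nodes of one type (the bipartite case) are all assumptions for illustration.

```python
from collections import defaultdict


class BlockNetwork:
    """Hypothetical bare-bones network: nodes, types, blocks, and edges only."""

    def __init__(self):
        self.node_type = {}  # node id -> type label (e.g. "a" or "b" when bipartite)
        self.block_of = {}   # node id -> current block label
        self.edges = set()   # undirected edges stored as frozenset({u, v})

    def add_node(self, node, node_type, block):
        self.node_type[node] = node_type
        self.block_of[node] = block

    def add_edge(self, u, v):
        if u == v:
            raise ValueError("self-loops are not handled in this sketch")
        self.edges.add(frozenset((u, v)))

    def can_move(self, node, new_block):
        """Assumed bipartite rule: a block may only contain nodes of one type."""
        return all(
            self.node_type[other] == self.node_type[node]
            for other, block in self.block_of.items()
            if block == new_block and other != node
        )

    def block_edge_counts(self):
        """Count edges between each unordered pair of blocks."""
        counts = defaultdict(int)
        for edge in self.edges:
            u, v = tuple(edge)
            key = tuple(sorted((self.block_of[u], self.block_of[v])))
            counts[key] += 1
        return dict(counts)
```

The block-to-block edge counts are the sufficient statistics most SBM likelihoods are built from, which is why the sketch keeps them cheap to recompute and leaves everything else out.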

My constraints

So far I have done this work in private for a few reasons:

  1. I have lots of collaborations that take up almost all my time, so I didn't want to put something into the public only to have it go stale while I work on custom things only usable for my research.
  2. My research focuses specifically on bipartite SBMs, so I was neglecting the non-bipartite situations (luckily this is rather easy to remedy).
  3. I want to graduate soonish and need to put out a couple of papers before I do, and the effort-to-paper ratio for these projects is somewhat low.
  4. We've internally been having trouble with how to deal with the stochastic nature of the model in order to deliver stable results. I think from a statistics point of view more effort needs to be put into how to summarize results from the entire chain, rather than just picking the maximum posterior value as seems to be more common in the network world.

nstrayer commented 5 years ago

Goals

Personally what would be most beneficial for the package is:

  1. Ability to fit bipartite Bayesian SBMs to simple data (preferably hierarchical).
  2. Results are returned in an actionable, non-proprietary format (e.g. node status).
  3. Nice, easy-to-use, publishable visualizations of the results (I think this is almost the only way to think about the results of SBMs, but I may be biased).
  4. Diagnostic tools for looking at convergence of MCMC.
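
As one possible shape for the convergence diagnostics in item 4, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for a scalar trace recorded from several chains. Which scalar to track (for example the number of blocks or the model's description length per sweep) is an assumption, as is the function name; nothing here comes from the package itself.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic R-hat for equal-length chains of a scalar summary.

    `chains` is a list of lists, one list of values per MCMC chain.
    Values near 1 suggest the chains agree; larger values suggest they do not.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    within = mean([variance(c) for c in chains])       # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])  # B: scaled variance of chain means

    if within == 0:
        # Degenerate case: every chain is constant.
        return 1.0 if between == 0 else math.inf

    pooled = (n - 1) / n * within + between / n        # pooled variance estimate
    return math.sqrt(pooled / within)
```

A scalar diagnostic like this says nothing about whether the partitions themselves agree across chains, so it would complement rather than replace a comparison of the block structures.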

alexpghayes commented 5 years ago

For the past few months I have been working off and on on an SBM package in R internally for my lab (specifically for bipartite SBMs).

I would love to see this if that's alright! I'm playing with some campaign finance data (personal project, not research) but have been having a hard time finding packages to fit BiSBMs.

Since SBMs are rather simple, I feel like building stuff from scratch wouldn't be too hard. I have experimented a bit with this in JavaScript (see this messy Observable notebook).

I think this would be valuable as well, although it would be a significant time investment. There are a lot of different estimation techniques and I don't have a good sense of which ones are more important at the moment.

I have also started to build out some Rcpp code for a bare-bones network class that has just what is needed to fit SBMs.

This would be useful to me for both personal and research projects and I would love to coordinate.

My research focuses specifically on bipartite SBMs, so I was neglecting the non-bipartite situations (luckily this is rather easy to remedy).

I'm more in the full SBM world, but interested in both.

I want to graduate soonish and need to put out a couple of papers before I do, and the effort-to-paper ratio for these projects is somewhat low.

Agreed. I think starting with minimal infrastructure to facilitate research and building out from that would be ideal. In particular, a standardized class for SBM parameter estimates would be quite valuable to me (I have started on this and can probably share my code).

We've internally been having trouble with how to deal with the stochastic nature of the model in order to deliver stable results. I think from a statistics point of view more effort needs to be put into how to summarize results from the entire chain, rather than just picking the maximum posterior value as seems to be more common in the network world.

Hmmm this sounds like an interesting problem. Don't have anything here, but will keep an eye out for advice on this, and will ask about this problem (I'm sure it comes up in BART, etc) in the Modern Statistical Workflow Slack.

  1. Ability to fit bipartite Bayesian SBMs to simple data (preferably hierarchical).
  2. Results are returned in an actionable, non-proprietary format (e.g. node status).
  3. Nice, easy-to-use, publishable visualizations of the results (I think this is almost the only way to think about the results of SBMs, but I may be biased).

Yes yes and yes.

Diagnostic tools for looking at convergence of MCMC.

I'm not using MCMC at the moment, so this is less of a priority for me. However, it would be great to have a fitted SBM class object that works with both frequentist and Bayesian estimators.
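
A rough sketch of what such a shared fitted-model object could look like follows. It is in Python rather than R, and every field and method name is hypothetical: the idea is only that a point estimate is always present, while posterior draws are optional and stay empty for frequentist fits.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FittedSBM:
    """Hypothetical container for SBM estimates from any estimator."""

    memberships: dict  # node id -> block label (point estimate)
    block_probs: dict  # (block_a, block_b) -> estimated edge probability
    estimator: str     # free-text label, e.g. "spectral" or "mcmc"
    draws: tuple = ()  # optional posterior membership draws; empty if none

    @property
    def has_draws(self):
        """True when the fit carries posterior draws (a Bayesian fit)."""
        return len(self.draws) > 0

    def block_sizes(self):
        """Number of nodes assigned to each block."""
        return Counter(self.memberships.values())

    def edge_prob(self, u, v):
        """Estimated edge probability for a node pair (undirected lookup)."""
        a, b = self.memberships[u], self.memberships[v]
        if (a, b) in self.block_probs:
            return self.block_probs[(a, b)]
        return self.block_probs.get((b, a), 0.0)
```

Downstream code such as plotting or summaries could then depend only on `memberships` and `block_probs`, and check `has_draws` before attempting anything that needs the full chain.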

alexpghayes commented 5 years ago

Follow-up question: the difficulty in summarizing the MCMC draws is coming from summarizing the dendrogram for a hierarchical SBM, right?

nstrayer commented 5 years ago

I will see how much of the internal stuff I can pull out of our private repo. Right now it works by calling the graph-tool Python package inside a Docker container that I start and run with system() calls in R. It's brittle because of how file permissions work in Docker.

I put some of the Rcpp code in the repo. Like I said, it's bare-bones right now, and I am building it out from what I've already done in JavaScript.

The difficulty with summarizing MCMC comes from the shifting structure, yeah. There are changes not only in cluster membership, but also in the number of clusters and the number of levels of hierarchy. I've got some stuff started looking into this that I hope I will be able to turn into a quick paper. It revolves mostly around creating a pairwise distance metric between nodes across runs and then using spectral-style methods. It's still in the incubation period though.
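
One common way to build a pairwise distance between nodes across runs is the co-assignment frequency: how often two nodes land in the same block, regardless of how blocks are labeled or how many there are. The sketch below illustrates that idea only; it is an assumption about what such a metric could look like, not the method described above, and it leaves out both the hierarchy and the spectral step.

```python
from itertools import combinations


def coassignment_distance(partitions):
    """Distance = 1 - fraction of partitions in which two nodes share a block.

    `partitions` is a list of dicts (node id -> block label), one per MCMC
    draw or independent run. Block labels only need to be comparable within
    a single partition, so label switching and a changing number of blocks
    do not matter.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    nodes = sorted(partitions[0])
    if any(set(p) != set(nodes) for p in partitions):
        raise ValueError("all partitions must cover the same nodes")

    together = {pair: 0 for pair in combinations(nodes, 2)}
    for partition in partitions:
        for u, v in together:
            if partition[u] == partition[v]:
                together[(u, v)] += 1

    total = len(partitions)
    return {pair: 1 - count / total for pair, count in together.items()}
```

For a hierarchical fit, one option would be to compute this per level and combine the results; the resulting matrix could then be fed to a spectral embedding or clustering routine.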