mpadge opened 4 years ago
Hi @mpadge !
I would love to open a dialog about making the C++ side of things more standard. sbmR has been my very first serious introduction to C++, so the code is a window into my evolving understanding of the language, and poor design decisions early in the process certainly steered it to where it is now.
Once I have the bandwidth, I plan to refactor the C++ side to strike a better balance: enough generality to enable these kinds of interoperability, but enough specificity to avoid sacrificing speed.
For instance, I don't make any use of the built-in SEXP wrappers from Rcpp because I have a two-stage development process: C++ build and test, then R build and test. I can't rely on any Rcpp functions in the core C++ code because they aren't available when I run the C++ tests.
I notice that your code makes good use of these, which obviously has the benefit of less copying etc., but I wonder how you balance that with testing on the C++ side of things?
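For concreteness, here's the kind of separation I'm describing - purely illustrative file and function names, not the actual sbmR code:

```cpp
// sbm_core.h -- illustrative only: plain C++ with no Rcpp dependency,
// so it can be compiled and unit-tested without R being available.
#pragma once
#include <vector>

namespace sbm_core {
// Toy stand-in for a core routine: total weight of a set of edges.
inline double total_edge_weight(const std::vector<double>& weights) {
  double total = 0.0;
  for (const double w : weights) total += w;
  return total;
}
}  // namespace sbm_core
```

```cpp
// rcpp_interface.cpp -- the only place Rcpp types appear; it just
// converts R vectors to std:: containers and forwards to the core.
#include <Rcpp.h>
#include "sbm_core.h"

// [[Rcpp::export]]
double total_edge_weight_r(Rcpp::NumericVector weights) {
  std::vector<double> w(weights.begin(), weights.end());
  return sbm_core::total_edge_weight(w);
}
```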
Initialization in sbmR is done via an agglomerative merging algorithm that's rather efficient and works pretty well (for simulated data at least). No edge collapsing is done. That's absolutely an area I want to dig further into, though: given the energy landscape of these discrete processes, you want to make sure your initial state is pretty darn good.
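For context, the general shape of that kind of greedy agglomerative pass, in deliberately simplified form - this is not the actual sbmR implementation, and `merge_cost` is just a stand-in for whatever model-specific objective (e.g. an entropy or description-length change) drives the merges:

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <numeric>
#include <vector>

// Greedy agglomerative merging, simplified: every node starts in its own
// group, and at each step the pair of groups whose merge has the lowest
// cost is collapsed, until `target_groups` remain.
std::vector<std::size_t> agglomerate(
    std::size_t n_nodes, std::size_t target_groups,
    const std::function<double(std::size_t, std::size_t)>& merge_cost) {
  std::vector<std::size_t> group(n_nodes);
  std::iota(group.begin(), group.end(), 0);   // group i initially = {node i}
  std::vector<bool> alive(n_nodes, true);     // which group labels still exist
  std::size_t n_groups = n_nodes;

  while (n_groups > target_groups) {
    double best = std::numeric_limits<double>::infinity();
    std::size_t best_a = 0, best_b = 1;
    for (std::size_t a = 0; a < n_nodes; ++a) {
      if (!alive[a]) continue;
      for (std::size_t b = a + 1; b < n_nodes; ++b) {
        if (!alive[b]) continue;
        const double cost = merge_cost(a, b);  // lower = better merge
        if (cost < best) { best = cost; best_a = a; best_b = b; }
      }
    }
    for (auto& g : group)                      // relabel b's members into a
      if (g == best_b) g = best_a;
    alive[best_b] = false;
    --n_groups;
  }
  return group;  // group label for each node
}
```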
Thanks for your quick response Nick. I noticed your very independent C++ vs R structure, extending to a stand-alone C++ test suite. My approach is generally the opposite: I ensure that the entire C++ code base of any package can be traversed by testing only exported R functions. The only advantage of my approach is that the entire (R and C++) code base can be tested via a single R CMD check call. The advantage of your approach is its powerful modularity: your entire C++ code can simply be lifted out in a fully-tested state and ported elsewhere with almost no effort, which is really impressive, and potentially really useful.
spatialcluster implements both agglomerative merging and dis-agglomerative cutting algorithms (the latter being tree-cutting algorithms in traditional applications, modified here for the full-edge-relationship case), with a variety of options to control which is used. I reckon a good starting point might be for me to re-ignite my efforts on this package as soon as I can, revisiting the core algorithms in conjunction with a perusal of your merging algorithm. That should give me a much better understanding of how we might both strive for maximal compatibility and opportunities for co-development.
That pretty much amounts to me accepting, for the time being, that the ball is in my proverbial court, and I'll get back to you anon, once I've dug sufficiently deeply into your code. Sound good?
Finally, one thing you might get out of that is insight into how to extend your analyses to continuous edge weights. I note your statement in your talk that you only support discrete edge weights, whereas spatialcluster is actually primarily intended to be applied to matrices of covariances (or correlations) between nodes.
By the way, I note that you're not part of rOpenSci's slack group - would you be interested in joining? It's hugely useful as a place to ask questions and almost always get quick, useful answers; it's full of all sorts of unexpected insights into R, packages, and random stuff, and it has hundreds of amazing people. Let me know if you're keen, and I'll arrange for an invitation to be sent your way. It could also provide a more casual forum for us to exchange ideas along the way.
@mpadge ,
When starting the project, my intention was to write a stand-alone C++ library that I could use in R but also in JavaScript, as I have been wanting an excuse to dig into WebAssembly. There are also a lot of interesting network visualization and investigation tools in JavaScript that I am much more comfortable with than the corresponding R versions. This may have been a premature optimization on my part.
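Roughly what I had in mind for the JavaScript side, purely as a sketch: the same Rcpp-free core (the toy sbm_core.h from my earlier comment) compiled with Emscripten and exposed through embind. None of this exists in sbmR yet; names are illustrative only.

```cpp
// wasm_interface.cpp -- illustrative embind glue for the same core code,
// compiled to WebAssembly with something like:
//   em++ --bind wasm_interface.cpp -o sbm.js
#include <emscripten/bind.h>
#include "sbm_core.h"

EMSCRIPTEN_BINDINGS(sbm_module) {
  // Register a vector type so JavaScript can pass numeric arrays across.
  emscripten::register_vector<double>("VectorDouble");
  // Expose the toy core routine to JavaScript as `totalEdgeWeight`.
  emscripten::function("totalEdgeWeight", &sbm_core::total_edge_weight);
}
```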
One thing to note is that I am in the process of refactoring the C++ side, prompted by some annoying bug hunting I've done recently and an attempt to remove any cruft. (I've been reading Bjarne Stroustrup's book "A Tour of C++" and getting all sorts of ideas.) This work is ongoing in the cpp_refactoring branch.
I would love to be part of the rOpenSci slack! All the people I know who are part of that community are absolutely amazing. In addition, I am working in relative isolation on the software-engineering side of things, so I think I could, selfishly, benefit a lot from having people to discuss things with.
Coincidence: https://twitter.com/bikesRdata/status/1232383846689247232. And yeah, the portability to WebAssembly is a very strong argument for your way of doing things.
@nstrayer I just watched your excellent rstudio::conf 2020 talk on your sbmR package - fabulous piece of work there! I'd like to ensure maximal interoperability and compatibility between this package of mine and any and all analogous packages, such as yours and the sabre package by @nowosad (via #7). I'm thinking, at least initially, of ensuring maximal interoperability between data representations, so that we might both gain from mutual development. Both your package and mine are primarily C++ based, so the R representation is not likely to be directly relevant, but potential transferability of C++ code could be important.

Your C++ interface is quite specific to your workflow, whereas I've kept mine intentionally very simple (example here), mostly via simple data.frame representations of edge properties (see the sketch below). I would at least be interested in discussing ways to reconcile these two representations, and ultimately to enable direct plug-in capabilities between relevant parts of our C++ codebases.

There may be other mutual benefits beyond that, potentially including methods for generating initial clustering (block) estimates.
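To illustrate what I mean by simple representations (a sketch only; the column names here are just how I tend to lay things out, not a fixed interface), an edge list mirroring a three-column data.frame could be passed to or from C++ as nothing more than:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of an edge-list interchange structure mirroring a three-column
// data.frame of edge properties (from, to, d). Column names are
// illustrative only, not part of any actual interface.
struct EdgeList {
  std::vector<std::string> from;  // source node labels
  std::vector<std::string> to;    // target node labels
  std::vector<double> d;          // edge weight / distance / covariance
  std::size_t n_edges() const { return d.size(); }
};
```

Anything that minimal should be trivial to construct from either package's R-side objects.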
spatialcluster also implements what I'm calling "full" or "complete" clustering, which uses all edges between all nodes to generate initial estimates, rather than "conventional" approaches which reduce the edge set to some suitably minimal form (via minimal spanning trees or whatever). I'm not sure how your initial estimates are generated, but there could be scope for overlap and mutual benefit there too.

Note that spatialcluster is currently semi-dormant, but I definitely aim to resurrect it as soon as I can, and in particular to get it onto CRAN.