Open jowens opened 8 years ago
(This is not something that needs to be addressed in the short term. No hurry.)
Thanks for the detailed description. It is certainly something I would like to work on!
However, I think first I need to do some background reading on a few things to better understand the data I would be analyzing. I understand most of the things you are talking about, but I feel that there are some gaps in my knowledge. Would the papers on Gunrock be a good starting point?
Yes, of course, but in terms of this particular problem, just 5 minutes with me or @yzhwang would explain what you need to know. (This is the heatmap plot data that was in my hacky script.)
Background: We can exhaustively run dobfs with different values of alpha/beta. How do we pick the right alpha/beta given a particular graph? I am not good at this, so I asked @hafen a few months back, who gave me great guidance, then I didn't do anything with it, so I'm posting it here.
@ffarhour I'm assigning it to you, but if you don't get to it this spring, no worries. I just need to write it down here.
Thanks @hafen!
For the simple question, if the simulation is deterministic (same rate for same alpha and beta every time) and you don’t expect any variability in the results, the obvious thing to do of course is to choose the pair of parameters that gives the minimum metric (geometric mean sounds good).
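A minimal sketch of that deterministic case, assuming the results are available as `(dataset, alpha, beta, metric)` tuples where lower is better. The tuple layout, the `best_pair` name, and the numbers are made up for illustration and are not taken from the actual dobfs runs.

```python
from collections import defaultdict
from statistics import geometric_mean


def best_pair(results):
    """Return the (alpha, beta) pair with the lowest geometric-mean metric.

    results: iterable of (dataset, alpha, beta, metric) tuples, metric > 0.
    """
    by_pair = defaultdict(list)
    for _dataset, alpha, beta, metric in results:
        by_pair[(alpha, beta)].append(metric)
    # One score per parameter pair: geometric mean across data sets.
    scores = {pair: geometric_mean(vals) for pair, vals in by_pair.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical measurements for two data sets and three parameter pairs.
results = [
    ("graph_a", 1, 1, 4.0), ("graph_b", 1, 1, 9.0),
    ("graph_a", 1, 2, 2.0), ("graph_b", 1, 2, 8.0),
    ("graph_a", 2, 1, 5.0), ("graph_b", 2, 1, 5.0),
]
best, scores = best_pair(results)
print(best, scores[best])
```

The geometric mean keeps one data set with large absolute values from dominating the choice, which is presumably why it was suggested as the metric.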
However, if there is variability or if you want to get some insight into how different parameter settings are affecting the result, I’d recommend making some plots. For example, I’d plot rate vs. alpha faceted on beta, with points colored by data set, giving 19 panels. If you can squeeze all 19 panels into one row and still see what is going on, that would be good. When examining a single panel, this will help you see, for a given beta, whether the minimum occurs at the edges or within the range of the alpha values, etc. You can also see how much variability there is across data sets within each panel. When examining across panels, you can see how the rate behaves in general for different beta. You can make the same plots with the roles of alpha and beta reversed.
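One way to sketch the faceted plot, again assuming `(dataset, alpha, beta, rate)` tuples. The helper names `panels_by_beta` and `plot_panels` are hypothetical, the data is invented, and matplotlib is only one possible plotting library; it is imported lazily so the grouping step works without it.

```python
from collections import defaultdict


def panels_by_beta(results):
    """Group results into one panel per beta: {beta: {dataset: [(alpha, rate), ...]}}."""
    panels = defaultdict(lambda: defaultdict(list))
    for dataset, alpha, beta, rate in results:
        panels[beta][dataset].append((alpha, rate))
    # Sort each series by alpha so points appear left to right.
    return {
        beta: {ds: sorted(pts) for ds, pts in series.items()}
        for beta, series in sorted(panels.items())
    }


def plot_panels(panels, path):
    """Draw rate vs. alpha, one panel per beta in a single row, colored by data set."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    datasets = sorted({ds for series in panels.values() for ds in series})
    color = {ds: f"C{i % 10}" for i, ds in enumerate(datasets)}
    fig, axes = plt.subplots(1, len(panels), sharey=True, squeeze=False,
                             figsize=(2.5 * len(panels), 3))
    for ax, (beta, series) in zip(axes[0], panels.items()):
        for ds, pts in series.items():
            xs, ys = zip(*pts)
            ax.plot(xs, ys, "o", color=color[ds], label=ds)
        ax.set_title(f"beta = {beta}")
        ax.set_xlabel("alpha")
    axes[0][0].set_ylabel("rate")
    axes[0][0].legend(fontsize="small")
    fig.savefig(path, bbox_inches="tight")


# Hypothetical measurements: two data sets, two alpha values, two beta values.
results = [
    ("graph_a", 2, 1, 3.0), ("graph_a", 1, 1, 4.0),
    ("graph_b", 1, 1, 9.0), ("graph_b", 2, 1, 7.0),
    ("graph_a", 1, 2, 2.0), ("graph_b", 1, 2, 8.0),
]
panels = panels_by_beta(results)
```

Calling `plot_panels(panels, "rate_vs_alpha.png")` would write the figure. Swapping the positions of alpha and beta in the input tuples gives the reversed plots (rate vs. beta, faceted on alpha), though the axis labels and panel titles in `plot_panels` would need to be swapped too.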
If you see enough variability in the plots, you may determine that simply choosing the minimum metric might not be a stable approach (outliers could be chosen as the minimum when the true minimum appears to be somewhere else upon visual inspection, etc.). You can use the plots to help determine whether there should be some smoothing prior to computing the metric. For example, if for a given value of beta, the rate looks like a smooth function of alpha, but the data exhibits a smooth curve plus noise, you can smooth out the noise and use the resulting smooth curve as your data. This could be per data set or across all data sets depending on what the plot looks like.
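To illustrate how smoothing can change which alpha is picked, here is a small sketch for a single beta and a single data set. It uses a running median with a window of 3, which is just one possible smoother chosen for simplicity (a loess or spline fit would be other options); the numbers are invented, with a deliberate outlier at alpha = 3.

```python
from statistics import median


def running_median(values, half_window=1):
    """Smooth a sequence with a centered running median (window shrinks at the edges)."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


def argmin(xs, ys):
    """Return the x whose y is smallest (first one on ties)."""
    i = min(range(len(ys)), key=ys.__getitem__)
    return xs[i]


# Hypothetical rate vs. alpha for one beta; the 1.0 at alpha = 3 is a noisy outlier.
alphas = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rates = [7.0, 6.0, 1.0, 5.0, 4.0, 3.0, 3.5, 5.0, 6.0]

raw_choice = argmin(alphas, rates)
smooth_choice = argmin(alphas, running_median(rates))
print(raw_choice, smooth_choice)
```

On this made-up series the raw minimum lands on the outlier, while the smoothed curve puts the minimum in the trough that a visual inspection would suggest.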
Hopefully that makes some sense and is going after what you were asking.
For the more complex thing, if I understand correctly, you’d like to, for a given set of characteristics you know about a data set, be able to choose the appropriate alpha and beta without running the simulation. In this case, you can use the 10 data sets you have to build a model. The inputs will be the 10 sets of vertex and edge counts, and the outputs will be the best alpha and beta for each data set, found with whatever procedure you followed above. You then train a model on these 10 observations that predicts alpha and beta for a new vertex and edge count. This too will be easiest to approach with some simple plots. For example, plots of vertex count vs. alpha, vertex count vs. beta, edge count vs. alpha, edge count vs. beta. This will help you start to see if there is a clear relationship between pairs of the inputs and outputs and whether it appears that alpha and beta might be modeled independently, and will help determine what kind of model might be appropriate (do the relationships look linear, etc.). At the simplest end of the spectrum, you might find that you can fit a simple model independently for alpha and beta. But it is also possible to model alpha and beta jointly with a multiple dependent variable model. A big issue will be whether the model you fit will be valid when extrapolated beyond the inputs the model has been trained on. You may need more data to train on, perhaps spanning a grid of edge and vertex count values you are interested in.
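At the simplest end of that spectrum, a sketch might fit a straight line for one output against one input and flag predictions that extrapolate. Everything here is an assumption for illustration: the log10 scale for vertex count, modeling alpha alone, the `make_predictor` name, and the training numbers. Whether a linear fit is appropriate at all is exactly what the plots are meant to show.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_predictor(vertex_counts, best_values):
    """Fit best parameter value vs. log10(vertex count).

    The returned function gives (prediction, inside_training_range).
    """
    xs = [math.log10(v) for v in vertex_counts]
    slope, intercept = fit_line(xs, best_values)
    lo, hi = min(xs), max(xs)

    def predict(vertex_count):
        x = math.log10(vertex_count)
        return slope * x + intercept, lo <= x <= hi

    return predict


# Hypothetical training data: best alpha found for four graphs of different sizes.
vertex_counts = [10**3, 10**4, 10**5, 10**6]
best_alpha = [2.0, 4.0, 6.0, 8.0]

predict_alpha = make_predictor(vertex_counts, best_alpha)
inside = predict_alpha(10**5)   # within the range the model was trained on
outside = predict_alpha(10**8)  # extrapolation: treat the value with suspicion
```

The second element of the returned pair is a crude guard for the extrapolation concern: it only says whether the input lies within the trained range, not whether the prediction is good.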
That’s a bit long-winded. Hard to tell you what the right thing to do is without seeing the data, but these are some guidelines.