murrayds closed this issue 4 years ago
- what are the two rows?
If we get coefficients from the regression model, we can calculate the expected flow from the gravity model. In the upper figure, the x values are the expected values from the fitted model, and the y values are the values from the real data. If the model were perfect, all the data points would lie on y=x.
The first row is the result with geographic distance, and the second row is the result with embedding distance.
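As a rough illustration of how an expected flow follows from fitted coefficients, here is a minimal sketch. It assumes a power-law gravity model fitted as a log-linear regression, which the thread does not spell out; the function name `expected_flow` and the coefficient keys are hypothetical.

```python
import math

def expected_flow(mass_i, mass_j, distance, coef):
    """Expected flow under an ASSUMED log-linear gravity model:

    log F = intercept + alpha*log(m_i) + beta*log(m_j) - gamma*log(d)

    The functional form and the coefficient names are illustrative
    assumptions, not the model actually fitted in this project.
    """
    log_flow = (coef["intercept"]
                + coef["alpha"] * math.log(mass_i)
                + coef["beta"] * math.log(mass_j)
                - coef["gamma"] * math.log(distance))
    return math.exp(log_flow)

# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
coef = {"intercept": 0.0, "alpha": 1.0, "beta": 1.0, "gamma": 2.0}
print(expected_flow(10.0, 20.0, 5.0, coef))  # roughly 8, i.e. 10 * 20 / 5**2
```

The same function would be called once with geographic distance and once with embedding distance to get the x values for each row.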
what do the red and blue mean?
We can calculate the 25th and 75th percentiles of each bin; if this range touches the y=x line, we can conclude that the model is good enough. In that sense, red means the range failed to touch y=x, and blue means it touches y=x, which means the model is good.
This is exactly the approach used in Barabasi's radiation model paper, except that they use the 9th to 91st percentiles. We think they tuned the parameters to make the plot look nice.
Also, I agree with YY: we need to make the plot square.
I also had a short meeting with Inho, the first author of the Korean Bus Network paper. He said that R^2 on the Predicted vs. Actual plot is not a good measure for the gravity model; R^2 on the previous plot is the important measure for the gravity model.
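The coloring rule described above can be sketched as follows. This is only an illustration: it assumes "touches y=x" means the percentile range of actual values overlaps the bin's own range of predicted values, and it uses a simple linear-interpolation percentile. The names `percentile` and `bin_color` are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def bin_color(bin_left, bin_right, actual_values, lower_q=25, upper_q=75):
    """Return "blue" if the percentile range of the actual values overlaps
    the segment of y=x inside the bin, else "red".

    Inside a bin spanning [bin_left, bin_right] on the predicted axis,
    the y=x line covers the same interval on the actual axis, so the
    check reduces to an interval-overlap test (an assumed reading of
    "touches y=x").
    """
    low = percentile(actual_values, lower_q)
    high = percentile(actual_values, upper_q)
    return "blue" if (low <= bin_right and high >= bin_left) else "red"

actual = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # toy actual fluxes in one bin
print(bin_color(8, 10, actual))          # 25th-75th range
print(bin_color(8, 10, actual, 9, 91))   # wider 9th-91st range
```

In this toy bin the 25th to 75th range misses the line while the 9th to 91st range reaches it, which is the sense in which the wider range is more forgiving.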
The box is too clunky. What's the circle inside the box? The mean? It can be just a dot rather than a dot and a circle.
Below are a couple of versions that we can pursue:
Both the Radiation Model and the Korean Bus Network papers used these boxplots, but it's hard to make them look nice. Below is a version with some changes to color that might highlight key findings.
Here, the black dot is the mean, the black bar in the box is the median, the upper and lower edges of the box are the 25th and 75th percentiles, and the whiskers extend to the 9th and 91st percentiles.
Blue indicates that the prediction is "good", defined by the x = y line crossing between the 25th and 75th percentiles.
Another option is a point + whisker plot. It obscures the distribution somewhat, but is a little cleaner.
Another alternative would be to simply remove the scatterplots and stick with the binned measures, either a boxplot or a point + whisker. A scatterplot of predicted vs. actual could be included in the supporting materials.
I experimented with a hexbin and heatmap, but they were too busy to interpret.
I think approach2 is better?
Actually, I'd vote for the first one. We can use the same 9th and 91st percentiles to color the bins, following the Barabasi paper. The box will still show the quartiles. We can use even more transparency for the grey dots. I think we probably want bigger fonts.
So the Korean Bus Network paper colors by the 25th and 75th percentile. Coloring here by the 9th and 91st percentile, as in the Barabasi paper, might be too inclusive, and would show nearly everything as blue, as in the two plots below (top for geo distance global, bottom for cosine similarity global)
I think we can see clearer patterns in the embedding-based gravity model, but the 9th and 91st percentiles make the two look similar. I asked Inho, and he said that he followed the Barabasi paper but narrowed the percentile range because it was too broad.
Here is an updated version of the figure.
The black circles represent the mean; following the Barabasi and Bus Network papers, I made them empty circles, which don't obscure the median or bevels.
It is a minor thing, but we need to change the axes (y value as the model value, x value as the actual value from data). @murrayds See,
No! we want prediction as the x-axis.
Is it better? In mobility papers in general, the predicted value is plotted on the y-axis, even in the Radiation Model paper.
Perhaps we think of it like this: which sentence sounds better:
prediction on x-axis: "In the first bin of the predicted value, the actual values were between a and b"
prediction on y-axis: "In the first bin of the actual values, the predicted values were between a and b".
Here is what it would look like if predicted is on the y-axis, and actual is on the x-axis
I think it is not a big problem, but predicted as x-axis looks prettier
x-axis it is
I think the convention is almost always, across fields: independent variable (x) -> dependent variable (y).
Based on this convention, it is much more natural to have the prediction as the "independent variable" and the actual as the "dependent variable".
Here is a first draft of the predicted vs. actual plots; note that the top labels won't appear in the final figure. Top = predicted with geographic distance; bottom = predicted with cosine similarity. Points represent pairs of organizations. The black diagonal line is the x=y line. 30 bins are drawn (fewer bins tend to obscure where cosine similarity does better). Boxes are colored blue if the x = y line falls between the 25th and 75th percentiles (as in the Korean Bus Network paper). Whiskers on the boxplot extend to the 9th and 91st percentiles. The line in each box represents the median. The black dot overlaid onto each box is the mean of the actual fluxes in that bin.
@yy and @jisungyoon , thoughts?
and a close-up of the global cosine-similarity plot
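For reference, the per-bin quantities in the figure notes above (mean, median, 25th/75th box, 9th/91st whiskers) could be computed along these lines. This is a sketch under assumptions: the thread does not say how the 30 bins are spaced, so log-spaced bins on the predicted flux are assumed here, and `summarize_bins` and `percentile` are hypothetical helper names.

```python
import math
from statistics import mean, median

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def summarize_bins(predicted, actual, n_bins=30):
    """Group actual fluxes by bin of predicted flux and summarize each bin.

    ASSUMPTION: bins are equally spaced in log10(predicted); predicted
    values must be positive and not all identical. Empty bins are skipped.
    """
    log_min = math.log10(min(predicted))
    log_max = math.log10(max(predicted))
    width = (log_max - log_min) / n_bins
    if width == 0:
        raise ValueError("predicted values must span a nonzero range")
    groups = {}
    for p, a in zip(predicted, actual):
        # Clamp so the maximum predicted value lands in the last bin.
        idx = min(int((math.log10(p) - log_min) / width), n_bins - 1)
        groups.setdefault(idx, []).append(a)
    summaries = []
    for idx in sorted(groups):
        vals = groups[idx]
        summaries.append({
            "bin_left": 10 ** (log_min + idx * width),
            "bin_right": 10 ** (log_min + (idx + 1) * width),
            "mean": mean(vals),                                       # black dot
            "median": median(vals),                                   # line in box
            "box": (percentile(vals, 25), percentile(vals, 75)),      # box edges
            "whiskers": (percentile(vals, 9), percentile(vals, 91)),  # whisker ends
        })
    return summaries

# Toy data: four organization pairs, three bins.
for row in summarize_bins([1, 5, 50, 1000], [2, 4, 60, 900], n_bins=3):
    print(row)
```

Each returned row carries the bin edges, so the blue/red decision can be made by comparing the `box` range with `bin_left` and `bin_right`.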