Closed murrayds closed 4 years ago
First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.
If there is a power-law relationship, that suggests the gravity law.
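A minimal sketch of this check, assuming the gravity law has the form F12 = G * P1 * P2 / r^d, so that log(P1 * P2 / F12) is linear in log(r) with slope d. The function names and the toy data are hypothetical, not taken from the project code.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

def power_law_exponent(pairs):
    """pairs: iterable of (distance, p1, p2, f12), all strictly positive."""
    pairs = list(pairs)
    log_r = [math.log(r) for r, _, _, _ in pairs]
    log_ratio = [math.log(p1 * p2 / f12) for _, p1, p2, f12 in pairs]
    return fit_line(log_r, log_ratio)

# Synthetic data (an assumption for illustration): flows follow F12 = P1 * P2 / r^2 exactly.
toy = [(r, 100.0, 50.0, 100.0 * 50.0 / r ** 2) for r in (1.0, 2.0, 5.0, 10.0, 50.0)]
slope, intercept, r2 = power_law_exponent(toy)
```

On real data, a roughly straight line on the log-log plot with a high R^2 would be the signal to look for; the slope is the estimated distance exponent.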
@yy
First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.
Question: Should the population of an institution be the total number of researchers or the total number of mobile researchers? As of now, the data only contains researchers with > 2 affiliations, but I can get all researchers per institution. I think that both make theoretical sense, but we should choose one.
Good question! I think the base model is assuming that the total population is proportional to the mobile population. Then the scaling won’t change.
That makes sense! I'll use the mobile researchers for now, and later we can look at all researchers in the dataset (both mobile and not) as a robustness check.
But I think it's good to be aware of this issue. For instance, if there is a strong bias (e.g. country x is much more mobile than country y), it may bias the results.
Made a new page to formalize these sorts of issues on the wiki: https://github.com/murrayds/sci-mobility-emb/wiki/Considerations
First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.
@yy but what about cases when there is no flow? Would we simply exclude these distances from the calculation, or perhaps set some baseline F12 value?
Impute with 1? And compare with the case where we ignore them?
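A small sketch of the two options, so they can be compared side by side. The function name, the `zero_flow` argument, and the toy tuples are hypothetical.

```python
import math

def log_ratios(pairs, zero_flow="impute"):
    """Return log(P1 * P2 / F12) for each (p1, p2, f12) pair.

    zero_flow="impute" replaces a zero flow with 1;
    zero_flow="drop" skips pairs with zero flow entirely.
    """
    if zero_flow not in ("impute", "drop"):
        raise ValueError("zero_flow must be 'impute' or 'drop'")
    out = []
    for p1, p2, f12 in pairs:
        if f12 == 0:
            if zero_flow == "drop":
                continue
            f12 = 1  # baseline flow for pairs with no observed movement
        out.append(math.log(p1 * p2 / f12))
    return out

# Toy data (made up): two of the four pairs have no observed flow.
toy = [(10, 20, 4), (30, 5, 0), (8, 8, 2), (50, 40, 0)]
imputed = log_ratios(toy, "impute")
dropped = log_ratios(toy, "drop")
```

Plotting both versions against distance shows whether the imputation changes the fitted relationship.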
@yy some really preliminary results with the gravity model, comparing logged geographic distance between institutions to P1 P2 / F12 (Left), and cosine similarity from embedding to P1 P2 / F12 (Right). It seemed strange to log the cosine similarity, so I did not, but I am likely wrong. This plot is for all data, with "missing" flows imputed with 1. It is actually a hex-bin plot, rather than a scatterplot, with ~100 bins in each direction.
This second plot is where we ignore missing flows, though there isn't much difference.
How about cosine similarity vs. p1 p2 / f12?
Whoops, the graph on the right is mislabeled; it is actually P1 * P2 / F12. I included it again below, with the correct y-axis label. I'll work on making it prettier and clearer tomorrow.
Or, if an exponential distribution is true, we can apply the exponential gravity model https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1538-4632.1975.tb01023.x
It is simple: the distance-decay function is an exponential function (e^-r) instead of a power-law function (r^-d).
Second iteration. Gravity is now defined as F12 / (P1 * P2). Gravity is logged (x-axis). Geographic distance is logged (y-axis, left). For discussion on very small distances, see issue #18.
Red line is a linear model, y ~ x, whereas the dashed black line is a quadratic curve, y ~ x + x^2. Both are fitted to the log-transformed data.
Also, you can create bins in the cosine similarity and calculate & display the mean and the standard error for the log ratio.
I wasn't sure if I understood completely, but this is what I came up with:
X-axis is the binned cosine similarity, 20 increments. Note that cosine similarity can be negative; because we are calculating log ratios, I excluded negative non-zero similarities.
Y-axis is the log ratio of the cosine similarity (cos(theta)) over the gravity (F12 / (P1 * P2)). Note that, unlike in the first figure, here I use the raw, non-logged value.
Points map to the mean value of the log-ratio within each bin. Error bars correspond to the mean +- the standard error.
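For reference, a sketch of how the binned mean and standard error could be computed, assuming 20 equal-width bins over [0, 1] so that negative similarities are dropped. The function name and the toy values are hypothetical.

```python
import math
from collections import defaultdict

def binned_mean_se(xs, ys, n_bins=20, lo=0.0, hi=1.0):
    """Group ys by equal-width bins of xs; return (bin_center, mean, std_error, n) per non-empty bin."""
    width = (hi - lo) / n_bins
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        if not (lo <= x <= hi):
            continue  # e.g. negative cosine similarities fall outside [0, 1]
        idx = min(int((x - lo) / width), n_bins - 1)  # x == hi goes in the last bin
        groups[idx].append(y)
    rows = []
    for idx in sorted(groups):
        vals = groups[idx]
        n = len(vals)
        mean = sum(vals) / n
        if n > 1:
            var = sum((v - mean) ** 2 for v in vals) / (n - 1)  # sample variance
            se = math.sqrt(var / n)
        else:
            se = float("nan")  # standard error is undefined for a single point
        rows.append((lo + (idx + 0.5) * width, mean, se, n))
    return rows

# Toy data (made up): three points land in the first bin, one in the last.
sims = [0.01, 0.02, 0.03, 0.99]
log_ratio = [1.0, 2.0, 3.0, 5.0]
rows = binned_mean_se(sims, log_ratio)
```

The error bars in the plot would then run from mean - se to mean + se for each bin.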
Red line is a linear model, y ~ x, whereas the dashed black line is a quadratic curve, y ~ x + x^2. Both are fitted to the log-transformed data.
Is there any specific reason that you tried the quadratic function?
And, can you also show the R^2 and coefficient?
Here it is again. The relationship looked somewhat quadratic to me, but it really isn't, especially after looking at the R^2. Below I plot just the linear model and show the R^2 on the plot.
Amazing!
I checked several empirical studies that use the gravity model, and R^2 = 0.41 is pretty high, I guess. Also, it seems like the distance-decay function has an exponential form.
Can you elaborate on this? Is the exponential form of the distance-decay function (y ~ e^(-r)) equivalent or not equivalent to the linear model on the log-transformed data (log(y) ~ x, as in the previous plot)?
I tried fitting a non-linear decay function to the data (included here), and while I get something that looks sort of like a fit, it is difficult to evaluate. The linear model gives us an R2 which makes things a little more clear.
Can you elaborate on this? Is the exponential form of the distance-decay function (y ~ e^(-r)) equivalent or not equivalent to the linear model on the log-transformed data (log(y) ~ x, as in the previous plot)?
Same. If the data has a linear relationship between log-transformed y and linear-scale x, it follows an exponential form: log(y) = ax + b, so y = e^(ax + b).
The exponential gravity model is simple: F12 ~ P1 P2 e^(-r). In our case, log(F12 / (P1 * P2)) and the linear-scale cosine similarity have a linear relationship, so I think it might fit the exponential gravity model.
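One way to compare the two decay forms is to fit both on the same data and look at R^2: log(gravity) ~ r for the exponential form, and log(gravity) ~ log(r) for the power-law form. This is only a sketch; the function names and the synthetic data are hypothetical.

```python
import math

def r_squared(xs, ys):
    """R^2 of the least-squares line y ~ x (equal to the squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def compare_decay_forms(dist, gravity):
    """dist and gravity must be strictly positive; gravity = F12 / (P1 * P2)."""
    log_g = [math.log(g) for g in gravity]
    log_r = [math.log(r) for r in dist]
    return {
        "exponential": r_squared(dist, log_g),  # log(g) = a * r + b
        "power_law": r_squared(log_r, log_g),   # log(g) = -d * log(r) + b
    }

# Synthetic data (an assumption for illustration): an exact exponential decay.
dist = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
gravity = [2.0 * math.exp(-1.3 * r) for r in dist]
scores = compare_decay_forms(dist, gravity)
```

Whichever form gives the clearly higher R^2 on the real data is the better candidate for the distance-decay function.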
Updated fit with cleaned data and updated geographic coordinates. The results didn't change too much, which is good! Embedding distance still seems to explain actual flow better than geographic distance.
Another idea I have in mind is finding strong relationships between pairs of institutions. If we treat the expected flow from the gravity model as a null model, we can identify strong interactions with Actual_flow/expected_flow. Can we try this idea?
I like this idea, a way of identifying institutions that are more strongly related than we would expect. I will work up a visualization of this when I get back from Michigan.
Also, we can extend this to different scales, for groups of institutions, comparing the distribution of expected flows vs. the distribution of actual flows between two cities/regions/countries. For example, maybe Boston and Seoul have more-than-expected flow, whereas Boston and Beijing do not.
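A rough sketch of the actual/expected ratio, assuming the null model is the fitted exponential form log(F12 / (P1 * P2)) = intercept + slope * x, where x is whichever distance or similarity was used in the fit. The institution names, the coefficients, and the function name are all made up for illustration.

```python
import math

def flow_ratios(pairs, intercept, slope):
    """pairs: (name_a, name_b, p1, p2, x, f12).

    Returns (actual / expected, name_a, name_b) tuples, largest ratio first.
    """
    scored = []
    for a, b, p1, p2, x, f12 in pairs:
        expected = p1 * p2 * math.exp(intercept + slope * x)  # null-model flow
        scored.append((f12 / expected, a, b))
    return sorted(scored, reverse=True)

# Hypothetical fitted coefficients and toy pairs.
intercept, slope = math.log(0.01), -1.0
toy = [
    ("A", "B", 100, 50, 1.0, 40),
    ("A", "C", 100, 20, 2.0, 2),
    ("B", "C", 50, 20, 0.0, 10),
]
ranked = flow_ratios(toy, intercept, slope)
```

Ratios well above 1 flag pairs with more flow than the null model expects. The same ratio could be aggregated over cities, regions, or countries by summing actual and expected flows within each group before dividing.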
Another way to think about it is just building an accurate mobility prediction model. How well can we predict if we include the geography, embedding, and other information?
Yeah, it could be a link prediction problem, I will think about this problem. It might be a temporal link prediction problem in an evolving network.
Would it be simpler to model it as a linear regression, using the actual flow as the response variable, and geographic_distance + embedding_distance (and other variables) as predictors?
Most pairs of organizations have at least one person connecting them, so predicting the expected number of flows between two institutions seems more relevant.
Probably not vanilla linear regression, because we want to inform the model with the functional form (gravity law) that we already have clues about.
I think the main reason the gravity model with embedding distance fits well is that embedding distance captures international flows well. So, how about looking into the correlation between geographical distance and embedding distance first?
I feel like all of these can be the next step after establishing the gravity law (or any other law).
Made a little progress on the plot with binned cosine similarity. Here I used 20 bins, and plotted the mean and 99% confidence intervals for the logged gravity value within each bin.
Another way to think about it is just building an accurate mobility prediction model. How well can we predict if we include the geography, embedding, and other information?
There is some research on mobility prediction in traffic engineering. These models usually use one gravity term and a meta-info term. Like this. For an accurate mobility prediction model, we could use two gravity terms plus meta-info terms. Does that make sense?
Could you elaborate on what a meta-info term is?
It may depend on the dataset. In our case, we could use a dummy variable: if two institutes are in countries that use the same language, set x = 1; otherwise x = 0. It can be any interaction term between two institutes.
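To make this concrete, here is a sketch of one possible form, assuming exponential decay in both distances: log(F12 / (P1 * P2)) = b0 + b1 * d_geo + b2 * d_emb + b3 * same_language. This functional form is an assumption, not something we have settled on, and all names and numbers below are made up. The fit uses plain least squares via the normal equations to keep the example dependency-free.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, targets):
    """Least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def design_row(d_geo, d_emb, same_language):
    """Intercept, two distance terms, and the same-language dummy (the meta-info term)."""
    return [1.0, d_geo, d_emb, 1.0 if same_language else 0.0]

# Toy observations (made up): (geographic distance, embedding distance, same language?).
obs = [(1.0, 0.2, True), (2.0, 0.9, False), (0.5, 0.4, False),
       (3.0, 0.1, True), (1.5, 0.7, True), (2.5, 0.3, False)]
true_beta = [-2.0, -0.5, -1.5, 0.8]  # hypothetical coefficients used to generate targets
rows = [design_row(*o) for o in obs]
targets = [sum(b * v for b, v in zip(true_beta, row)) for row in rows]  # log(F12 / (P1 * P2))
beta = fit_ols(rows, targets)
```

On noise-free toy data the fit recovers the generating coefficients. On real data, the sign and size of the dummy coefficient would indicate whether sharing a language is associated with more flow than the two distance terms alone predict.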
That sounds interesting, especially in light of these plots. The embedding space seems to perform better at representing actual flows than does geographic distance not only globally, but also within countries and regions.
The plot below includes only pairs of organizations that are within the same country.
And within the same region
It also performs well at within-city mobility, but the quality of geographic distance is quite poor at that level.
We can also look at pairs of organizations that are in -different- countries, where geographic distance seems to completely fail.
And different regions
and different cities
So, in sum, the embedding distance is fairly robust at multiple scales, seemingly more so than geographic distances.
Amazing work! Especially at the international scale, it seems like embedding distance works very well (almost 4.8 times)!!
Here is an updated version of the gravity vs. distance plot—thoughts?
pretty cool
How about something like this? Smaller plot area, bigger font, no hex boundaries. LaTeX font is a little tricky with ggplot, so while I can't get the exact same natural font, I can italicize the equation. If need be, we can edit the font in Illustrator later.
Construct a gravity model in an attempt to compare embedding distances with geographic distance