murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

Add Gravity model #13

Closed murrayds closed 4 years ago

murrayds commented 5 years ago

Construct a gravity model to compare embedding distances with geographic distance.

yy commented 5 years ago

First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.

yy commented 5 years ago

If there is a power-law relationship, that suggests the gravity law.
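A minimal sketch of this check, assuming the flow follows F12 = P1 x P2 / d^gamma, so that P1 x P2 / F12 grows as d^gamma and the slope on log-log axes estimates gamma. The toy populations, distances, and the helper name `loglog_slope` are made up for illustration and are not from the project data.

```python
import math

def loglog_slope(distances, ratios):
    """Least-squares slope and intercept of log(ratio) against log(distance)."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data (hypothetical): (P1, P2, distance), with flows built to follow d^-2.
pairs = [(100, 200, 10.0), (150, 120, 50.0), (300, 80, 200.0), (90, 400, 1000.0)]
flows = [p1 * p2 / d ** 2 for p1, p2, d in pairs]

distances = [d for _, _, d in pairs]
ratios = [p1 * p2 / f for (p1, p2, _), f in zip(pairs, flows)]
slope, intercept = loglog_slope(distances, ratios)
print(f"estimated exponent: {slope:.3f}")
```

A roughly straight line on the log-log plot, with a stable slope, is what would suggest the power-law form; strong curvature would point to a different decay function.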

murrayds commented 5 years ago

@yy

First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.

Question: Should the population of an institution be the total number of researchers or the total number of mobile researchers? As of now, the data only contains researchers with > 2 affiliations, but I can get all researchers per institution. I think that both make theoretical sense, but we should choose one.

yy commented 5 years ago

Good question! I think the base model is assuming that the total population is proportional to the mobile population. Then the scaling won’t change.

murrayds commented 5 years ago

That makes sense! I'll use the mobile researchers for now, and later we can look at all researchers in the dataset (both mobile and not) as a robustness check.

yy commented 5 years ago

But I think it’s good to be aware of this issue. For instance, if there is a strong bias (e.g. country x is much more mobile than country y), then it may bias the results.

murrayds commented 5 years ago

Made a new page to formalize these sorts of issues on the wiki: https://github.com/murrayds/sci-mobility-emb/wiki/Considerations

murrayds commented 5 years ago

First, you can just plot the relationship between the "distance" and P1 x P2 / F12, where P1 and P2 are the populations and F12 is the actual flow.

@yy but what about cases when there is no flow? Would we simply exclude these distances from the calculation, or perhaps set some baseline F12 value?

yy commented 5 years ago

Impute with 1? And compare with the case where we ignore them?
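A small sketch of the two options, assuming each pair is a (P1, P2, F12) tuple. The function name `gravity_ratios` and the toy numbers are hypothetical.

```python
def gravity_ratios(pairs, impute=None):
    """Return P1 * P2 / F12 for each pair.

    pairs: iterable of (p1, p2, flow) tuples.
    impute: value substituted for zero flows; if None, zero-flow pairs are dropped.
    """
    ratios = []
    for p1, p2, flow in pairs:
        if flow == 0:
            if impute is None:
                continue  # ignore pairs with no observed flow
            flow = impute  # treat "no flow" as a baseline flow
        ratios.append(p1 * p2 / flow)
    return ratios

# Toy data (hypothetical): the second pair has no observed flow.
pairs = [(10, 20, 4), (5, 8, 0), (30, 10, 6)]
imputed = gravity_ratios(pairs, impute=1)
dropped = gravity_ratios(pairs)
print(len(imputed), len(dropped))
```

Running the same plot or fit on both lists shows how sensitive the result is to the zero-flow pairs.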

murrayds commented 5 years ago

@yy some really preliminary results with the gravity model, comparing logged geographic distance between institutions to P1 P2 / F12 (Left), and cosine similarity from embedding to P1 P2 / F12 (Right). It seemed strange to log the cosine similarity, so I did not, but I am likely wrong. This plot is for all data, with "missing" flows imputed with 1. It is actually a hex-bin plot, rather than a scatterplot, with ~100 bins in each direction.

image

This second plot is where we ignore missing flows, though there isn't much difference.

image

yy commented 5 years ago

How about cosine similarity vs. p1 p2 / f12?

murrayds commented 5 years ago

Whoops, the graph on the right is mislabeled; it is actually P1 * P2 / F12. I included it again below, with the correct y-axis label. I'll work on making it prettier and clearer tomorrow.

image

yy commented 5 years ago
jisungyoon commented 5 years ago

Or, if an exponential distribution is true, we can apply the exponential gravity model https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1538-4632.1975.tb01023.x

It is simple: the distance-decay function is an exponential function (e^-r) instead of a power-law function (r^-d).

murrayds commented 5 years ago

Second iteration. Gravity is now defined as F12 / (P1 * P2). Gravity is logged (x-axis). Geographic distance is logged (y-axis, left). For discussion on very small distances, see issue #18.

Red line is a linear model, y ~ x, whereas the dashed black line is a quadratic curve, y ~ x + x^2. Both are fitted to the log-transformed data.

image

Also, you can create bins in the cosine similarity and calculate & display the mean and the standard error for the log ratio.

I wasn't sure if I understood completely, but this is what I came up with:

image

X-axis is the binned cosine similarity, 20 increments. Note that cosine similarity can be negative; because we are calculating log ratios, I excluded negative non-zero similarities.

Y-axis is the log ratio of the cosine similarity (cos(theta)) over the gravity (F12 / (P1 * P2)). Note that, unlike in the first figure, here I use the raw, non-logged value.

Points map to the mean value of the log-ratio within each bin. Error bars correspond to the mean ± the standard error.
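For reference, a rough sketch of the binning step, assuming equal-width bins over (0, 1] and the sample standard deviation divided by sqrt(n) as the standard error. The function name `binned_mean_se` and the toy values are hypothetical, not the actual plotting code.

```python
import math
from statistics import mean, stdev

def binned_mean_se(similarities, values, n_bins=20, lo=0.0, hi=1.0):
    """Mean and standard error of `values` within equal-width bins of `similarities`."""
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for s, v in zip(similarities, values):
        if not (lo < s <= hi):
            continue  # skip similarities outside (lo, hi], e.g. non-positive ones
        idx = min(int((s - lo) / width), n_bins - 1)
        bins[idx].append(v)
    summary = []
    for i, members in enumerate(bins):
        if len(members) < 2:
            continue  # a standard error needs at least two points
        se = stdev(members) / math.sqrt(len(members))
        summary.append((lo + (i + 0.5) * width, mean(members), se))
    return summary

# Toy data (hypothetical): cosine similarities and a log-ratio value per pair.
similarities = [0.12, 0.13, 0.14, 0.62, 0.63, -0.2]
log_ratios = [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
summary = binned_mean_se(similarities, log_ratios)
for center, avg, se in summary:
    print(f"bin center {center:.3f}: mean {avg:.2f} +/- {se:.2f}")
```

Each tuple is (bin center, mean, standard error), which maps directly onto points with error bars.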

jisungyoon commented 5 years ago

Red line is a linear model, y ~ x, whereas the dashed black line is a quadratic curve, y ~ x + x^2. Both are fitted to the log-transformed data.

Is there any specific reason that you tried the quadratic function?

And, can you also show the R^2 and coefficient?

murrayds commented 5 years ago

Here it is again. The relationship looked somewhat quadratic to me, but it really isn't, especially after looking at the R^2. Below I plot just the linear model and show the R^2 on the plot.

image

jisungyoon commented 5 years ago

Amazing!

I checked several empirical studies with the gravity model, and R^2 = 0.41 is pretty high, I guess. Also, it seems like the distance-decay function has an exponential form.

murrayds commented 5 years ago

Also, it seems like the distance-decay function has an exponential form.

Can you elaborate on this? Is the exponential form of the distance-decay function (y ~ e^(-r)) equivalent to the linear model on the log-transformed data (log(y) ~ x, as in the previous plot)?

I tried fitting a non-linear decay function to the data (included here), and while I get something that looks sort of like a fit, it is difficult to evaluate. The linear model gives us an R^2, which makes things a little more clear.

image

jisungyoon commented 5 years ago

Same. If the data has a linear relationship between log-transformed y and linear-scale x, it has an exponential form: log(y) = ax + b, so y = e^(ax + b).

The exponential gravity model is simply F12 ~ P1 P2 e^(-r). In our case, log(F12 / (P1 * P2)) and the linear cosine similarity have a linear relationship, so it might fit the exponential gravity model, I think.
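One way to tell the two decay forms apart, sketched under the assumption that we compare straight-line fits in the transformed spaces: log(y) against r for the exponential form, and log(y) against log(r) for the power law. The toy data is generated from an exact exponential and `r_squared` is an illustrative helper.

```python
import math

def r_squared(xs, ys):
    """R^2 of the ordinary least-squares line ys ~ xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Toy data (hypothetical): gravity decays exactly as e^(-2r).
r = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
gravity = [math.exp(-2.0 * x) for x in r]
log_gravity = [math.log(g) for g in gravity]

r2_exponential = r_squared(r, log_gravity)                       # log(y) vs r
r2_power_law = r_squared([math.log(x) for x in r], log_gravity)  # log(y) vs log(r)
print(f"exponential: {r2_exponential:.4f}, power law: {r2_power_law:.4f}")
```

On real data neither fit will be perfect, so the comparison is only suggestive, but a clearly higher R^2 for log(y) against the untransformed distance is consistent with the exponential form.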

murrayds commented 5 years ago

Updated fit with cleaned data and updated geographic coordinates. The results didn't change too much, which is good! Embedding distance still seems to explain actual flow better than geographic distance.

image

jisungyoon commented 5 years ago

Another idea I have in mind is finding strong relationships between pairs of institutions. If we treat the expected flow from the gravity model as a null model, we can identify strong interactions with actual_flow / expected_flow. Can we try this idea?

murrayds commented 5 years ago

I like this idea, a way of identifying institutions that are more strongly related than we would expect. I will work up a visualization of this when I get back from Michigan.

Also, we can extend this to different scales, for groups of institutions, comparing the distribution of expected flows vs. the distribution of actual flows between two cities/regions/countries. For example, maybe Boston and Seoul have more-than-expected flow, whereas Boston and Beijing do not.
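A sketch of the actual/expected ratio, assuming the null model is an exponential gravity fit of the form log(F12 / (P1 * P2)) = intercept + slope * distance. The organization labels, populations, flows, and coefficients below are all made up for illustration.

```python
import math

def expected_flow(p1, p2, distance, intercept, slope):
    """Expected flow under a fitted exponential gravity model:
    log(F12 / (P1 * P2)) = intercept + slope * distance."""
    return p1 * p2 * math.exp(intercept + slope * distance)

def flow_ratios(records, intercept, slope):
    """Return ((org_a, org_b), actual / expected) pairs, strongest first."""
    ratios = {
        (a, b): flow / expected_flow(p1, p2, d, intercept, slope)
        for a, b, p1, p2, d, flow in records
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Toy records (hypothetical): (org_a, org_b, P1, P2, embedding distance, actual flow).
records = [
    ("A", "B", 100, 200, 0.5, 30),
    ("A", "C", 100, 50, 0.5, 3),
    ("B", "C", 200, 50, 1.0, 5),
]
ranked = flow_ratios(records, intercept=-6.0, slope=-2.0)
for pair, ratio in ranked:
    print(pair, round(ratio, 2))
```

Ratios well above 1 flag pairs with more flow than the null model expects. The same function works at the city, region, or country level if the records are aggregated first.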

yy commented 4 years ago

Another way to think about it is just building an accurate mobility prediction model. How well can we predict if we include the geography, embedding, and other information?

jisungyoon commented 4 years ago

Yeah, it could be a link prediction problem, I will think about this problem. It might be a temporal link prediction problem in an evolving network.

murrayds commented 4 years ago

Would it be simpler to model it as a linear regression, using the actual flow as the response variable, and geographic_distance + embedding_distance (and other variables) as predictors?

Most pairs of organizations have at least one person connecting them, so predicting the expected flow between two institutions seems more relevant.

yy commented 4 years ago

Probably not vanilla linear regression, because we want to inform the model with the functional form (the gravity law) that we already have clues about.

jisungyoon commented 4 years ago

I think the main reason the gravity model with embedding distance fits well is that embedding distance captures international flows well. So how about looking into the correlation between geographical distance and embedding distance first?
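A quick sketch of that correlation check, assuming a rank correlation (Spearman) is used because the two distances need not be linearly related. This simple version assumes no tied values, and the toy distances are made up.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Toy data (hypothetical): geographic distance in km and embedding distance per pair.
geo_km = [10.0, 100.0, 1000.0, 5000.0]
emb_dist = [0.2, 0.5, 0.4, 0.9]
rho = spearman(geo_km, emb_dist)
print(f"Spearman rho: {rho:.2f}")
```

Real data will contain ties, so a library routine that handles them would be the safer choice in practice.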

yy commented 4 years ago

I feel like all of these can be the next step after establishing the gravity law (or any other law).

murrayds commented 4 years ago

Made a little progress on the plot with binned cosine similarity. Here I used 20 bins, and plot the mean and 99% confidence intervals for the logged gravity value within each bin.

image

jisungyoon commented 4 years ago

Another way to think about it is just building an accurate mobility prediction model. How well can we predict if we include the geography, embedding, and other information?

There are a few studies on mobility prediction in traffic engineering. They usually use one gravity term and a meta-info term in the model, like this:

image

For an accurate mobility prediction model, we can use two gravity terms and meta-info terms. Does that make sense?

murrayds commented 4 years ago

Could you elaborate on what a meta-info term is?

jisungyoon commented 4 years ago

It may depend on the dataset. In our case, we could use a dummy variable: if two institutes are in countries that use the same language, set x = 1; otherwise x = 0. It can be any interaction term between two institutes.
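To make the idea concrete, here is one possible shape for such a model. The multiplicative log-linear form, the choice of a power-law decay for geography and an exponential decay for embedding distance, and every coefficient value are assumptions for illustration only; the formula in the screenshot may differ.

```python
import math

def predict_flow(p1, p2, geo_distance, emb_distance, same_language,
                 alpha, beta, gamma, log_scale=0.0):
    """Predicted flow under an assumed log-linear gravity model:

    log F12 = log_scale + log P1 + log P2
              - alpha * log(geo_distance)   (power-law decay in geography)
              - beta * emb_distance         (exponential decay in embedding space)
              + gamma * same_language       (meta-info dummy, 1 or 0)
    """
    log_flow = (log_scale + math.log(p1) + math.log(p2)
                - alpha * math.log(geo_distance)
                - beta * emb_distance
                + gamma * same_language)
    return math.exp(log_flow)

# Hypothetical coefficients, chosen only to show the effect of the dummy variable.
params = dict(alpha=1.0, beta=2.0, gamma=0.5, log_scale=-8.0)
shared = predict_flow(500, 300, 1200.0, 0.4, same_language=1, **params)
different = predict_flow(500, 300, 1200.0, 0.4, same_language=0, **params)
print(f"same-language multiplier: {shared / different:.3f}")
```

In this form the dummy acts as a constant multiplier of e^gamma on the predicted flow, and the coefficients could be estimated by regressing log flow on the log-transformed terms.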

murrayds commented 4 years ago

That sounds interesting, especially in light of these plots. The embedding space seems to perform better at representing actual flows than does geographic distance not only globally, but also within countries and regions.

image

The plot below includes only pairs of organizations that are within the same country.

image

And within the same region

image

It also performs well at within-city mobility, but the quality of geographic distance is quite poor at that level.

We can also look at pairs of organizations that are in -different- countries, where geographic distance seems to completely fail.

image

And different regions

image

And different cities

image

So, in sum, the embedding distance is fairly robust at multiple scales, seemingly more so than geographic distances.

jisungyoon commented 4 years ago

Amazing work! Especially at the international scale, it seems like embedding distance works very well (almost 4.8 times)!!

murrayds commented 4 years ago

Here is an updated version of the gravity vs. distance plot—thoughts?

image

yy commented 4 years ago

pretty cool

murrayds commented 4 years ago

How about something like this? Smaller plot area, bigger font, no hex boundaries. LaTeX font is a little tricky with ggplot, so while I can't get the exact same natural font, I can italicize the equation. If need be, we can edit the font in Illustrator later.

image