murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

Factors contributing to proximity—Main text #62

Closed murrayds closed 4 years ago

murrayds commented 4 years ago

In order to determine what factors contribute to our embedding proximity, we could run regressions using the embedding proximity between pairs of organizations as the dependent variable. The initial results are given below.

Global level

emb_proximity ~ geographic_distance + same_continent + same_country + same_language + same_S&T_capacity

Where same_* are operationalized as dummy variables, coding relationships between pairs of organizations.
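As a rough illustration of that coding, here is a minimal Python sketch that builds the same_* dummies for every pair of organizations. The organization records, field names, and values are hypothetical and are not taken from the project's data.

```python
from itertools import combinations

# Hypothetical organization records; names, fields, and values are illustrative only.
orgs = {
    "org_a": {"country": "US", "continent": "North America", "language": "English"},
    "org_b": {"country": "CA", "continent": "North America", "language": "English"},
    "org_c": {"country": "FR", "continent": "Europe", "language": "French"},
}


def pair_dummies(a, b):
    """Return 0/1 indicators for whether two organizations share each attribute."""
    return {
        "same_country": int(a["country"] == b["country"]),
        "same_continent": int(a["continent"] == b["continent"]),
        "same_language": int(a["language"] == b["language"]),
    }


# One row of dummies per unordered pair of organizations.
rows = {
    (i, j): pair_dummies(orgs[i], orgs[j])
    for i, j in combinations(sorted(orgs), 2)
}

for pair, dummies in rows.items():
    print(pair, dummies)
```

Each row would then be joined with the pair's embedding proximity and geographic distance before fitting the regression.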

We see that these together explain about 42% of the variance.

image image

The ANOVA shows that geographic distance explains the most variance, followed by being in the same country.

USA only, all organizations

emb_proximity ~ geographic_distance + same_state + same_census_division + same_economic_region + same_type

Where economic_region refers to collections of states smaller than regions (e.g., "Great Lakes", "Northeast", "Southwest") and type refers to the organization type (e.g., "Industry", "University", "Funding Group", etc.).

We see that, within the U.S., these factors explain less total variance.

image

As before, the ANOVA mostly shows that geography is important.

image

USA only, universities only

Finally, I limit to universities within the US, which allows rankings to be included in the regression.

emb_proximity ~ geographic_distance + same_state + same_census_division + same_economic_region + rank_diff

Where rank_diff is coded as the difference between the two universities' whole-number ranks in the Leiden Ranking.
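A minimal Python sketch of that coding follows. The ranks are made-up placeholders, not actual Leiden Ranking values, and taking the absolute value (so the variable does not depend on the order of the pair) is an assumption rather than something stated above.

```python
# Hypothetical whole-number ranks; these are NOT actual Leiden Ranking values.
ranks = {"univ_a": 3, "univ_b": 10, "univ_c": 42}


def rank_diff(a, b):
    """Difference in rank between two universities.

    Using the absolute value is an assumption, made so that the result
    is the same regardless of the order in which the pair is given.
    """
    return abs(ranks[a] - ranks[b])


print(rank_diff("univ_a", "univ_b"))
print(rank_diff("univ_c", "univ_a"))
```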

image

With the ANOVA, we again see that geography explains the most variance, followed by difference in rank.

image

@yy @jisungyoon Thoughts on these findings or how to proceed?

jisungyoon commented 4 years ago

The conclusion might be that geography is the most important factor, but that multiple other factors (prestige, language, etc.) are also significant in the regression analysis?

murrayds commented 4 years ago

All mobility, with language family

Here is the global regression with the language family added, operationalized with the dummy variable same_family. Geographic distance and same_country dominate by far, but the other variables still have a small effect.

image image

Only international mobility

If we limit to only international mobility (i.e., only pairs of orgs in different countries), then we obtain the following results:

image image

Note the very small R² in the international-only regression. Of the variance that is explained, most is attributable to distance, then same_continent, and then same_language.
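The international-only restriction amounts to dropping every pair whose two organizations are in the same country before fitting. A minimal Python sketch, with hypothetical pair records and field names:

```python
# Hypothetical pair records; identifiers and field names are illustrative only.
pairs = [
    {"org_a": "org_1", "org_b": "org_2", "country_a": "US", "country_b": "US"},
    {"org_a": "org_1", "org_b": "org_3", "country_a": "US", "country_b": "FR"},
    {"org_a": "org_2", "org_b": "org_4", "country_a": "US", "country_b": "KR"},
]

# Keep only pairs that span two different countries.
international = [p for p in pairs if p["country_a"] != p["country_b"]]

print(len(pairs), "pairs in total,", len(international), "international")
```

Note that same_country is 0 for every remaining pair, so it carries no information in this subset.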

jisungyoon commented 4 years ago

Can you check the variance inflation factor of each variable in the regression?

murrayds commented 4 years ago

> Can you check the variance inflation factor of each variable in the regression?

Yeah—here I am using the implementation of VIF in the car package in R.

The highest VIFs are for the global regression. same_country and same_language are a little high, which makes sense: because this regression includes all pairs, most mobility is within the same country and therefore between organizations that share a language.

image

The VIFs for the remaining sets of data (only international, only USA, etc.) are all lower than for the global regression.
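For reference, the VIF of a single predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The Python sketch below computes this textbook definition using only the standard library. It is not the car implementation used here, and the two dummy columns are made-up toy data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(y, columns):
    """R^2 of an ordinary least squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """Map each predictor name to 1 / (1 - R_j^2), regressing it on the others."""
    out = {}
    for name, col in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(col, others))
    return out


# Toy dummy columns (made up): most same-language pairs are also same-country.
predictors = {
    "same_country": [1, 1, 1, 0, 0, 0, 1, 0],
    "same_language": [1, 1, 1, 1, 0, 0, 1, 0],
}

for name, value in vif(predictors).items():
    print(name, round(value, 3))
```

The sketch assumes no predictor is a perfect linear combination of the others; in that case R_j^2 would be 1 and the VIF undefined.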

jisungyoon commented 4 years ago

Yeah, it is not higher than I expected. It is fine, I guess?

murrayds commented 4 years ago

> Yeah, it is not higher than I expected. It is fine, I guess?

I think it's fine—we would expect these variables to be correlated no matter what, and while there appears to be controversy over what a good VIF is, <10 and <5 both seem like common thresholds.

We can just report these in supporting materials.

jisungyoon commented 4 years ago

Yeah, in general the threshold is 10; 5 is rather conservative.

jisungyoon commented 4 years ago

Do you have a plan for this issue, @murrayds? Otherwise, we can close this issue.

murrayds commented 4 years ago

> Do you have a plan for this issue, @murrayds? Otherwise, we can close this issue.

We looked into it, but I am unsure of the value, and am unsure that regressions are a good way to understand it.

I say close it and re-open should it be brought up during review.