Closed murrayds closed 4 years ago
The conclusion might be that geography is the most important factor, but are there other factors (prestige, language, etc.) that are significant in a regression analysis?
Here is the global regression with the language family added, operationalized with the dummy variable same_family. By far, geographic distance and same_country dominate, but the others still have a small effect.
If we limit to only international mobility (i.e., only pairs of orgs in different countries), then we obtain the following results:
Note the very small R² in the international-only regression. Even so, of the variance that is explained, distance, same_continent, and then same_language account for most of it.
Can you check the variance inflation factor of each variable in the regression?
Yeah, here I am using the implementation of VIF in the car package in R.
The highest VIFs are for the global regression. same_country and same_language are a little high, which makes sense: since this regression includes all pairs, most mobility is within the same country and therefore shares the same language.
The VIFs for the remaining sets of data (only international, only USA, etc.) are all lower than for the global regression.
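For readers who want to see what a VIF measures, here is a minimal sketch in plain Python. It is not the car implementation used above. It assumes the textbook definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors plus an intercept. The helper names (`solve`, `r_squared`, `vif`) and the toy data are made up for illustration.

```python
def solve(a, b):
    """Solve the square linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in columns]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """Map each predictor name to 1 / (1 - R^2) from regressing it on the others."""
    out = {}
    for name, col in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(others, col))
    return out


# Toy pair-level data (made up): same_language mostly tracks same_country.
toy = {
    "geographic_distance": [0.1, 0.3, 0.2, 5.0, 7.5, 6.1, 9.0, 0.4],
    "same_country": [1, 1, 1, 0, 0, 0, 0, 1],
    "same_language": [1, 1, 1, 0, 1, 0, 0, 1],
}
vifs = vif(toy)
for name, value in vifs.items():
    print(f"{name}: {value:.2f}")
```

The resulting values can be compared against whichever threshold is adopted, such as the 5 or 10 discussed in this thread. The normal-equations solver is fine for a handful of predictors; a real analysis would use a numerically sturdier routine.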
Yeah, it is not higher than my expectations. It is fine I guess?
I think it's fine—we would expect these variables to be correlated no matter what, and while there appears to be controversy over what a good VIF is, <10 and <5 both seem like common thresholds.
We can just report these in supporting materials.
Yeah, in general the threshold is 10; 5 is quite conservative.
Do you have a plan for this issue @murrayds ? Otherwise, we can close this issue.
We looked into it, but I am unsure of the value, and am unsure that regressions are a good way to understand it.
I say close it and re-open should it be brought up during review.
In order to determine what factors contribute to our embedding proximity, we could run regressions using the embedding proximity between pairs of organizations as the dependent variable. The initial results are given below.
Global level
emb_proximity ~ geographic_distance + same_continent + same_country + same_language + same_S&T_capacity
Where same_* are operationalized as dummy variables, coding relationships between pairs of organizations.
We see that these together explain about 42% of the variance.
The ANOVA shows that geographic distance explains the greatest variance, followed by being in the same country.
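To make the same_* coding concrete, here is a minimal sketch of how the dummy variables could be built for every pair of organizations. The organization names, attribute values, and the `pair_dummies` helper are hypothetical and not taken from the actual pipeline.

```python
from itertools import combinations

# Hypothetical organization records; names and attribute values are made up.
orgs = {
    "org_a": {"country": "US", "continent": "North America", "language": "English"},
    "org_b": {"country": "US", "continent": "North America", "language": "English"},
    "org_c": {"country": "GB", "continent": "Europe", "language": "English"},
    "org_d": {"country": "FR", "continent": "Europe", "language": "French"},
}


def pair_dummies(orgs, attributes):
    """Return one row per unordered pair, with same_<attr> coded as 1 or 0."""
    rows = []
    for left, right in combinations(sorted(orgs), 2):
        row = {"org_1": left, "org_2": right}
        for attr in attributes:
            row[f"same_{attr}"] = int(orgs[left][attr] == orgs[right][attr])
        rows.append(row)
    return rows


rows = pair_dummies(orgs, ["continent", "country", "language"])
for row in rows:
    print(row)
```

Each row would then be joined with the pair's geographic distance and embedding proximity before fitting the regression.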
USA only, all organizations
emb_proximity ~ geographic_distance + same_state + same_census_division + same_economic_region + same_type
Where economic_region refers to collections of states smaller than regions (e.g., "Great Lakes", "Northeast", "Southwest") and type refers to the organization type (e.g., "Industry", "University", "Funding Group").
We see that, within the U.S., these factors explain less total variance.
As before, the ANOVA mostly shows that geography is important.
USA only, universities only
Finally, I limit to universities within the US, which allows rankings to be included in the regression:
emb_proximity ~ geographic_distance + same_state + same_census_division + same_economic_region + rank_diff
Where rank_diff is coded as the difference between the whole-number ranks from the Leiden Rankings.
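As a small illustration of the rank_diff coding, here is a sketch that assumes the absolute difference is used; the thread does not say whether the sign is kept. The university names and rank values are made up.

```python
# Hypothetical whole-number ranks; the university names and values are made up.
leiden_rank = {"univ_a": 3, "univ_b": 47, "univ_c": 120}


def rank_diff(org_1, org_2, ranks):
    """Absolute difference in rank (an assumption; the sign may be kept in practice)."""
    return abs(ranks[org_1] - ranks[org_2])


print(rank_diff("univ_a", "univ_c", leiden_rank))  # 117
```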
With the ANOVA, we again see that geography explains the most variance, followed by difference in rank.
@yy @jisungyoon Thoughts on these findings or how to proceed?