murrayds closed 4 years ago
I have been thinking about this, and I see a clear problem.
Since there does not seem to be a matching between ORCID and WoS organizations, we cannot make predictions at the organizational level.
I originally proposed making predictions at the city-level. However, upon thinking about this, I realize that it would require either learning a new embedding or aggregating our current embedding at the level of cities.
Given this issue, I see three paths forward:
@yy thoughts?
maybe put it on the back burner and focus on the draft
I'm happy with that—it looks like it will be a little complicated/take some work.
@jisungyoon just a quick run of numbers: it looks like the ORCID data has ~8,500 unique organizations listed for the United States. It's manageable, especially if we do some string-matching, but it would still be a lot of work for potentially not much gain.
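For reference, a minimal sketch of what the string-matching could look like, using only the standard library. The function names, the normalization rules, and the 0.9 cutoff are all assumptions for illustration, not something we have tested on the real ORCID or WoS names:

```python
import difflib
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def match_orgs(orcid_names, wos_names, cutoff=0.9):
    """Map each ORCID organization name to its closest WoS name, or None.

    `cutoff` is a hypothetical similarity threshold (0 to 1) that would
    need tuning against a hand-checked sample.
    """
    wos_norm = {normalize(n): n for n in wos_names}
    candidates = list(wos_norm)
    matches = {}
    for name in orcid_names:
        hits = difflib.get_close_matches(
            normalize(name), candidates, n=1, cutoff=cutoff
        )
        matches[name] = wos_norm[hits[0]] if hits else None
    return matches


# Toy names, made up for illustration only
example = match_orgs(
    ["Indiana University, Bloomington", "Some Unknown Institute"],
    ["INDIANA UNIVERSITY BLOOMINGTON", "Purdue University"],
)
```

With ~8,500 names a pairwise comparison like this should be cheap to run, but the unmatched remainder would still need manual review.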
Yeah, as yy suggested, let's focus on draft:)
After thinking about this for some time, I think the ORCID data is not comparable to our large-scale mobility data: it does not have enough coverage, is not representative enough, and is in too different a format.
We can leave this one for now and simply talk about past validations of the data.
The ORCID data is stored at the following path:
/Users/dakotamurray/Dropbox/SME-dropbox/data/raw/2018_ORCID_DATA.txt
Data Overview
The data looks something like this:
The data consists of 2,038,684 records accounting for the stated affiliations of 1,183,743 unique ORCID IDs created as of some point in 2018. Of these, 374,608 individuals have at least two affiliations listed, though many of these will be multiple positions at the same organization (e.g., Instructor -> Professor) or positions in different departments.
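A rough sketch of how counts like these could be computed. The record layout here, `(orcid_id, organization)` pairs, is an assumption; the actual file format is not shown in this thread. The last count separates people with two or more distinct organizations from people who only list several positions at one place:

```python
from collections import defaultdict


def summarize_affiliations(records):
    """Summarize an iterable of (orcid_id, organization) pairs.

    The pair layout is hypothetical; adapt the parsing to the real file.
    """
    n_records = 0
    orgs_by_id = defaultdict(list)
    for orcid_id, organization in records:
        n_records += 1
        orgs_by_id[orcid_id].append(organization)
    return {
        "n_records": n_records,
        "n_unique_ids": len(orgs_by_id),
        # at least two affiliation records, possibly at the same organization
        "n_ids_with_2plus": sum(1 for o in orgs_by_id.values() if len(o) >= 2),
        # at least two *distinct* organizations
        "n_ids_with_2plus_distinct": sum(
            1 for o in orgs_by_id.values() if len(set(o)) >= 2
        ),
    }


# Toy records, made up for illustration only
toy = [
    ("0000-0001", "Univ A"),
    ("0000-0001", "Univ B"),
    ("0000-0002", "Univ A"),
    ("0000-0003", "Univ C"),
    ("0000-0003", "Univ C"),
]
summary = summarize_affiliations(toy)
```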
The good things about the data:
The downsides of this data are:
Validation approach
We can use this information to conduct a sort of validation of our results. We cannot do so at the level of organizations, since that information is not well disambiguated and there is no matching to the WoS dataset, but we can compare between cities.
Limit to only cities in the U.S., as this should be some of the most detailed data in either dataset and because many U.S. universities exist in places where there is only one university anyway. There also appear to be city-name disambiguation issues in ORCID, so this limits the amount of work we have to do.
Calculate flux between cities in our own data and in the ORCID data. Using our data, fit a gravity model with the embedding distance and compare predicted against actual flux; this will be our baseline. Then fit a gravity model using the ORCID flux and again attempt to predict. Ideally, the predictions will be close to our baseline.
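The two steps above could be sketched roughly as follows. This is only an illustration under several assumptions: flux is counted as the number of people affiliated with both cities (undirected), the gravity model takes the power-law form T_ij = C * m_i * m_j * d_ij^b fit by least squares on the log scale, and `mass` and `dist` are hypothetical dictionaries holding city sizes and embedding distances. The actual functional form and inputs would need to match whatever we use in the main analysis.

```python
import itertools
from collections import Counter

import numpy as np


def city_flux(trajectories):
    """Count, for each unordered city pair, the people affiliated with both."""
    flux = Counter()
    for cities in trajectories:
        for a, b in itertools.combinations(sorted(set(cities)), 2):
            flux[(a, b)] += 1
    return flux


def fit_gravity(flux, mass, dist):
    """Fit log(T_ij / (m_i * m_j)) = a + b * log(d_ij) by least squares.

    Only pairs with positive flux are used, since log(0) is undefined.
    Returns (a, b), where exp(a) is the constant C and b the distance exponent.
    """
    pairs = [p for p in flux if flux[p] > 0]
    x = np.log([dist[p] for p in pairs])
    y = np.log([flux[p] / (mass[p[0]] * mass[p[1]]) for p in pairs])
    b, a = np.polyfit(x, y, 1)  # polyfit returns slope first, then intercept
    return a, b


def predict_gravity(a, b, mass, dist, pairs):
    """Predicted flux for each pair under the fitted model."""
    return {
        p: float(np.exp(a) * mass[p[0]] * mass[p[1]] * dist[p] ** b)
        for p in pairs
    }


# Toy example, made up for illustration only
flux = city_flux([["A", "B"], ["A", "B", "C"], ["A", "A"]])
mass = {"A": 10, "B": 20, "C": 30}
dist = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 4.0}
a, b = fit_gravity(flux, mass, dist)
predicted = predict_gravity(a, b, mass, dist, list(flux))
```

Running the same fit once on our flux and once on the ORCID flux, then comparing predicted against actual values (for example, by correlation on the log scale), would give the baseline and the validation numbers.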