murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

ORCID Validation #58

Closed murrayds closed 4 years ago

murrayds commented 4 years ago

The ORCID data has been validated at the following path: /Users/dakotamurray/Dropbox/SME-dropbox/data/raw/2018_ORCID_DATA.txt

Data Overview

The data looks something like this:

| orcid | role_title | start_date | org | city | region | country |
| --- | --- | --- | --- | --- | --- | --- |
| 0000-0001-5000-0031 | Senior Researcher | 2016 | The 3.0 Co., Ltd | Seoul | NULL | KR |
| 0000-0001-5000-0031 | Research Associate | 2012 | Seoul National University Research Institute of Basic Sciences | Seoul | NULL | KR |

The data consists of 2,038,684 records covering the stated affiliations of 1,183,743 unique ORCID IDs created as of some point in 2018. Of these, 374,608 individuals list at least two affiliations, though many of these will be multiple positions at the same organization (e.g., Instructor -> Professor) or different departments within it.
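For reference, counts like these could be reproduced along the following lines. This is only a sketch on toy rows, assuming the file loads into a pandas DataFrame with the column names shown in the sample; the toy values and the variable names are made up for illustration. It also separates "at least two affiliation rows" from "at least two distinct organizations", since the two differ when someone holds several positions at one place.

```python
import pandas as pd

# Toy rows in the column layout of the sample above; values are illustrative only.
records = pd.DataFrame(
    {
        "orcid": ["A", "A", "B", "C", "C"],
        "org": ["Org 1", "Org 2", "Org 3", "Org 4", "Org 4"],
        "city": ["Seoul", "Seoul", "Boston", "Austin", "Austin"],
    }
)

n_records = len(records)            # total affiliation rows
n_ids = records["orcid"].nunique()  # unique ORCID IDs

# Researchers with at least two affiliation rows, regardless of organization.
rows_per_id = records.groupby("orcid").size()
n_multi = int((rows_per_id >= 2).sum())

# Stricter count: researchers listing at least two distinct organizations.
orgs_per_id = records.groupby("orcid")["org"].nunique()
n_multi_org = int((orgs_per_id >= 2).sum())

print(n_records, n_ids, n_multi, n_multi_org)
```

Organization strings are not disambiguated, so the stricter count would only be approximate on the real data.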

The good things about the data:

The downsides of this data are:

Validation approach

We can use this information to conduct a sort of validation of our results. We cannot do so at the level of organizations, since that information is not well disambiguated and there is no matching to the WoS dataset. We can, however, compare between cities.

Limit the comparison to cities in the U.S., since this should be some of the most detailed data in either dataset and because many U.S. universities are in places with only one university anyway. There also appear to be city-name disambiguation issues in ORCID, so this limits the amount of work we have to do.

Calculate flux between cities in our own data and in the ORCID data. Using our data, fit a gravity model with the embedding distance and compare predicted against actual flux; this will be our baseline. Then fit a gravity model using the ORCID flux and again compare predicted against actual. Ideally, the ORCID predictions will be close to our baseline.
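A rough sketch of these two steps, under several assumptions that are not settled in this thread: flux is counted as undirected moves between consecutive affiliations, each city has a "mass" (for example its number of researchers), and the gravity model uses a power-law decay, T_ij = C * m_i * m_j / d_ij^alpha, fitted by least squares in log space. The function names and all toy numbers are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd


def city_flux(records):
    """Count undirected moves between consecutive affiliations in different cities."""
    ordered = records.sort_values(["orcid", "start_date"])
    flux = Counter()
    for _, group in ordered.groupby("orcid"):
        cities = group["city"].tolist()
        for src, dst in zip(cities, cities[1:]):
            if src != dst:
                flux[tuple(sorted((src, dst)))] += 1
    return flux


def fit_gravity(flux, mass_i, mass_j, dist):
    """Fit log(T / (m_i * m_j)) = log(C) - alpha * log(d); return (alpha, C)."""
    y = np.log(flux / (mass_i * mass_j))
    slope, intercept = np.polyfit(np.log(dist), y, 1)
    return -slope, np.exp(intercept)


def predict_gravity(alpha, c, mass_i, mass_j, dist):
    """Predicted flux under the assumed power-law gravity model."""
    return c * mass_i * mass_j / dist ** alpha


# Toy affiliation records (illustrative only).
toy = pd.DataFrame(
    {
        "orcid": ["A", "A", "B", "B", "C", "C"],
        "start_date": [2010, 2014, 2012, 2015, 2011, 2013],
        "city": ["Boston", "Chicago", "Chicago", "Boston", "Austin", "Austin"],
    }
)
toy_flux = city_flux(toy)

# Synthetic city pairs generated from the model itself, so the fit is exact.
mass_i = np.array([100.0, 100.0, 50.0, 80.0])
mass_j = np.array([50.0, 80.0, 80.0, 20.0])
dist = np.array([1.0, 2.0, 4.0, 8.0])  # stand-in for embedding distance
flux = 0.5 * mass_i * mass_j / dist ** 2.0

alpha, c = fit_gravity(flux, mass_i, mass_j, dist)
predicted = predict_gravity(alpha, c, mass_i, mass_j, dist)

# Predicted vs. actual, compared on a log scale.
corr = np.corrcoef(np.log(flux), np.log(predicted))[0, 1]
```

The same `fit_gravity` call would be run once on our flux and once on the ORCID flux, with the two predicted-vs-actual correlations compared afterwards. City pairs with zero flux need separate handling, since the log is undefined there.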

murrayds commented 4 years ago

I have been thinking about this, and I see a clear problem.

Since there does not seem to be a matching between ORCID and WoS organizations, we cannot make predictions at the organizational level.

I originally proposed making predictions at the city level. On reflection, though, that would require either learning a new embedding or aggregating our current embedding at the level of cities.

Given this issue, I see three paths forward:

  1. Learn a new embedding model at the level of cities
  2. Calculate a mean vector for each city, and use proximity between mean vectors to make predictions. However, I worry that this might be too messy and not give us an adequate validation. Is this worry unfounded?
  3. Drop the ORCID validation entirely
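To make option 2 concrete, here is a minimal sketch of averaging organization vectors per city and measuring cosine distance between the resulting city vectors. The dictionaries `org_vectors` and `org_city`, the organization names, and the two-dimensional vectors are all hypothetical stand-ins for our embedding and its city lookup.

```python
import numpy as np


def city_mean_vectors(org_vectors, org_city):
    """Average the organization vectors that fall in each city (unweighted)."""
    grouped = {}
    for org, vec in org_vectors.items():
        grouped.setdefault(org_city[org], []).append(np.asarray(vec, dtype=float))
    return {city: np.mean(vecs, axis=0) for city, vecs in grouped.items()}


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy embedding: three organizations in two cities (illustrative only).
org_vectors = {"Univ A": [1.0, 0.0], "Univ B": [0.0, 1.0], "Univ C": [0.0, 2.0]}
org_city = {"Univ A": "Boston", "Univ B": "Boston", "Univ C": "Chicago"}

city_vecs = city_mean_vectors(org_vectors, org_city)
d = cosine_distance(city_vecs["Boston"], city_vecs["Chicago"])
```

An unweighted mean treats a small institute the same as a large university, which is one reason this could get messy; weighting by organization size would be an alternative.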

@yy thoughts?

yy commented 4 years ago

maybe put it as a backburner and focus on draft

murrayds commented 4 years ago

> maybe put it as a backburner and focus on draft

I'm happy with that—it looks like it will be a little complicated/take some work.

@jisungyoon just a quick run of the numbers: the ORCID data lists ~8,500 unique organizations for the United States. It's manageable, especially if we do some string matching, but it would still be a lot of work for potentially not much gain.
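For a sense of what the string matching could look like, here is a small sketch using `difflib` from the Python standard library. The normalization rule, the 0.85 cutoff, the function names, and the WoS-side name forms are assumptions for illustration, not something we have tested on the real data.

```python
import difflib


def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def match_orgs(orcid_names, wos_names, cutoff=0.85):
    """Map each ORCID organization string to its closest WoS name, or None."""
    wos_by_norm = {normalize(n): n for n in wos_names}
    candidates = list(wos_by_norm)
    matches = {}
    for name in orcid_names:
        hits = difflib.get_close_matches(normalize(name), candidates, n=1, cutoff=cutoff)
        matches[name] = wos_by_norm[hits[0]] if hits else None
    return matches


# Toy names (illustrative only; the WoS-side spellings are hypothetical).
orcid_names = [
    "Indiana University, Bloomington",
    "Indiana Univ Bloomington",
    "The 3.0 Co., Ltd",
]
wos_names = ["Indiana University Bloomington", "University of Michigan"]

matches = match_orgs(orcid_names, wos_names)
```

Anything left as `None`, plus a sample of the accepted matches, would still need a manual check, which is where most of the work would go.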

jisungyoon commented 4 years ago

Yeah, as yy suggested, let's focus on draft:)

murrayds commented 4 years ago

After thinking about this for some time, I think the ORCID data is not comparable to our large-scale mobility data: it has too little coverage, is not representative enough, and is in too different a format.

We can leave this one for now, and simply talk about past validations of the data.