ORCID Validation - Githubissues

murrayds commented 4 years ago

The ORCID data has been validated at the following path: /Users/dakotamurray/Dropbox/SME-dropbox/data/raw/2018_ORCID_DATA.txt

Data Overview

The data looks something like this:

orcid	role_title	start_date	org	city	region	country
0000-0001-5000-0031	Senior Researcher	2016	The 3.0 Co., Ltd	Seoul	NULL	KR
0000-0001-5000-0031	Research Associate	2012	Seoul National University Research Institute of Basic Sciences	Seoul	NULL	KR

The data consists of 2,038,684 records covering the stated affiliations of 1,183,743 unique ORCID iDs created as of some point in 2018. Of these, 374,608 individuals have at least two affiliations listed, though many of these will be multiple positions at the same organization (e.g., Instructor -> Professor) or positions in different departments.
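Counts like these could be reproduced with a short script. The following is only a sketch: it assumes the file is tab-delimited with the header shown above, and it runs on the two sample rows rather than the real file. The function name `summarize` is made up for illustration.

```python
import csv
import io
from collections import Counter

# Two sample rows copied from the excerpt above. The real file is assumed
# (not confirmed) to be tab-delimited with the same header.
SAMPLE = (
    "orcid\trole_title\tstart_date\torg\tcity\tregion\tcountry\n"
    "0000-0001-5000-0031\tSenior Researcher\t2016\tThe 3.0 Co., Ltd\tSeoul\tNULL\tKR\n"
    "0000-0001-5000-0031\tResearch Associate\t2012\t"
    "Seoul National University Research Institute of Basic Sciences\tSeoul\tNULL\tKR\n"
)

def summarize(handle):
    """Count records, unique ORCID iDs, and iDs with at least two affiliations."""
    # QUOTE_NONE keeps stray quote characters in organization names literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    per_person = Counter(row["orcid"] for row in reader)
    return {
        "records": sum(per_person.values()),
        "unique_ids": len(per_person),
        "ids_with_2plus": sum(1 for n in per_person.values() if n >= 2),
    }

summary = summarize(io.StringIO(SAMPLE))
print(summary)
```

On the full file, the same function would be called on an open file handle instead of the in-memory sample.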

The good things about the data:

It should be more precise, since people manually enter their own affiliations
It is sequentially ordered; we can easily sort by start date
It can be easily enriched with other ORCID data, such as works
It includes information on position (Professor, Researcher, Instructor, etc.)

The downsides of this data are:

There is not nearly as much of it, compared with the many millions of identifiable mobile researchers in the wider WoS data
It is likely not representative, since ORCID iDs have still not been widely adopted
It relies on human-entered data, so it may still include errors
Similarly, it may not capture the messier aspects of mobility, e.g., people may list their primary affiliation on ORCID but not their second and third affiliations or visiting-scholar positions
There seem to be some organization-level disambiguation issues

Validation approach

We can use this information to conduct a sort of validation of our results. We cannot do so at the level of organizations, since organization names are not well disambiguated and there is no matching to the WoS dataset. However, we can compare between cities.

Limit to only cities in the U.S., as this should be some of the most detailed data in either dataset and because many U.S. universities are located in places where there is only one university anyway. There also appear to be city-name disambiguation issues in ORCID, so this limits the amount of work we have to do.

Calculate flux between cities in our own data, and also in the ORCID data. Using our data, fit a gravity model with the embedding distance and compare the predicted vs. actual flux; this will be our baseline. Then, fit a gravity model using the ORCID flux and again attempt to predict. Ideally, the predictions will be close to our baseline.
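The two steps above can be sketched roughly as follows. Several things here are assumptions rather than anything settled in this thread: a "move" is counted whenever two consecutive affiliations (ordered by start year) are in different cities, flux is undirected, a city's mass is its number of affiliation records, and the gravity model takes the power-law form T_ij = C * m_i * m_j / d_ij^alpha. All records and distances are toy values, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def city_flux(records):
    """Count undirected moves between consecutive, differing cities per person.

    `records` is an iterable of (orcid, start_year, city) tuples.
    Keys of the result are alphabetically sorted city pairs.
    """
    by_person = defaultdict(list)
    for orcid, year, city in records:
        by_person[orcid].append((year, city))
    flux = Counter()
    for stints in by_person.values():
        stints.sort()  # chronological order by start year
        for (_, a), (_, b) in zip(stints, stints[1:]):
            if a != b:
                flux[tuple(sorted((a, b)))] += 1
    return flux

def fit_gravity(flux, mass, dist):
    """Least-squares fit of log(T_ij / (m_i m_j)) = log C - alpha * log d_ij."""
    xs, ys = [], []
    for (a, b), t in flux.items():
        xs.append(math.log(dist[(a, b)]))
        ys.append(math.log(t / (mass[a] * mass[b])))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope  # (C, alpha)

def predict(c, alpha, mass, dist, pair):
    """Predicted flux for one city pair under the fitted model."""
    a, b = pair
    return c * mass[a] * mass[b] / dist[pair] ** alpha

# Toy records (hypothetical people and moves, not real ORCID data).
toy_records = [
    ("a", 2010, "Boston"), ("a", 2014, "Austin"),
    ("b", 2012, "Austin"), ("b", 2015, "Boston"),
    ("c", 2011, "Boston"), ("c", 2013, "Boston"), ("c", 2016, "Seattle"),
    ("d", 2012, "Seattle"),
]
# Toy distances standing in for embedding distances, keyed by sorted pair.
toy_dist = {("Austin", "Boston"): 0.5, ("Boston", "Seattle"): 1.0}

flux = city_flux(toy_records)
mass = Counter(city for _, _, city in toy_records)
c, alpha = fit_gravity(flux, mass, toy_dist)
for pair, observed in flux.items():
    print(pair, observed, predict(c, alpha, mass, toy_dist, pair))
```

The same `fit_gravity` call could be run once on flux from our own data and once on ORCID flux, and the two sets of predicted vs. actual values compared.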

murrayds commented 4 years ago

I have been thinking about this, and I see a clear problem.

Since there does not seem to be a matching between ORCID and WoS organizations, we cannot make predictions at the organizational level.

I originally proposed making predictions at the city-level. However, upon thinking about this, I realize that it would require either learning a new embedding or aggregating our current embedding at the level of cities.

Given this issue, I see three paths forward:

Learn a new embedding model at the level of cities
Calculate mean vectors for each city, and use proximity between mean vectors to make predictions. However, I worry that this might be too messy and not give us an adequate validation. Is this worry unfounded?
Drop the ORCID validation entirely
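For the second option, a minimal sketch of what aggregating to cities might look like is below. It assumes each organization has one embedding vector and a known city, uses an unweighted mean, and uses cosine distance as the proximity measure; none of these choices are settled here. The vectors and names are toy values.

```python
import math
from collections import defaultdict

def mean_city_vectors(org_vectors, org_city):
    """Average organization vectors within each city (unweighted)."""
    grouped = defaultdict(list)
    for org, vec in org_vectors.items():
        grouped[org_city[org]].append(vec)
    return {
        city: [sum(col) / len(vecs) for col in zip(*vecs)]
        for city, vecs in grouped.items()
    }

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Toy 2-d organization vectors (hypothetical, for illustration only).
org_vectors = {"org_a": [1.0, 0.0], "org_b": [0.0, 1.0], "org_c": [1.0, 1.0]}
org_city = {"org_a": "X", "org_b": "X", "org_c": "Y"}

city_vecs = mean_city_vectors(org_vectors, org_city)
print(city_vecs)
print(cosine_distance(city_vecs["X"], city_vecs["Y"]))
```

The toy case also hints at the messiness: city X contains two orthogonal organizations, yet its mean points in the same direction as city Y, so the averaging hides within-city differences.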

@yy thoughts?

yy commented 4 years ago

maybe put it as a backburner and focus on draft

murrayds commented 4 years ago

> maybe put it as a backburner and focus on draft

I'm happy with that—it looks like it will be a little complicated/take some work.

@jisungyoon just a quick run of the numbers: it looks like the ORCID data lists ~8,500 unique organizations for the United States. It's manageable, especially if we do some string matching, but it would still be a lot of work for potentially not much gain.
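As a rough idea of what the string matching could involve, here is a sketch using Python's standard `difflib`. The organization names are invented, the normalization rules and the 0.85 cutoff are arbitrary assumptions, and real matching would need manual review of the results.

```python
import difflib
import re

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(name.split())

def match_orgs(orcid_names, wos_names, cutoff=0.85):
    """Map each ORCID name to its closest WoS name, or None if nothing is close."""
    lookup = {normalize(n): n for n in wos_names}
    matches = {}
    for name in orcid_names:
        hits = difflib.get_close_matches(
            normalize(name), list(lookup), n=1, cutoff=cutoff
        )
        matches[name] = lookup[hits[0]] if hits else None
    return matches

# Invented names for illustration; not taken from either dataset.
orcid_names = ["Example State University, Springfield", "Univ. of Nowhere"]
wos_names = ["Example State University Springfield", "Sample Institute of Technology"]

matches = match_orgs(orcid_names, wos_names)
print(matches)
```

With ~8,500 names, unmatched entries (the `None` values) would be the ones left for manual checking.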

jisungyoon commented 4 years ago

Yeah, as yy suggested, let's focus on draft:)

murrayds commented 4 years ago

After thinking about this for some time, I think the ORCID data is not comparable to our large-scale mobility data: it has too little coverage, is not representative enough, and is in too different a format.

We can leave this one for now, and simply talk about past validations of the data.

murrayds / sci-mobility-emb

ORCID Validation #58
