murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

Handling Organizations with 100% mobility in descriptive figures #59

Closed murrayds closed 4 years ago

murrayds commented 4 years ago

There are many major organizations that we classify as having 100% mobility, even though they comprise many thousands of individuals, for example Indiana University, the University of Michigan, and UC Berkeley.

The reason is that every individual affiliated with IU Bloomington is also classified as affiliated with the IU UNIV SYSTEM. However, not every IU UNIV SYSTEM individual is classified as affiliated with IU Bloomington. For example, someone from IU Kokomo would not be marked as affiliated with IU Bloomington. Similar issues are also apparent for French and maybe Italian organizations.
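To make the inflation concrete, here is a minimal sketch on made-up records. The record layout, the `mobility_share` helper, and the definition of "mobile" as "has more than one organization label" are assumptions for illustration, not the project's actual pipeline.

```python
from collections import defaultdict

# Hypothetical affiliation records: researcher id -> set of organization labels.
affiliations = {
    "r1": {"IU Bloomington", "IU UNIV SYSTEM"},
    "r2": {"IU Bloomington", "IU UNIV SYSTEM"},
    "r3": {"IU UNIV SYSTEM"},  # e.g., someone at IU Kokomo
    "r4": {"IU UNIV SYSTEM", "Purdue University"},
}


def mobility_share(affiliations):
    """Share of each organization's researchers who carry more than one label.

    Assumption: a researcher counts as mobile when they have 2+ labels.
    """
    total = defaultdict(int)
    mobile = defaultdict(int)
    for orgs in affiliations.values():
        for org in orgs:
            total[org] += 1
            if len(orgs) > 1:
                mobile[org] += 1
    return {org: mobile[org] / total[org] for org in total}


shares = mobility_share(affiliations)
```

In this toy data, every IU Bloomington researcher also carries the system label, so IU Bloomington comes out fully mobile even though nobody changed campus.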

For the embeddings, this theoretically shouldn't cause any major issues: thanks to negative sampling, common co-occurrences will appear less often in the training set. However, it does cause confusion in the descriptive statistics because it overstates organizational mobility. Given this, I propose the following options for the descriptive results:

  1. Remove these "* UNIV SYSTEM" organizations entirely from the descriptive data, such that someone with only IU Bloomington and IU UNIV SYSTEM will not be counted as mobile. This, however, will potentially exclude those individuals affiliated with orgs like IU Kokomo. Requires some manual labor to identify the univ systems.
  2. Try to set up strict precedence, such that the appearance of IU Bloomington means removing IU UNIV SYSTEM. This seems to make the most sense, but also risks losing information (i.e., IUB -> IU Kokomo mobility). Also requires some manual labor.
  3. Remove organizations that have 100% mobility from the descriptive results and show results for all others. This will, however, remove some big organizations.
  4. Leave it as it is, and just explain why the data is like this. This is good because we show descriptive results on the same data as used in the embedding, and it requires the least work. However, it might cause some confusion for the reader.
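Option 2 could be sketched roughly as follows. The `PARENT_SYSTEM` mapping and both helper names are hypothetical, and the mapping itself is the manually curated part mentioned above.

```python
# Hypothetical, manually curated mapping: campus label -> parent system label.
PARENT_SYSTEM = {
    "IU Bloomington": "IU UNIV SYSTEM",
}


def apply_precedence(orgs, parent_system=PARENT_SYSTEM):
    """Drop a system label whenever one of its campuses is also present."""
    redundant = {parent_system[org] for org in orgs if org in parent_system}
    return set(orgs) - redundant


def is_mobile(orgs):
    """Assumption: mobile means 2+ distinct labels remain after precedence."""
    return len(apply_precedence(orgs)) > 1
```

With this rule, `{"IU Bloomington", "IU UNIV SYSTEM"}` collapses to `{"IU Bloomington"}` and is no longer counted as mobile, while a lone `{"IU UNIV SYSTEM"}` record is left untouched. The trade-off noted in option 2 shows up here too: a genuine IU Bloomington to IU Kokomo move carries the same two labels and would also be collapsed.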

@yy and @jisungyoon , thoughts?

yy commented 4 years ago

So if you're at IU Kokomo, you'll only appear as "IU UNIV system"? I think the second option makes the most sense. At a minimum, I think we should do some robustness analysis.

murrayds commented 4 years ago

> So if you're at IU Kokomo, you'll only appear as "IU UNIV system"?

Correct—smaller regional universities are just disambiguated into the "UNIV system" category.

We can try re-embedding after implementing option 2 to see if anything fundamentally changes. I doubt much will change, simply because these pairs of organizations account for a small number of total pairs.

murrayds commented 4 years ago

This issue has been resolved by pull request #60. Precedence rules are now standard in our data.