Organizations along axis — main text figure

murrayds commented 4 years ago

As another example of how we can use the embedding, we can follow the approaches used in the SemAxis and the Journal2Vec papers. Namely, we will form an axis defined by two poles (either individual organizations or mean vectors) and identify where other organizations fall along this axis.
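A minimal sketch of the axis idea, for reference. This is not the actual SemAxis code; the function names and the toy 3-d vectors are made up for illustration. It assumes the axis is the difference between the two pole vectors (each pole being a mean vector) and that an organization's position is its cosine similarity to that axis.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def make_axis(neg_pole_vectors, pos_pole_vectors):
    """Axis = mean of the positive pole minus mean of the negative pole."""
    neg = mean_vector(neg_pole_vectors)
    pos = mean_vector(pos_pole_vectors)
    return [p - q for p, q in zip(pos, neg)]

# Hypothetical toy embeddings, not values from our data.
embeddings = {
    "org_a": [1.0, 0.0, 0.2],
    "org_b": [0.9, 0.1, 0.1],
    "org_c": [0.0, 1.0, 0.2],
    "org_d": [0.1, 0.9, 0.3],
    "org_e": [0.8, 0.3, 0.0],
}

# Negative pole: org_c and org_d. Positive pole: org_a and org_b.
axis = make_axis(
    [embeddings["org_c"], embeddings["org_d"]],
    [embeddings["org_a"], embeddings["org_b"]],
)

# Position of every organization along the axis, in [-1, 1].
scores = {name: cosine(vec, axis) for name, vec in embeddings.items()}
```

In this toy setup the organizations in the positive pole get positive scores, those in the negative pole get negative scores, and `org_e` (not part of either pole) lands on the positive side.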

In order to keep the axes intelligible, and to not spoil future papers too much, we should limit to only organizations within the United States. Possible axes might include:

- Axis of East Coast vs. West Coast, or another geographic orientation (North vs. South, Midwest vs. Southwest).
- Axis of High vs. Low Prestige. Use Harvard as one end, and choose a low-prestige organization as the other end.
  - Can repeat with many low-prestige organizations to evaluate the robustness.
- Axis of Very Urban (NYC) vs. Very Rural (Grinnell College).
- Axis of Government vs. Industry.

I imagine a figure very much like the Journal2Vec figures: a 1-dimensional line with a mark wherever a US organization falls along the axis, with some organizations labeled, and with the marks colored by a characteristic such as region (for geographic axes) or prestige.

We should be careful here as well, since our goal is to demonstrate the utility of the embedding rather than conduct a comprehensive analysis. We can save other parts for a future paper.

murrayds commented 4 years ago

I re-worked the SemAxis code slightly to work with our data, and defined several sets of axes on which to project universities within the United States. First draft plots are shown below:

In an attempt to look at "coasts", an axis is defined as the average of California orgs (left), and Massachusetts orgs (right). I highlighted orgs of one neighboring state for each, and show that Arizona tends towards California, whereas Connecticut tends towards Mass.

The circles correspond to the mean similarity of each highlighted group.

I examine prestige by looking at the mean vector of California State orgs (left) and Univ. California orgs (right). The top 10 universities in the U.S. News & World Report ranking are plotted, along with another 10 or so sampled from below rank 300, which make up the non-elite group.

Finally, an axis is defined using the sets of elite and non-elite universities mentioned above. I show two states, Indiana and Maryland, on this axis.

@yy @jisungyoon thoughts?

yy commented 4 years ago

Cool! I think elite-others is the most interesting. Can we do 2-d plot of elite - non-elite vs. west-east coast?

jisungyoon commented 4 years ago

Awesome. I am a little bit confused about the elite and non-elite groups. Do they come from statistics? And how many are there in each group? @murrayds

murrayds commented 4 years ago

> Cool! I think elite-others is the most interesting. Can we do 2-d plot of elite - non-elite vs. west-east coast?

Here is a 2d version of the plot. Instead of plotting all orgs, I decided to plot only universities for a subset of states. I also labeled a couple of major universities.

> Awesome. I am a little bit confused about the elite and non-elite groups. Do they come from statistics? And how many are there in each group? @murrayds

I drew from the US News Rankings, which are popular in the U.S. The elite universities are just the top 10 on this list. Non-elite universities were sampled in a non-systematic way, from the lowest universities that had a rank. Note that these lower-ranked universities are largely in the U.S. South and Midwest, so there might be some geographic confounding.

jisungyoon commented 4 years ago

I think geographic confounding is not a problem, because geography is surely one of the important factors that determine whether a university is elite or non-elite.

It looks cool. Can you add the universities in Cali or Massachusetts?

murrayds commented 4 years ago

> It looks cool. Can you add the universities in Cali or Massachusetts?

Here it is, with some additional universities labeled

jisungyoon commented 4 years ago

Where is University of California, Berkeley?

murrayds commented 4 years ago

With Berkeley, this time correctly labeled

yy commented 4 years ago

I think it'd be nice to label outliers (top ones, left-most, right-most, etc. maybe not too many bottom ones? 🙄) But I think the sampling method can be an issue. Better way to sample low-prestige places?

murrayds commented 4 years ago

> But I think the sampling method can be an issue. Better way to sample low-prestige places?

A more systematic way could be to use the Leiden Ranking data, which is already downloaded and has a host of indicators. We pick one indicator, rank all universities by it, and either:

- select the top n and bottom n universities,
- select the top n, and then sample from the bottom half of ranked universities, or
- select or sample from US universities in our data that do not appear in the Leiden Rankings (roughly indicating too little research output to be included).
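A rough sketch of the three options, assuming we already have a list of universities sorted best-first by the chosen indicator. The function names and the `univ_XX` placeholders are hypothetical, not from our code or data.

```python
import random

def top_and_bottom(ranked, n):
    """Option 1: the n best and the n worst universities."""
    return ranked[:n], ranked[-n:]

def top_and_sampled_bottom_half(ranked, n, seed=0):
    """Option 2: the n best, plus n sampled at random from the bottom half."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    bottom_half = ranked[len(ranked) // 2:]
    return ranked[:n], rng.sample(bottom_half, n)

def unranked(all_orgs, ranked):
    """Option 3: organizations in our data that are missing from the ranking."""
    ranked_set = set(ranked)
    return [org for org in all_orgs if org not in ranked_set]

# Placeholder ranking, best first.
ranked = [f"univ_{i:02d}" for i in range(1, 21)]
all_orgs = ranked + ["small_college_a", "small_college_b"]

top, bottom = top_and_bottom(ranked, 3)
top2, sampled = top_and_sampled_bottom_half(ranked, 3)
missing = unranked(all_orgs, ranked)
```

Note that option 1 only makes sense when `2 * n` is at most the number of ranked universities; otherwise the two groups overlap.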

> I think it'd be nice to label outliers (top ones, left-most, right-most, etc. maybe not too many bottom ones?)

Maybe something like this, for labels (we should also discuss what states to include or not include):

jisungyoon commented 4 years ago

Or can we sample the lowest-ranked universities to match the regional composition of the top 10 universities, to remove the region effect?

murrayds commented 4 years ago

> Or can we sample the lowest-ranked universities to match the regional composition of the top 10 universities, to remove the region effect?

This is a good idea; it's the closest we can get to disentangling geography. I'll try it out.

murrayds commented 4 years ago

I worked up a version using @jisungyoon's sampling method. Good news: nothing changed all that much.

"High Research Impact" unis are the top 20 universities based on their normalized citation impact (Leiden Rankings).

The 20 "Lower Research Impact" universities are sampled bottom-up from the low-ranked universities within each region (South, Midwest, Northeast, West), in the same proportion as the elite universities. For example, 6 elite universities are in the Northeast, so we sample the 6 lowest-ranked universities in that region.

Pros: This is a much more systematic and justifiable approach; the only thing we have to define ourselves is the total number of universities to sample.

Cons: Leiden Rankings are good, but reflect research output, which correlates with, but is not necessarily the same as, cultural prestige.
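The region-matched sampling can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the plots: `region_matched_bottom`, the one-letter names, and the region assignments are all made up.

```python
from collections import Counter

def region_matched_bottom(ranked, region_of, n_top):
    """Take the top n_top universities, then walk the ranking from the
    bottom up and pick the lowest-ranked universities so that each region
    contributes as many low-ranked picks as it has top universities."""
    top = ranked[:n_top]
    quota = Counter(region_of[u] for u in top)  # picks still needed per region
    top_set = set(top)
    bottom = []
    for univ in reversed(ranked):  # worst-ranked first
        region = region_of[univ]
        if univ not in top_set and quota[region] > 0:
            bottom.append(univ)
            quota[region] -= 1
    return top, bottom

# Placeholder ranking (best first) and region lookup.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
region_of = {
    "a": "northeast", "b": "northeast", "c": "west", "d": "south",
    "e": "midwest", "f": "west", "g": "northeast", "h": "northeast",
}

top, bottom = region_matched_bottom(ranked, region_of, n_top=3)
```

Here the top 3 contain two Northeast universities and one West university, so the low-ranked set is the two lowest-ranked Northeast universities plus the lowest-ranked West one. If a region has too few remaining universities, this sketch simply returns fewer picks for it.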

murrayds commented 4 years ago

And a version with centered axes—makes the lines prettier, but ends up squishing the data into a smaller space

jisungyoon commented 4 years ago

> I worked up a version using @jisungyoon's sampling method. Good news: nothing changed all that much.

Yeah, the robustness of results is quite good news! I also like Leiden's ranking, but there are so many criteria and we need to define prestige with a specific data field.

As far as I know, the US News ranking reflects a kind of common-sense notion of prestige, right?

murrayds commented 4 years ago

> I also like Leiden's ranking, but there are so many criteria and we need to define prestige with a specific data field.

Here I am using the "impact_frac_mncs" indicator, which seems to be the most reasonable. From their website:

TNCS and MNCS. The total and the average number of citations of the publications of a university, normalized for field and publication year. An MNCS value of two for instance means that the publications of a university have been cited twice above the average of their field and publication year.

> As far as I know, the US News ranking reflects a kind of common-sense notion of prestige, right?

Yeah, US News, Times Higher Ed, and Shanghai Rankings all better capture prestige, but no matter which we use, we should consider that:

- All rankings are correlated with one another
- These rankings often focus on undergrad, not necessarily grad prestige
- All rankings are bad

I am happy sticking with the Leiden Rankings. But I am also happy using the US News rankings if we think those better capture the idea of "prestige". We can match them to the Leiden Ranking data, which covers only about 200 US universities. Alternatively, we can use the Leiden Rankings and just see if a reviewer complains.

Thoughts?

jisungyoon commented 4 years ago

> These rankings often focus on undergrad, not necessarily grad prestige

Yeah, this is quite an important point. Most university rankings are highly dependent on reputation polls. We can do both to check the robustness. How different are the US News and Leiden rankings?

murrayds commented 4 years ago

Here is a quick comparison between defining the axis using Leiden Rankings (left) and Times rankings (right). There are differences, but they don't appear to be major enough to upend any findings.

I think it makes the most sense to stick with the Leiden Rankings for now, since we may draw from their many indicators later. But these same plots with the Times rankings can be included in the supplementary materials.

jisungyoon commented 4 years ago

Yeah, simply measuring the rank correlation is enough, I guess?

murrayds commented 4 years ago

> Yeah, simply measuring the rank correlation is enough, I guess?

The Spearman correlation between the Leiden Rankings (using fractional MNCS) and Times Rankings (using their ambiguous total score) is 0.87, with p-value < 2.2e-16, indicating high correlation.
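For reference, Spearman's rho is the Pearson correlation computed on ranks. A minimal stdlib-only sketch follows (tied values get their average rank; no p-value is computed). The helper names are illustrative, and this is not the code that produced the 0.87 above.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Two toy score lists that swap neighbours but agree overall.
rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

For the toy lists, the rank differences are 1, -1, 1, -1, 0, so rho = 1 - 6·4 / (5·24) = 0.8.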

yy commented 4 years ago

what about the correlation between the ranking and the cosine similarity (semaxis value) of institutions?

murrayds commented 4 years ago

> what about the correlation between the ranking and the cosine similarity (semaxis value) of institutions?

A rho of 0.8!

This is the Spearman correlation between the SemAxis prestige axis (from the top 20 elite universities to the bottom 20 geographically matched set) and the Leiden Ranking fractional MNCS, for the 123 US organizations that are scored.

We should test this for different numbers of organizations used to compute the axes (i.e., can we use only Harvard and Ashland University to re-create the ranking?), but this is promising!

This seems promising and super interesting!

yy commented 4 years ago

We can do a similar thing with Times ranking (potentially as well as the West-East axis). The outliers may be interesting (super high in US news but low in the semaxis score and vice versa).

murrayds commented 4 years ago

> We can do a similar thing with Times ranking (potentially as well as the West-East axis). The outliers may be interesting (super high in US news but low in the semaxis score and vice versa).

While I work on getting more formalized results:

The Spearman correlation between the Times org rankings and the SemAxis prestige axis (also defined with the Times ranking) is 0.82, slightly better than with the Leiden Ranking data.

Below are some initial graphs showing the correlation between ranking (x-axis) and the axis similarity (y-axis), with some outliers manually labeled.

As expected, it performs the best at the top and the bottom, with more error around the middle. But to be fair, actual university rankings are pretty bad with middle-ranking universities as well.

Universities that over-perform in our SemAxis ranking tend to have big strengths in specific areas, e.g., George Mason (policy), Georgetown (law, policy), Yeshiva, Rush, Tulane (medicine).

jisungyoon commented 4 years ago

Yeah, these are quite interesting results, because it is a 100% data-based ranking of universities. And it is a kind of result of collective intelligence :)

murrayds commented 4 years ago

This figure shows the Spearman's rho correlation between our SemAxis similarity and the ranking (Leiden and Times) of ~123 organizations, as a function of the number of orgs used to create the axes. When the number is 1, the axes are defined with only the top-ranked and bottom-ranked university; when it is 10, we use the top ten and bottom ten universities. The x-axis is in increments of 5. There are only 123 ranked universities, so the plot goes up to 60.

As shown, with even just the top and bottom-ranked universities, we can reconstruct a correlated ranking (Rho ~ 0.5). This quickly jumps once we get to the top 5.
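The procedure behind this kind of plot can be sketched as below. Everything here is a toy stand-in: the synthetic 4-d vectors, the `rho_for_n` helper, and the chosen values of n are assumptions for illustration, so the numbers it produces say nothing about our actual results. The Spearman helper uses the no-ties shortcut formula, which is fine for this toy data.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Ranks 1..n in ascending order (ties broken by position)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

N = 40  # number of toy "universities"; index 0 is the top-ranked one

def toy_vector(i):
    """Synthetic embedding: direction drifts with rank, plus bounded wiggle."""
    t = (i / (N - 1)) * (math.pi / 2)
    return [math.cos(t), math.sin(t), 0.2 * math.sin(3 * i), 0.2 * math.cos(5 * i)]

vectors = [toy_vector(i) for i in range(N)]
quality = [N - i for i in range(N)]  # higher means better ranked

def rho_for_n(n):
    """Build the axis from the top n and bottom n, then correlate the
    axis similarity of every university with its ranking."""
    pos = mean_vector(vectors[:n])
    neg = mean_vector(vectors[-n:])
    axis = [p - q for p, q in zip(pos, neg)]
    scores = [cosine(v, axis) for v in vectors]
    return spearman(scores, quality)

rho_by_n = {n: rho_for_n(n) for n in (1, 5, 10, 20)}
```

Plotting `rho_by_n` against n gives the same kind of curve as the figure; with real embeddings the pole universities could also be held out of the correlation to avoid scoring the axis on the points that define it.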

jisungyoon commented 4 years ago

Yeah, this is also a good figure for the robustness check of our OrgAxis ranking!

murrayds commented 4 years ago

I've been thinking about figures—I think we can close on these results, sending the message that:

- Our proximity picks up on complex latent structures, and
- We provide a preview of how these axes/vectors might be used to reason about other organizations, e.g., relating to the prestige of other kinds of institutions.

The first thing I came up with is a wall of panels showing the organizations for various states. Highlighted points are those in the selected state; grey points are all other universities. The bottom row of this figure shows the points for various sectors, i.e., Government organizations, Research Institutes, and Teaching Organizations. Other states/sectors can be moved to the SI.

@yy @jisungyoon thoughts?

yy commented 4 years ago

maybe the wall of panels in supp and more compact figures above go into the main?

murrayds commented 4 years ago

> maybe the wall of panels in supp and more compact figures above go into the main?

Do you think that this main figure can work as a standalone (i.e., no additional panels) in the main text? Is there anything else we should show alongside it?

murrayds commented 4 years ago

First draft of a main-draft figure. Left: same SemAxis plot as before. Right: Correlation between ranks derived from SemAxis and Times university ranking. Empty dots are those 20 top and 20 bottom organizations that were used to define the SemAxis poles.

Any changes/things to add?

yy commented 4 years ago

- I'll increase the font size (overall), especially the axis labels (California, Elite, ...). The font for the numbers is really small.
- I think the state label can be in one line potentially? And match west -> east sequence.
- The white dots on the right panel can probably be done more effectively with a greyed-out section (strip?) on the left and on the right.
- Weren't you using the Leiden ranking?
- "Rho" -> use a symbol or just say PCC?
- I think all the annotations are towards the top.
- I think we want to flip the x and y. In a sense we're using semaxis to "predict" times ranking. That means we want to put the independent var in the x axis.

murrayds commented 4 years ago

Updated

> The white dots on the right panel can probably be done more effectively with a greyed-out section (strip?) on the left and on the right.

I am using both—the grey area to show which rankings were included, but also the white dots because not all universities used to create the mean vector also fell inside the corresponding ranks.

> Weren't you using the Leiden ranking?

I sort of changed my mind, thinking that if we want to talk about "prestige", then the Times ranking makes more sense. I have the plots for both though, so we can easily swap them out or include them in the supporting materials.

yy commented 4 years ago

Hey why don’t u start sending the draft rather than individual figures?

murrayds commented 4 years ago

> Hey why don’t u start sending the draft rather than individual figures?

Good idea! I'll work up the captions in the next few days and send the whole thing out as a draft

murrayds commented 4 years ago

I'm calling this one done with pull request #65; future discussion will take place in regard to the draft.

murrayds / sci-mobility-emb

Organizations along axis — main text figure #51