Clustering figure - Githubissues

murrayds commented 4 years ago

In response to a discussion between and @jisungyoon , we want to include an example of how our proximity measure can be used for clustering.

I propose a figure with three panels:

Clustering of global countries
Clustering of regions within one country (US?)
Clustering of Organizations within one state (Boston, California, or New York?, maybe all three, with only one in main text)

This will allow us to demonstrate that a) clustering reveals structure at multiple scales, while b) building on our descriptive findings.

Each of these three panels might include 1) a proximity matrix, and 2) a dendrogram showing the hierarchical clustering.
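For reference, a minimal sketch of how a proximity matrix can be turned into the merge order that a dendrogram draws. It uses only the standard library and average linkage, which is an assumption chosen for brevity; the figures later in this thread appear to use Ward linkage, and in practice a library routine such as `scipy.cluster.hierarchy.linkage` would be the more likely tool. The names `average_linkage`, `names`, and `similarity` are hypothetical, and the values are made up.

```python
def average_linkage(dist, labels):
    """Naive agglomerative clustering with average linkage.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- one name per row of dist
    Returns the merges in order as (members_a, members_b, distance) tuples.
    """
    # Each active cluster is a tuple of item indices.
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def cluster_distance(a, b):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of active clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        a, b = clusters[x], clusters[y]
        merges.append(([labels[i] for i in a], [labels[i] for i in b], d))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(a + b)
    return merges


# Toy similarity values (made up for illustration), converted to distances.
names = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
distance = [[1.0 - s for s in row] for row in similarity]
merges = average_linkage(distance, names)
```

Each entry of `merges` corresponds to one junction of the dendrogram, with the merge distance giving its height.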

This figure will appear in the main text, after the UMAP projections.

murrayds commented 4 years ago

If there is code in the repository to produce a basic figure or the underlying data, I can take lead on making it pretty and combining the sub-figures in adobe illustrator.

jisungyoon commented 4 years ago

I can take this on too. I have legacy code from the previous clustering results.

jisungyoon commented 4 years ago

Clustering of global countries (countries that have more than 25 institutes)

1-1. With USA: dendrogram_ward_25 (1)

1-2. Without USA: dendrogram_ward_25_with_out_USA

@yy @murrayds thoughts? or comment?

murrayds commented 4 years ago

These look good! I think it's worth showing both versions, with and without the US.

Also, we may want to update/simplify the labels, i.e., "Russian Federation" -> "Russia", "Korea, Republic of" -> "South Korea", "Iran, Islamic Republic of" -> "Iran" and "Taiwan, Province of China" -> "Taiwan" (This last one being a sensitive topic..)
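A minimal sketch of the relabeling, assuming the plot labels are plain strings. `SHORT_LABELS` and `simplify_label` are hypothetical names, and only the four renames listed above are included.

```python
# Hypothetical mapping from the long country names to short display labels.
# Only the four renames mentioned above are included.
SHORT_LABELS = {
    "Russian Federation": "Russia",
    "Korea, Republic of": "South Korea",
    "Iran, Islamic Republic of": "Iran",
    "Taiwan, Province of China": "Taiwan",
}


def simplify_label(name):
    """Return the short label if one is defined, otherwise the name unchanged."""
    return SHORT_LABELS.get(name, name)


labels = ["Russian Federation", "France", "Korea, Republic of"]
short = [simplify_label(n) for n in labels]
```

Names without an entry pass through untouched, so the mapping can be extended later without affecting other labels.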

How difficult is it to make a clustered similarity matrix, like this:

I think this could be good to add to the plot as well.

jisungyoon commented 4 years ago

Yeah, it is easy, and I have code from another project :) But I think it is hard to identify the clusters in the figure above. I will produce both, and let's compare.

murrayds commented 4 years ago

I was thinking that it would make sense to cluster US states as well, which would complement our other US-focused analyses. Maybe even organizations in the same state? (though labelling them would be trickier)

jisungyoon commented 4 years ago

Yeah, I will cover that.

jisungyoon commented 4 years ago

state_dendrogram_ward_25 state_level_clustering

jisungyoon commented 4 years ago

To measure the similarity of the clusters (maybe, correlation with geographical effect?) based on census division or economic division, I think we can use yy's clusim method. Does that make sense to you?

jisungyoon commented 4 years ago

MS_dendrogram_ward These are the results for MS state.

murrayds commented 4 years ago

These look good! The state-level one is really interesting, having a strong geographic component with few exceptions.

The organization-level one is more difficult to understand, just because there are so many organizations. Maybe we can filter to only universities? (org_type_code == "U" in the lookup file)

To measure the similarity of the clusters (maybe, correlation with geographical effect?) based on census division or economic division, I think we can use yy's clusim method. Does that make sense to you?

To clarify, we will compare our hierarchical clustering with the groups defined by the census and economic divisions in order to determine whether these divisions explain our clusters?

Also: in the figures shown here, is the clustering agglomerative or divisive? And what is the linkage being used? Do you think these make sense, or should we explore other clustering parameters? (maybe consistency between them could also be compared with clusim)

murrayds commented 4 years ago

Also, I'm wondering if it makes sense to visualize the dendrogram as a "fan", such that the labels are positioned in a circle like the example below. It's difficult to read the plot with so many labels, so this might help.

jisungyoon commented 4 years ago

Like the figure in the Moral Machine paper?

murrayds commented 4 years ago

Like the figure in the Moral Machine paper?

Yeah, that figure was pretty, but let's not make it a priority. I think after the meeting today, we should focus more on the clusim approach to see what explains the clustering.

jisungyoon commented 4 years ago

Like the figure in the Moral Machine paper?

Yeah, that figure was pretty, but let's not make it a priority. I think after the meeting today, we should focus more on the clusim approach to see what explains the clustering.

Yeah, and as we discussed after class, regression too :)

jisungyoon commented 4 years ago

nation_cluster any comment?

jisungyoon commented 4 years ago

Screen Shot 2020-02-06 at 12 53 35 PM Also, I added another color row for the continent. Is it too messy?

murrayds commented 4 years ago

Also, I added another color row for the continent. Is it too messy?

Maybe we keep the continent identifier only, and replace the cluster identifier with some other visual aid? Because the cluster identifier isn't really giving any additional information; it's just making it easier to read.

Is something like this possible?

jisungyoon commented 4 years ago

nation_cluster (1) I removed the row colored by cluster. Instead, I added a row for language. Languages with only one country are colored light grey.

As you can see, pairs that merge at a very early stage of the dendrogram share a language.

murrayds commented 4 years ago

Ok, I am liking this! It is interesting that there are virtually no cross-cluster languages, i.e., when countries share a language, they are always in the same cluster.

A few small changes:

Both the "North America" and "Oceania" colors are a little too similar to the "English" color.
Similarly, the "South America" color is similar to the "no shared language" color.
And the language color for Portuguese (Brazil and Portugal) is quite similar to that for Dutch (Netherlands and Belgium)
Can we increase the legend font size, and decrease the number of breaks, i.e., show only [0.3, 0.5, 0.7] in bigger text?
Does a white border around the cluster groups, rather than a black border, look any better?

Other thoughts for the future:

What should we do for countries like South Africa—many of the elite class speaks in English and are likely to be at universities, though English is a minority language.
Similarly, a big percentage of Algerians speak French, and it is also the elite language. Should France and Algeria be considered as sharing the same language?
A classification of language families (i.e., French + Spanish -> Romance language) could also be nice. Maybe people are more likely to move to linguistically-similar countries.

jisungyoon commented 4 years ago

Ok, I am liking this! It is interesting that there are virtually no cross-cluster languages, i.e., when countries share a language, they are always in the same cluster.

A few small changes:

Both the "North America" and "Oceania" colors are a little too similar to the "English" color.

Similarly, the "South America" color is similar to the "no shared language" color.

And the language color for Portuguese (Brazil and Portugal) is quite similar to that for Dutch (Netherlands and Belgium)

Can we increase the legend font size, and decrease the number of breaks, i.e., show only [0.3, 0.5, 0.7] in bigger text?

Does a white border around the cluster groups, rather than a black border, look any better?

Yeah, I will reflect these comments in the next figure :)

What should we do for countries like South Africa—many of the elite class speaks in English and are likely to be at universities, though English is a minority language.

Similarly, a big percentage of Algerians speak French, and it is also the elite language. Should France and Algeria be considered as sharing the same language?

I think that kind of situation is also interesting. Is there any quantitative evidence for situations like that? Simple statistics would also be fine.

A classification of language families (i.e., French + Spanish -> Romance language) could also be nice. Maybe people are more likely to move to linguistically-similar countries.

Yeah! I also have a language families data-set which comes from https://glottolog.org/resource/languoid/id/kore1280 , I will update the dataset later:)

murrayds commented 4 years ago

I think that kind of situation is also interesting. Is there any quantitative evidence for situations like that? Simple statistics would also be fine.

It would be a difficult thing to define. Probably comparing language demographics would be the best way, if we have that data. I.e., if 2 countries each have ~20% of one language, then they can be said to "share" that language.
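A minimal sketch of that rule, assuming per-country language shares are available as fractions of the population. `shared_languages` is a hypothetical helper, the 20% threshold is only the rough figure suggested above, and the shares are made up for two hypothetical countries.

```python
def shared_languages(shares_a, shares_b, threshold=0.20):
    """Languages spoken by at least `threshold` of the population in both countries."""
    return {
        lang
        for lang, share in shares_a.items()
        if share >= threshold and shares_b.get(lang, 0.0) >= threshold
    }


# Made-up shares for two hypothetical countries, for illustration only.
country_x = {"French": 0.95, "English": 0.05}
country_y = {"Arabic": 0.80, "French": 0.30}
common = shared_languages(country_x, country_y)
```

Raising the threshold makes the definition stricter, so the choice of cutoff would need to be reported alongside any result that depends on it.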

Yeah! I also have a language families data-set which comes from https://glottolog.org/resource/languoid/id/kore1280 , I will update the dataset later:)

So cool! We probably only need to aggregate to a major group, i.e., Korean -> Koreanic, Spanish -> Italic.

One issue I see though is that language family will tend to be correlated with geography, so maybe they won't tell us much that geography doesn't already.

jisungyoon commented 4 years ago

It would be a difficult thing to define. Probably comparing language demographics would be the best way, if we have that data. I.e., if 2 countries each have ~20% of one language, then they can be said to "share" that language.

The original data has demographic info, but it is a little incomplete.

One issue I see though is that language family will tend to be correlated with geography, so maybe they won't tell us much that geography doesn't already.

Yeah, I agree. I will upload the data after the cleaning.

jisungyoon commented 4 years ago

nation_cluster (2) I added the result with the language family (Ethnologue). It still tells us additional things (e.g., the formal-Russian clusters). Do you think we need both (language and language family)?

murrayds commented 4 years ago

Fantastic! I think that three sets of categories are too many, and I think that the language family is really interesting, more so than just same language (especially since we can throw same language into a regression model).

So: get rid of same language, but keep the language family identifiers.

jisungyoon commented 4 years ago

Screen Shot 2020-02-07 at 10 44 06 AM I measured the cluster similarity between the hierarchical clustering results and the ground truth (lang, lang_family, continent). I varied r from -5 to 20, and it shows that at the very low levels of the dendrogram, language or language family is an important factor that determines the cluster. But as you go up, the continent (maybe geography) becomes the important factor.

There is a kind of size effect in this result, but can it still tell us something? @yy @murrayds thoughts?
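To make the idea of comparing a clustering against a reference grouping concrete, here is a minimal sketch using a plain Rand index on a flat cut. This is a swapped-in stand-in, not the clusim measure used for the result above, and it has no counterpart to the r parameter. `rand_index` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two flat clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy labels (made up): a flat cut of a dendrogram vs. a reference grouping.
cut = [0, 0, 1, 1, 2]
reference = ["x", "x", "y", "y", "y"]
score = rand_index(cut, reference)
```

Repeating this for cuts at different heights, and for each reference grouping, would give a rough picture of which grouping matches the clusters at which level.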

murrayds commented 4 years ago

Amazing! Maybe this can be repeated at the US level too? We can use Census Region, Economic Region, and Organization type?

If limited to just universities, we can also include prestige, binned into maybe 4-5 groups (i.e., ranks 1-25, 26-50, 51-75, 76-100, 101-125)
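A minimal sketch of the binning, assuming the intended groups are consecutive, non-overlapping blocks of 25 ranks up to rank 125. `prestige_bin` and its parameters are hypothetical names.

```python
def prestige_bin(rank, width=25, n_bins=5):
    """Return a label such as '26-50' for a 1-based rank, or None if out of range."""
    if rank < 1 or rank > width * n_bins:
        return None
    index = (rank - 1) // width
    low = index * width + 1
    return f"{low}-{low + width - 1}"


bins = [prestige_bin(r) for r in (1, 25, 26, 100, 125, 126)]
```

Universities ranked beyond the last bin get `None` here; they could instead be collected in an "unranked" group if that reads better in the figure.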

jisungyoon commented 4 years ago

nation_cluster (3) Here is a new version of the country-level figure. Any comments?

murrayds commented 4 years ago

Nice!

My only issue is that the text colors for the 2nd cluster (Finland, Denmark, Norway) and the third cluster (Canada, Iran, etc.) are too similar. Maybe we can swap the 3rd cluster's light blue color with either the 2nd or 3rd cluster's color? (red -> purple -> blue -> violet -> ...)

jisungyoon commented 4 years ago

Here is a new version of the clustering figure. cluster_country.pdf This is the PDF version.

murrayds commented 4 years ago

Amazing! It looks great. One thing that is clearer with this color scheme is the "horizontal lines" of similarity, especially for the UK, Germany, and France.

jisungyoon commented 4 years ago

Can you elaborate more?

murrayds commented 4 years ago

Can you elaborate more?

Yeah, sorry. I just thought it was interesting that we can clearly see how France, Germany, and the UK have high similarity with most countries, just because of their mobility (reflected in the dark horizontal/vertical lines for these countries in the heatmap)

jisungyoon commented 4 years ago

Can you elaborate more?

Yeah, sorry. I just thought it was interesting that we can clearly see how France, Germany, and the UK have high similarity with most countries, just because of their mobility (reflected in the dark horizontal/vertical lines for these countries in the heatmap)

How could we do that? Change the color scheme?

murrayds commented 4 years ago

No changes necessary! I was just remarking on an interesting pattern, the one highlighted below. I am quite happy with the colors and the overall figure!

jisungyoon commented 4 years ago

No changes necessary! I was just remarking on an interesting pattern, the one highlighted below. I am quite happy with the colors and the overall figure!

Oh, I got it:)

murrayds commented 4 years ago

What are your thoughts on where we should go next with this figure?

I think it makes the most sense that we mirror the "zoom-in" from the UMAP projection, showing the heatmap and Clusim values for US states and then organizations within Massachusetts.

Can we create a similar heatmap, but clustered for US states and colored by Census region (Midwest, Northeast, etc.) & sub-region (i.e., "Great Lakes", "New England") at the level of all states, and org type/prestige at the next level?

(also ideal: if you have a flexible script for creating these at the state level, I can automate for all states).

jisungyoon commented 4 years ago

test

jisungyoon commented 4 years ago

clustering_state_heatmap_part test_2

yy commented 4 years ago

what is economic and census division? (btw typo there) Are they the correct name?

btw, x label is chopped

jisungyoon commented 4 years ago

what is economic and census division? (btw typo there) Are they the correct name?

btw, x label is chopped

We have two classifications of states.

The first classification is an economic division, I don't know where it comes from. Maybe @murrayds knows. {'Far West', 'Great Lakes', 'Mideast', 'New England', 'Plains', 'Puerto Rico', 'Rocky Mountain', 'Southeast', 'Southwest'}

The second classification is a division used by the census, and it is called the census region. {'midwest', 'northeast', 'pacific', 'south', 'west'}

jisungyoon commented 4 years ago

Actually, I think we do not need to use all states. Here is the count of institutes per state. {'Wyoming': 1, 'Vermont': 2, 'West Virginia': 2, 'South Dakota': 2, 'North Dakota': 4, 'Kansas': 4, 'Hawaii': 4, 'Delaware': 4, 'Alaska': 4, 'Idaho': 4, 'New Hampshire': 5, 'Montana': 5, 'Puerto Rico': 5, 'Mississippi': 5, 'Nevada': 5, 'Utah': 6, 'Nebraska': 6, 'Arkansas': 7, 'Iowa': 8, 'Wisconsin': 9, 'South Carolina': 9, 'Maine': 9, 'Kentucky': 10, 'New Mexico': 10, 'Oklahoma': 11, 'Rhode Island': 11, 'Minnesota': 12, 'Oregon': 12, 'Louisiana': 13, 'Washington': 14, 'Tennessee': 14, 'Alabama': 14, 'Indiana': 16, 'Connecticut': 19, 'Missouri': 20, 'Colorado': 21, 'Arizona': 21, 'New Jersey': 23, 'North Carolina': 25, 'Georgia': 27, 'Virginia': 27, 'District of Columbia': 28, 'Michigan': 28, 'Illinois': 35, 'Ohio': 35, 'Florida': 38, 'Maryland': 53, 'Pennsylvania': 57, 'Texas': 66, 'Massachusetts': 69, 'California': 115, 'New York': 120}

jisungyoon commented 4 years ago

Does excluding states with fewer than 5 institutes make sense to you? @yy @murrayds
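A minimal sketch of that filter, using a handful of the counts posted above rather than the full list. `institute_counts`, `MIN_INSTITUTES`, `kept`, and `dropped` are hypothetical names.

```python
from collections import Counter

# A few of the per-state institute counts posted above (not the full list).
institute_counts = Counter({
    "Wyoming": 1, "Vermont": 2, "Kansas": 4, "New Hampshire": 5,
    "Utah": 6, "Massachusetts": 69, "California": 115, "New York": 120,
})

MIN_INSTITUTES = 5  # threshold proposed in this thread

# Keep states with at least MIN_INSTITUTES institutes; list the rest.
kept = {s: n for s, n in institute_counts.items() if n >= MIN_INSTITUTES}
dropped = sorted(s for s in institute_counts if s not in kept)
```

States with exactly 5 institutes are kept under this reading of "fewer than 5".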

yy commented 4 years ago

Actually, before that, my question would be: is this figure essential to the paper / high priority?

jisungyoon commented 4 years ago

Actually, before that, my question would be: is this figure essential to the paper / high priority?

Our first idea was to make a 3-layer clustering figure (country, state, and organization level, similar to the umap_figure), but it turned out to be less interesting than we initially thought.

murrayds commented 4 years ago

The first classification is an economic division, I don't know where it comes from. Maybe @murrayds knows.

These are the "Bureau of Economic Analysis regions" (see wiki)

The Clusim results show basically what's expected—the larger Census regions are most important, followed by the slightly more granular economic divisions—basically, geography matters.

Maybe we include some other organization-level information to the Clusim, such as org type?

Does excluding states with fewer than 5 institutes make sense to you? @yy @murrayds

I think maybe it's fine to keep them? It shouldn't change the results too much and probably won't make the graph significantly more readable.

Actually, before that, my question would be: is this figure essential to the paper / high priority?

As Jisung said—we want to mirror the "zooming in" of the UMAP projection with a clustering of the world, the US, and then Massachusetts. Specifically, we want to demonstrate that our embedding captures characteristics at multiple scales. It also would make the narrative mesh more nicely. My ideas were like these:

We demonstrate that our embedding similarity better models mobility than does geographic distance
Embeddings capture a range of factors, exhibited descriptively by our UMAP projections
The specific factors at play—geography, language, sector—have different levels of importance, and vary across levels of analysis.
At the level of an individual country, our embedding captures latent aspects of mobility, such as hierarchies.

@yy perhaps we can schedule a quick meeting to discuss the structure of the paper, so that we make more efficient use of our time? Are you available tomorrow (Friday), or early next week?

murrayds commented 4 years ago

Now that this is in the draft, I'm calling this issue closed

murrayds / sci-mobility-emb

Clustering figure #50