Closed: murrayds closed this issue 4 years ago
If there is code in the repository to produce a basic figure or the underlying data, I can take the lead on making it pretty and combining the sub-figures in Adobe Illustrator.
I can take this on too. I have legacy code from the last clustering results.
1-2 without USA
@yy @murrayds any thoughts or comments?
These look good! I think it's worth showing both versions, with and without the US.
Also, we may want to update/simplify the labels, e.g., "Russian Federation" -> "Russia", "Korea, Republic of" -> "South Korea", "Iran, Islamic Republic of" -> "Iran" and "Taiwan, Province of China" -> "Taiwan" (this last one being a sensitive topic).
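The renaming suggested here could be applied with a small lookup table before plotting. A minimal sketch, assuming the labels are plain strings; `SHORT_NAMES` and `simplify_label` are hypothetical names, not taken from the repository:

```python
# Hypothetical helper: map long official country names to short display labels.
SHORT_NAMES = {
    "Russian Federation": "Russia",
    "Korea, Republic of": "South Korea",
    "Iran, Islamic Republic of": "Iran",
    "Taiwan, Province of China": "Taiwan",
}


def simplify_label(name: str) -> str:
    """Return the short label if one is defined, otherwise the name unchanged."""
    return SHORT_NAMES.get(name, name)


labels = ["Russian Federation", "Germany", "Korea, Republic of"]
short_labels = [simplify_label(name) for name in labels]
print(short_labels)
```

Names without an entry pass through unchanged, so the table only needs the labels we actually want to shorten.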
How difficult is it to make a clustered similarity matrix, like this:
I think this could be good to add to the plot as well.
Yeah, it is easy, and I have code from another project :) But I think it is hard to identify the clusters in the figure above. I will produce both, and let's compare.
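For reference, a clustered similarity matrix is usually just the original matrix with rows and columns reordered by the leaf order of a hierarchical clustering. A minimal sketch with NumPy and SciPy, assuming a symmetric similarity matrix with values in [0, 1] and average linkage (the thread does not say which linkage is used); `cluster_order` and the toy matrix are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform


def cluster_order(similarity: np.ndarray, method: str = "average") -> np.ndarray:
    """Return the row/column order given by hierarchical clustering."""
    # Convert similarity (1 = identical) to distance; assumes values in [0, 1].
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)
    # linkage expects a condensed distance vector, not a square matrix.
    condensed = squareform(distance, checks=False)
    return leaves_list(linkage(condensed, method=method))


# Toy symmetric similarity matrix: items 0 and 2 are close, as are 1 and 3.
sim = np.array([
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
])
order = cluster_order(sim)
clustered = sim[np.ix_(order, order)]  # reorder rows and columns together
```

Plotting `clustered` as a heatmap puts similar items next to each other, so clusters show up as blocks along the diagonal. `seaborn.clustermap` does a similar reordering and draws the dendrograms alongside the heatmap.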
I was thinking that it would make sense to cluster US states as well, which would complement our other US-focused analyses. Maybe even organizations in the same state? (though labelling them would be trickier)
Yeah, I will cover that.
[figure: state_level_clustering]
To measure the similarity of the clusters (maybe, correlation with geographical effect?) based on census division or economic division, I think we can use yy's clusim method. Does that make sense to you?
These are the results for MS state.
These look good! The state-level one is really interesting, having a strong geographic component with few exceptions.
The organization-level one is more difficult to understand, just because there are so many organizations. Maybe we can filter to only universities? (org_type_code == "U" in the lookup file)
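The university filter could look something like the sketch below. Only `org_type_code` and the value "U" come from the thread; the other column names, the "H" code, and the file contents are made up for illustration:

```python
import csv
import io

# Hypothetical lookup file contents; only org_type_code == "U" comes from the thread.
LOOKUP_CSV = """org_id,org_name,org_type_code
1,Example University,U
2,Example Hospital,H
3,Another University,U
"""


def university_ids(lookup_file) -> set:
    """Return the ids of organizations whose org_type_code is "U"."""
    reader = csv.DictReader(lookup_file)
    return {row["org_id"] for row in reader if row["org_type_code"] == "U"}


keep = university_ids(io.StringIO(LOOKUP_CSV))
print(sorted(keep))
```

The resulting id set can then be used to subset the rows and columns of the organization-level similarity matrix before clustering.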
To clarify, we will compare our hierarchical clustering with the groups defined by the census and economic divisions in order to determine whether these divisions explain our clusters?
Also: in the figures shown here, is the clustering agglomerative or divisive? And what is the linkage being used? Do you think these make sense, or should we explore other clustering parameters? (maybe consistency between them could also be compared with clusim)
Also, I'm wondering if it makes sense to visualize the dendrogram as a "fan", such that the labels are positioned in a circle like the example below. It's difficult to read the plot with so many labels, so this might help.
Like the figure in the Moral Machine paper?
Yeah, that figure was pretty, but let's not make it a priority. I think after the meeting today, we should focus more on the clusim approach to seeing what explains the clustering.
Yeah, and as we discussed after class, regression too :)
Any comments?
Also, I added another color row for the continent. Is it too messy?
Maybe we keep the continent identifier only, and replace the cluster identifier with some other visual aid? The cluster identifier isn't really giving any additional information; it's just making the plot easier to read.
Is something like this possible?
I removed the row colored by cluster. Instead, I added the language results. Languages with only one country are colored light grey.
As you can see, a pair that merges at a very early stage in the dendrogram shares a language.
Ok, I am liking this! It is interesting that there are virtually no cross-cluster languages, i.e., when countries share a language, they are always in the same cluster.
A few small changes:
- Both the "North America" and "Oceania" colors are a little too similar to the "English" color.
- Similarly, the "South America" color is similar to the "no shared language" color.
- And the language color for Portuguese (Brazil and Portugal) is quite similar to that for Dutch (Netherlands and Belgium).
- Can we increase the legend font size and decrease the number of breaks, i.e., show only [0.3, 0.5, 0.7] in bigger text?
- Does a white border around the cluster groups, rather than a black border, look any better?
Yeah, I will reflect these comments in the next figure :)
Other thoughts for the future:
- What should we do for countries like South Africa? Many of the elite class speak English and are likely to be at universities, though English is a minority language.
- Similarly, a big percentage of Algerians speak French, and it is also the elite language. Should France and Algeria be considered as sharing the same language?
I think that kind of situation is also interesting. Is there any quantitative evidence of situations like that? Even simple statistics would be fine.
- A classification of language families (i.e., French + Spanish -> Romance language) could also be nice. Maybe people are more likely to move to linguistically-similar countries.
Yeah! I also have a language-family dataset from https://glottolog.org/resource/languoid/id/kore1280; I will update the dataset later :)
I think that kind of situation is also interesting. Is there any quantitative evidence of situations like that? Even simple statistics would be fine.
It would be a difficult thing to define. Probably comparing language demographics would be the best way, if we have that data. I.e., if 2 countries each have ~20% of one language, then they can be said to "share" that language.
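The threshold rule proposed here can be sketched as follows, assuming we have per-country speaker shares as fractions of the population. The function name, the placeholder countries, and the numbers are all made up for illustration (these are not real demographic statistics); only the ~20% cutoff comes from the comment above:

```python
def shared_languages(shares_a: dict, shares_b: dict, threshold: float = 0.2) -> set:
    """Languages spoken by at least `threshold` of the population in both countries."""
    common = set(shares_a) & set(shares_b)
    return {
        lang for lang in common
        if shares_a[lang] >= threshold and shares_b[lang] >= threshold
    }


# Made-up speaker shares for two placeholder countries (not real statistics).
country_a = {"lang_x": 0.95, "lang_y": 0.10}
country_b = {"lang_x": 0.25, "lang_z": 0.70}
print(shared_languages(country_a, country_b))
```

Two countries would then count as "sharing" a language whenever the returned set is non-empty; the threshold is a free parameter that we would need to justify.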
Yeah! I also have a language-family dataset from https://glottolog.org/resource/languoid/id/kore1280; I will update the dataset later :)
So cool! We probably only need to aggregate to a major group, i.e., Korean -> Koreanic, Spanish -> Italic.
One issue I see though is that language family will tend to be correlated with geography, so maybe they won't tell us much that geography doesn't already.
It would be a difficult thing to define. Probably comparing language demographics would be the best way, if we have that data. I.e., if 2 countries each have ~20% of one language, then they can be said to "share" that language.
The original data have demographic info, but it is a little incomplete.
One issue I see though is that language family will tend to be correlated with geography, so maybe they won't tell us much that geography doesn't already.
Yeah, I agree. I will upload the data after the cleaning.
I added the result with the language family (Ethnologue). It still tells us additional things (e.g., formal-Russian clusters). Do you think we need both (language and language family)?
Fantastic! I think that three sets of categories are too many, and I think that the language family is really interesting, more so than just same language (especially since we can throw same language into a regression model).
So: get rid of same language, but keep the language family identifiers.
I measured the cluster similarity between the hierarchical clustering results and the ground truth (lang, lang_family, continent). I varied r from -5 to 20, and the results suggest that at the very low levels of the dendrogram, language family or language is an important factor in determining the clusters. But as you go up, the continent (maybe geography) becomes the important factor.
There is a kind of size effect in this result, but can it still tell us something? @yy @murrayds thoughts?
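To make the idea of comparing a clustering against a ground-truth grouping concrete, here is a much simpler stand-in: the plain Rand index between two flat labelings. This is not the clusim computation used in the thread, which works on the hierarchy directly and has the scaling parameter r; the function and the toy labels below are hypothetical:

```python
from itertools import combinations


def rand_index(labels_a: list, labels_b: list) -> float:
    """Fraction of item pairs on which two flat clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    # A pair "agrees" if both labelings put it together, or both split it.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical flat cut of the dendrogram vs. a hypothetical ground-truth grouping.
cut = [0, 0, 1, 1, 2]
truth = ["a", "a", "b", "c", "c"]
score = rand_index(cut, truth)
```

The plain Rand index is not corrected for chance, so it is also sensitive to the number and size of the groups being compared, which is worth keeping in mind given the size effect mentioned above.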
Amazing! Maybe this can be repeated at the US level too? We can use Census Region, Economic Region, and Organization type?
If limited to just universities, we can also include prestige, binned into maybe 4-5 groups (i.e., ranks 1-25, 26-50, 51-75, 76-100, 101-125).
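The prestige binning could be done with a small helper. A minimal sketch, assuming non-overlapping bins of 25 ranks up to rank 125; `prestige_bin` and its defaults are hypothetical:

```python
def prestige_bin(rank: int, width: int = 25, max_rank: int = 125):
    """Return a label such as "1-25" for a rank, or None if it is out of range."""
    if rank < 1 or rank > max_rank:
        return None
    # Ranks 1..width fall in the first bin, width+1..2*width in the second, etc.
    low = ((rank - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


for rank in (1, 25, 26, 100, 125, 126):
    print(rank, prestige_bin(rank))
```

Universities ranked beyond `max_rank` get `None` here, so they could either be dropped or grouped into a separate "unranked" category.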
Here is a new version of the country-level figure. Any comments?
Nice!
My only issue is that the text colors for the 2nd cluster (Finland, Denmark, Norway) and the third cluster (Canada, Iran, etc.) are too similar. Maybe we can swap the 3rd cluster's light blue color with either the 2nd or 3rd cluster's color? (red -> purple -> blue -> violet -> ...)
Here is a new version of the clustering figure. cluster_country.pdf (this is the PDF version)
Amazing! It looks great. One thing that is clearer with this color scheme is the "horizontal lines" of similarity, especially for the UK, Germany, and France.
Can you elaborate more?
Yeah, sorry. I just thought it was interesting that we can clearly see how France, Germany, and the UK have high similarity with most countries, just because of their mobility (reflected in the dark horizontal/vertical lines for these countries in the heatmap).
How could we do that? Change the color scheme?
No changes necessary! I was just remarking on an interesting pattern, the one highlighted below. I am quite happy with the colors and the overall figure!
Oh, I got it:)
What are your thoughts on where we should go next with this figure?
I think it makes the most sense that we mirror the "zoom-in" from the UMAP projection, showing the heatmap and Clusim values for US states and then organizations within Massachusetts.
Can we create a similar heatmap, but clustered for US states and colored by Census region (Midwest, Northeast, etc.) & sub-region (i.e., "Great Lakes", "New England") at the level of all states, and org type/prestige at the next level?
(also ideal: if you have a flexible script for creating these at the state level, I can automate for all states).
What are the economic and census divisions? (btw, there is a typo there.) Are those the correct names?
btw, the x label is cut off.
We have two classifications of states.
The first classification is an economic division; I don't know where it comes from. Maybe @murrayds knows. {'Far West', 'Great Lakes', 'Mideast', 'New England', 'Plains', 'Puerto Rico', 'Rocky Mountain', 'Southeast', 'Southwest'}
The second classification is a division used by the census, and it is called the census region. {'midwest', 'northeast', 'pacific', 'south', 'west'}
Actually, I think we do not need to use all states. Here is the count per state: {'Wyoming': 1, 'Vermont': 2, 'West Virginia': 2, 'South Dakota': 2, 'North Dakota': 4, 'Kansas': 4, 'Hawaii': 4, 'Delaware': 4, 'Alaska': 4, 'Idaho': 4, 'New Hampshire': 5, 'Montana': 5, 'Puerto Rico': 5, 'Mississippi': 5, 'Nevada': 5, 'Utah': 6, 'Nebraska': 6, 'Arkansas': 7, 'Iowa': 8, 'Wisconsin': 9, 'South Carolina': 9, 'Maine': 9, 'Kentucky': 10, 'New Mexico': 10, 'Oklahoma': 11, 'Rhode Island': 11, 'Minnesota': 12, 'Oregon': 12, 'Louisiana': 13, 'Washington': 14, 'Tennessee': 14, 'Alabama': 14, 'Indiana': 16, 'Connecticut': 19, 'Missouri': 20, 'Colorado': 21, 'Arizona': 21, 'New Jersey': 23, 'North Carolina': 25, 'Georgia': 27, 'Virginia': 27, 'District of Columbia': 28, 'Michigan': 28, 'Illinois': 35, 'Ohio': 35, 'Florida': 38, 'Maryland': 53, 'Pennsylvania': 57, 'Texas': 66, 'Massachusetts': 69, 'California': 115, 'New York': 120}
Does excluding states with fewer than 5 institutes make sense to you? @yy @murrayds
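The proposed cutoff is a one-line filter on the per-state counts. A minimal sketch using a handful of the counts listed above; `MIN_INSTITUTES`, `kept`, and `dropped` are hypothetical names:

```python
# A few of the per-state counts quoted above; the full dictionary works the same way.
state_counts = {
    "Wyoming": 1,
    "Vermont": 2,
    "Kansas": 4,
    "Montana": 5,
    "Utah": 6,
    "Massachusetts": 69,
    "New York": 120,
}

MIN_INSTITUTES = 5  # threshold proposed in the thread

# Keep states with at least MIN_INSTITUTES institutes; list the rest for reference.
kept = {state: n for state, n in state_counts.items() if n >= MIN_INSTITUTES}
dropped = sorted(set(state_counts) - set(kept))
print(dropped)
```

Printing the dropped states makes it easy to check which ones would disappear from the figure before committing to the threshold.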
Actually, before that, my question would be: is this figure essential to the paper / high priority?
Our first idea was to make a three-layer clustering figure (country, state, and organization level, similar to the umap_figure), but it is less interesting than we initially thought.
The first classification is an economic division; I don't know where it comes from. Maybe @murrayds knows.
These are the "Bureau of Economic Analysis regions" (see wiki)
The Clusim results show basically what's expected: the larger Census regions are most important, followed by the slightly more granular economic divisions. Basically, geography matters.
Maybe we include some other organization-level information to the Clusim, such as org type?
Does excluding states with fewer than 5 institutes make sense to you? @yy @murrayds
I think maybe it's fine to keep them? It shouldn't change the results too much and probably won't make the graph significantly more readable.
Actually, before that, my question would be: is this figure essential to the paper / high priority?
As Jisung said, we want to mirror the "zooming in" of the UMAP projection with a clustering of the world, the US, and then Massachusetts. Specifically, we want to demonstrate that our embedding captures characteristics at multiple scales. It also would make the narrative mesh more nicely. My ideas were like these:
@yy perhaps we can schedule a quick meeting to discuss the structure of the paper, so that we make more efficient use of our time? Are you available tomorrow (Friday), or early next week?
Now that this is in the draft, I'm calling this issue closed
In response to a discussion between and @jisungyoon , we want to include an example of how our proximity measure can be used for clustering.
I propose a figure with three panels:
This will allow us to demonstrate that a) clustering reveals structure at multiple scales, while b) building on our descriptive findings.
Each of these three panels might include 1) a proximity matrix, and 2) a dendrogram showing the hierarchical clustering.
This figure will appear in the main text, after the UMAP projections.