murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

Hierarchical Clustering of country #24

Closed jisungyoon closed 4 years ago

jisungyoon commented 4 years ago

I tried a hierarchical clustering of countries. Note that this result comes from the discipline-split embedding.

The procedure is summarized as follows.

  1. Calculate a representative_vector for each country as the mean of all the institution vectors belonging to that country.
  2. Keep only countries that have more than 100 institutions.
  3. Calculate the pairwise cosine distance (1 - cosine_sim) between the representative vectors.
  4. Run hierarchical clustering on the resulting distances.
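
The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the code used for the figure: the `inst_vectors` layout (country name mapped to a list of institution vectors), the function names, and the toy data are all assumptions, and the naive single-linkage loop stands in for a library routine such as `scipy.cluster.hierarchy.linkage`.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def country_distances(inst_vectors, min_institutions=100):
    """Steps 1-3: representative vectors, size filter, pairwise distances.

    inst_vectors: dict mapping country -> list of institution vectors
    (a hypothetical layout chosen for this sketch).
    """
    reps = {
        country: mean_vector(vectors)
        for country, vectors in inst_vectors.items()
        if len(vectors) > min_institutions
    }
    return {
        frozenset((a, b)): cosine_distance(reps[a], reps[b])
        for a, b in combinations(sorted(reps), 2)
    }


def single_linkage(dist):
    """Step 4: naive single-linkage agglomeration.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    names = sorted({country for pair in dist for country in pair})
    clusters = [frozenset([name]) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist[frozenset((x, y))] for x in a for y in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        merges.append((sorted(a), sorted(b), d))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Toy data: "D" has too few institutions and is filtered out.
toy = {
    "A": [[1.0, 0.0], [1.0, 0.0]],
    "B": [[1.0, 0.1], [1.0, 0.1]],
    "C": [[0.0, 1.0], [0.0, 1.0]],
    "D": [[1.0, 0.0]],
}
distances = country_distances(toy, min_institutions=1)
merges = single_linkage(distances)
for left, right, d in merges:
    print(left, right, round(d, 4))
```

The merge list carries the same information a dendrogram is drawn from: which clusters were joined, and at what distance.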

I think it makes sense. Japan is the most isolated country, and Korea, China, and the US are in the same cluster. What do you think about this result? @yy @murrayds

dendrogram

murrayds commented 4 years ago

I love it! These groups make sense to me, at first glance, though I'd like to see it with more countries. Maybe it makes sense to lower the threshold to ~50 institutions?

jisungyoon commented 4 years ago

> I love it! These groups make sense to me, at first glance, though I'd like to see it with more countries. Maybe it makes sense to lower the threshold to ~50 institutions?

OK, I will expand the set of countries using the threshold you suggest. Also, the figure above was clustered with the single linkage method; I will try another linkage method.

jisungyoon commented 4 years ago

dendrogram_ward

jisungyoon commented 4 years ago

dendrogram_ward_50 (threshold of 50 institutions)

yy commented 4 years ago

Korea-India-Egypt? a bit weird?

jisungyoon commented 4 years ago

> Korea-India-Egypt? a bit weird?

I agree. This is the result with the Ward linkage method.

jisungyoon commented 4 years ago

I will look more deeply next week.

jisungyoon commented 4 years ago

I think it is a problem with the hierarchical clustering algorithm. This is the cosine distance list for Korea:

[('Korea, Republic of', 0.0), ('United States', 0.32237375), ('India', 0.34673756), ('Japan', 0.39938033), ('China', 0.4347207), ('Thailand', 0.4453507), ('Turkey', 0.44664097), ('Egypt', 0.4478628), ('Taiwan, Province of China', 0.45242792), ('Iran, Islamic Republic of', 0.45678532), ('Germany', 0.46245217), ('Czech Republic', 0.47405958), ('Finland', 0.47886485), ('Poland', 0.47896522), ('Hungary', 0.47987974), ('Canada', 0.49002063), ('United Kingdom', 0.49369115), ('Australia', 0.50154287), ('Greece', 0.5104309), ('France', 0.5228225), ('Spain', 0.5243138), ('Norway', 0.5276358), ('Austria', 0.5495703), ('Denmark', 0.5518034), ('Russian Federation', 0.55325913), ('Ireland', 0.5662434), ('Israel', 0.5666102), ('Portugal', 0.5671894), ('Italy', 0.57064867), ('Mexico', 0.5711175), ('Brazil', 0.5753279), ('Netherlands', 0.5785767), ('Romania', 0.57941747), ('South Africa', 0.5849732), ('Sweden', 0.59312737), ('Belgium', 0.59552336), ('Switzerland', 0.5967138)]

jisungyoon commented 4 years ago

This is the same list for China:

[('China', 0.0), ('United States', 0.3076265), ('Canada', 0.3742674), ('Australia', 0.39273417), ('Japan', 0.4018225), ('France', 0.4061414), ('Taiwan, Province of China', 0.40767324), ('United Kingdom', 0.41371167), ('Germany', 0.42083406), ('Korea, Republic of', 0.4347207), ('Thailand', 0.45633352), ('Belgium', 0.4671927), ('Norway', 0.47100806), ('Netherlands', 0.47470915), ('Italy', 0.47490293), ('Egypt', 0.47492677), ('Sweden', 0.47747213), ('Spain', 0.47868967), ('Austria', 0.48570627), ('Russian Federation', 0.48862463), ('Czech Republic', 0.500834), ('India', 0.5012982), ('Israel', 0.5022876), ('Brazil', 0.50393915), ('Denmark', 0.50441635), ('Finland', 0.50582707), ('Poland', 0.50899315), ('Iran, Islamic Republic of', 0.509736), ('Romania', 0.5151825), ('Hungary', 0.51786983), ('Switzerland', 0.5245992), ('Ireland', 0.5246782), ('Turkey', 0.52916855), ('South Africa', 0.5489218), ('Greece', 0.5578825), ('Portugal', 0.5614338), ('Mexico', 0.5638409)]

jisungyoon commented 4 years ago

I think this problem arises because so many countries are so close to the US. Is there a good method for getting robust clusters from a similarity or distance matrix? One doable approach I have in mind is to construct a network with a threshold and then run community detection. Or find a core-periphery structure? @yy
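
The thresholded-network idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the threshold of 0.33 is arbitrary, the function names are made up for this example, only pairs quoted in the two distance lists above are included, and connected components are used as the simplest possible stand-in for community detection (a real analysis would more likely use a modularity-based method).

```python
def threshold_graph(distances, threshold):
    """Build adjacency sets, linking pairs whose distance is below threshold."""
    graph = {}
    for (a, b), d in distances.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if d < threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of an undirected graph as sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Cosine distances copied from the Korea and China lists in this thread.
pairs = {
    ("Korea, Republic of", "United States"): 0.32237375,
    ("Korea, Republic of", "India"): 0.34673756,
    ("Korea, Republic of", "Japan"): 0.39938033,
    ("Korea, Republic of", "China"): 0.4347207,
    ("China", "United States"): 0.3076265,
    ("China", "Japan"): 0.4018225,
    ("China", "India"): 0.5012982,
}

graph = threshold_graph(pairs, threshold=0.33)
components = sorted(connected_components(graph), key=len, reverse=True)
for component in components:
    print(sorted(component))
```

In this small subset, Korea and China end up in the same component only because each is linked to the United States, not to each other, which is the hub effect described above.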

yy commented 4 years ago

maybe just cluster without US?

jisungyoon commented 4 years ago

An interesting fact: there are 1,180 international trajectories for India, and 686 of those contain Korea. So India actually is close to Korea, I think?

jisungyoon commented 4 years ago

This is the raw data: the most similar country for each country, formatted as source_country (target_country, cosine_distance).

Egypt ('Canada', 0.36116463)
Mexico ('Spain', 0.2994622)
Ireland ('United Kingdom', 0.26680315)
Thailand ('Japan', 0.34891373)
South Africa ('Norway', 0.37570113)
Denmark ('Norway', 0.279351)
Hungary ('Romania', 0.35187584)
Romania ('Hungary', 0.35187584)
Israel ('United States', 0.33373773)
Austria ('Germany', 0.20575869)
Finland ('Sweden', 0.29471815)
Greece ('United Kingdom', 0.30813986)
Belgium ('Netherlands', 0.2875682)
Portugal ('Spain', 0.32566804)
Switzerland ('Germany', 0.26431)
Czech Republic ('Poland', 0.37411135)
Sweden ('Norway', 0.28661156)
Taiwan, Province of China ('United States', 0.37452072)
Iran, Islamic Republic of ('Canada', 0.3869642)
Norway ('Denmark', 0.279351)
Poland ('Germany', 0.36685592)
Russian Federation ('Germany', 0.36757004)
Australia ('United Kingdom', 0.29364806)
Netherlands ('Belgium', 0.2875682)
India ('Korea, Republic of', 0.34673756)
Turkey ('Greece', 0.36510265)
Korea, Republic of ('United States', 0.32237375)
Canada ('United States', 0.30755192)
Japan ('Thailand', 0.34891373)
Brazil ('Portugal', 0.3536932)
Italy ('United Kingdom', 0.3333258)
Spain ('Mexico', 0.2994622)
Germany ('Austria', 0.20575869)
United Kingdom ('Ireland', 0.26680315)
China ('United States', 0.3076265)
France ('Belgium', 0.32789463)
United States ('Canada', 0.30755192)
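
One way to read this list is to count how often each country appears as another country's nearest neighbor, and which pairs are each other's nearest neighbor. The sketch below does that; the `nearest` mapping is transcribed from the list above (distances omitted), and the variable names are illustrative only.

```python
from collections import Counter

# Nearest neighbor of each country, transcribed from the list above.
nearest = {
    "Egypt": "Canada",
    "Mexico": "Spain",
    "Ireland": "United Kingdom",
    "Thailand": "Japan",
    "South Africa": "Norway",
    "Denmark": "Norway",
    "Hungary": "Romania",
    "Romania": "Hungary",
    "Israel": "United States",
    "Austria": "Germany",
    "Finland": "Sweden",
    "Greece": "United Kingdom",
    "Belgium": "Netherlands",
    "Portugal": "Spain",
    "Switzerland": "Germany",
    "Czech Republic": "Poland",
    "Sweden": "Norway",
    "Taiwan, Province of China": "United States",
    "Iran, Islamic Republic of": "Canada",
    "Norway": "Denmark",
    "Poland": "Germany",
    "Russian Federation": "Germany",
    "Australia": "United Kingdom",
    "Netherlands": "Belgium",
    "India": "Korea, Republic of",
    "Turkey": "Greece",
    "Korea, Republic of": "United States",
    "Canada": "United States",
    "Japan": "Thailand",
    "Brazil": "Portugal",
    "Italy": "United Kingdom",
    "Spain": "Mexico",
    "Germany": "Austria",
    "United Kingdom": "Ireland",
    "China": "United States",
    "France": "Belgium",
    "United States": "Canada",
}

# How many countries have each country as their nearest neighbor.
hub_counts = Counter(nearest.values())

# Pairs that are each other's nearest neighbor.
mutual_pairs = sorted(
    {tuple(sorted((a, b))) for a, b in nearest.items() if nearest.get(b) == a}
)

for country, count in hub_counts.most_common(5):
    print(f"{country}: nearest neighbor of {count} countries")
print("mutual nearest-neighbor pairs:", mutual_pairs)
```

In this list the United States is the nearest neighbor of five countries, more than any other country.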

jisungyoon commented 4 years ago

dendrogram_ward_50_with_out_USA (1)

I think the clusters become clearer without the USA.

jisungyoon commented 4 years ago

Figure update: dendrogram_ward_50

jisungyoon commented 4 years ago

Without USA dendrogram_ward_50_with_out_USA (3)

jisungyoon commented 4 years ago

Without USA dendrogram_ward_50_with_out_USA (3)

The clusters become clearer without the USA, I think.

murrayds commented 4 years ago

> An interesting fact: there are 1,180 international trajectories for India, and 686 of those contain Korea. So India actually is close to Korea, I think?

Could you share how this was calculated? I'm getting some different numbers.

I am finding that there are 14,827 Indian researchers who have an affiliation in at least one other country.

Of these, 1,383, or about 9%, have a Korean affiliation.

In contrast, about 35% of Indian international researchers have a US affiliation.

Looking into the data, a large proportion of the India <-> Korea flow seems to be from the major IITs and CSIR in India to major Korean universities, which is about what one would expect.
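
A count like the one above could be computed along these lines. This is a hedged sketch, not the query actually used: the `affiliations` mapping (researcher ID to the set of countries in which that researcher has held an affiliation), the function name, and the toy records are hypothetical and do not reflect the project's data schema.

```python
def international_share(affiliations, home, partner):
    """Among researchers affiliated with `home` and at least one other
    country, count how many also have an affiliation in `partner`.

    Returns (n_international, n_with_partner, share).
    """
    international = [
        countries
        for countries in affiliations.values()
        if home in countries and len(countries) > 1
    ]
    with_partner = sum(1 for countries in international if partner in countries)
    share = with_partner / len(international) if international else 0.0
    return len(international), with_partner, share


# Hypothetical toy records, for illustration only.
toy = {
    "r1": {"India", "Korea, Republic of"},
    "r2": {"India", "United States"},
    "r3": {"India"},  # no foreign affiliation, so not international
    "r4": {"India", "United States", "Korea, Republic of"},
    "r5": {"Japan", "United States"},  # never affiliated with India
}
n_intl, n_korea, share = international_share(toy, "India", "Korea, Republic of")
print(n_intl, n_korea, round(share, 2))

# The figures quoted in the comment: 1,383 of 14,827 researchers.
quoted_share = 1383 / 14827
print(round(quoted_share, 3))
```

Counting researchers rather than trajectories gives a different denominator, which may be one source of the mismatch between the two sets of numbers.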

jisungyoon commented 4 years ago

Dakota and I talked about the data issue yesterday and found that there were some errors in my dataset.

Here are the new results :)

jisungyoon commented 4 years ago

dendrogram_ward_50 (1) with USA

jisungyoon commented 4 years ago

dendrogram_ward_50_with_out_USA (4) without USA

jisungyoon commented 4 years ago

The result has changed, but I think it still makes sense.

yy commented 4 years ago

cool. I think it makes more sense now. it's super interesting to see how Israel is grouped with/without US.

murrayds commented 4 years ago

Amazing—with the new data these better fit my priors. I also love the "commonwealth" group of the UK + S. Africa + AUS + Ireland

jisungyoon commented 4 years ago

> cool. I think it makes more sense now. it's super interesting to see how Israel is grouped with/without US.

Do you mean learning the embedding without the trajectories that include the USA?

jisungyoon commented 4 years ago

Another idea I have is to show the temporal movement of clusters with an alluvial diagram: https://en.wikipedia.org/wiki/Alluvial_diagram

yy commented 4 years ago

No, I was not suggesting anything.

I'd suggest thinking about what's the scope of the paper: i.e. what is the one-sentence summary of the paper? Are we talking about multiple papers or one? What's the message of the figure and how does it serve the message of the paper?

jisungyoon commented 4 years ago

Yeah, I think we need to focus on establishing mobility itself (for the paper?). I will check the mobility rule on the discipline-split embedding, or test the radiation model on the data.

murrayds commented 4 years ago

Let's keep these results in mind for papers and presentations. But is the issue ready to close? @jisungyoon

jisungyoon commented 4 years ago

I think it is ready to close.