murrayds / sci-mobility-emb

Embedding of scientific mobility across institutions, cities, regions, and countries

Implement fractional and full org flow counts #28

Closed murrayds closed 4 years ago

murrayds commented 4 years ago

For each pair of consecutive mobility events, define all combinations of organizations from time period one (t1) to time period two (t2). In a full counting scheme, every combination receives the same weight. In a fractional scheme, each combination is weighted by its share of the total number of combinations.

For example,

t1(A) -> t2(B)

t1(A) -> t2(A, B, C) gives

t1(A, B, C) -> t2(D) gives

t1(A, B) -> t2(B, C, D)

These counts will be used to calculate organization flows for the gravity model, which will then be compared against distances in the embedding space.
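A minimal Python sketch of the two counting schemes described above. The function name `org_flow_counts` and its arguments are hypothetical, and keeping self-pairs such as (A, A) is an assumption rather than something settled in this issue.

```python
from collections import Counter
from itertools import product


def org_flow_counts(t1_orgs, t2_orgs, fractional=False):
    """Weight every (origin, destination) combination between two periods.

    Full counting gives each combination weight 1. Fractional counting
    gives each combination 1 / (number of combinations), so the weights
    for one pair of consecutive periods sum to 1.
    """
    # Assumption: duplicates within a period are collapsed, and
    # self-pairs such as (A, A) are kept.
    pairs = list(product(sorted(set(t1_orgs)), sorted(set(t2_orgs))))
    weight = 1.0 / len(pairs) if fractional and pairs else 1.0
    counts = Counter()
    for pair in pairs:
        counts[pair] += weight
    return counts


full = org_flow_counts(["A", "B"], ["B", "C", "D"])
frac = org_flow_counts(["A", "B"], ["B", "C", "D"], fractional=True)
```

For t1(A, B) -> t2(B, C, D) this produces six combinations, each weighted 1 under full counting or 1/6 under fractional counting.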

murrayds commented 4 years ago

Thinking about this some more, I am having some doubts. Imagine the following trajectory:

t1(A) -> t2(B) -> t3(A) -> t4(B)

There is no way of knowing whether the author left from A to B at t2 or at t4, or whether they just maintained two affiliations throughout, but just didn't publish much. Would we include two instances of traj(A, B) and traj(B, A) when counting flows?

Another one:

t1(A, B, C, D) -> t2(A)

Here, would we include trajectories traj(B, A), traj(C, A), traj(D, A), even though A was in t1?

Another edge case:

t1(A, B) -> t2(A) -> t3(B)...

Under a directional model, would we include traj(A, B) even though they appeared in the same (first) time period?

I am uncomfortable with the number of choices we might have to make in order to impose directionality on the data. Maybe the simple co-occurrence model is sufficient? Or perhaps we could modify the model to make more sense given the nature of the data? Or do you all have ideas for adding directionality in a way that is theoretically sound?

murrayds commented 4 years ago

@yy @jisungyoon, interested in your thoughts on this

jisungyoon commented 4 years ago

I have also been thinking about this issue. It is a very tricky problem because:

  1. People can be affiliated with multiple institutes at once. (This does not happen in real human mobility.)
  2. Sometimes, they do not list all of their affiliations. (Maybe this is related to funding issues, etc.)

Also, I asked my friend about the gravity model. In general, people use directional flows. This makes it difficult to apply the original gravity model directly. After playing around with the data, I also realized that the approach suggested before (fractional counting) is not a good method, because word2vec does not work like that.

I think co-occurrence is good enough, but we need to specify how we define a flow between two institutes in the paper.

yy commented 4 years ago

Is it crucial to think about directionality? Given that we currently don't have a good idea/understanding of the directional embedding, I don't think it's worthwhile (atm) to dig into this. The simplest approach is just setting the time window in terms of # of papers, and then using all pairs of affiliations to run the skip-gram model, I think?
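A rough sketch of that pair-generation step, assuming each author's record is a chronologically ordered list of papers, each given as a list of affiliations. `cooccurrence_pairs` and `window` are hypothetical names, and counting a pair once per overlapping window is an assumption. Feeding the pairs to a skip-gram implementation is not shown.

```python
from itertools import combinations


def cooccurrence_pairs(papers, window=2):
    """Return unordered affiliation pairs co-occurring within `window` papers.

    `papers` is one author's chronologically ordered list of papers,
    each paper being a list of affiliation labels.
    """
    pairs = []
    # Slide a window of `window` consecutive papers over the record.
    for start in range(max(1, len(papers) - window + 1)):
        orgs = set()
        for paper in papers[start:start + window]:
            orgs.update(paper)
        # Sorting makes each pair order-free, so (A, B) and (B, A) coincide.
        pairs.extend(combinations(sorted(orgs), 2))
    return pairs


# The ambiguous trajectory t1(A) -> t2(B) -> t3(A) -> t4(B) from above.
pairs = cooccurrence_pairs([["A"], ["B"], ["A"], ["B"]], window=2)
```

With this undirected formulation, the A/B back-and-forth trajectory simply yields the pair (A, B) three times, and no decision about the direction of the move is needed.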

jisungyoon commented 4 years ago

> Is it crucial to think about directionality? Given that we currently don't have a good idea/understanding of the directional embedding, I don't think it's worthwhile (atm) to dig into this. The simplest approach is just setting the time window in terms of # of papers, and then using all pairs of affiliations to run the skip-gram model, I think?

Do you mean setting the window size to the maximum sentence length?

jisungyoon commented 4 years ago

I tried to define a direction because of another mobility model: the radiation model is asymmetric in its flows. Sorry, let's focus on the gravity model for now.

yy commented 4 years ago

> > Is it crucial to think about directionality? Given that we currently don't have a good idea/understanding of the directional embedding, I don't think it's worthwhile (atm) to dig into this. The simplest approach is just setting the time window in terms of # of papers, and then using all pairs of affiliations to run the skip-gram model, I think?
>
> Do you mean setting the window size to the maximum sentence length?

Nope.

yy commented 4 years ago

> I tried to define a direction because of another mobility model: the radiation model is asymmetric in its flows. Sorry, let's focus on the gravity model for now.

But even if you create directional trajectories, it's totally unclear how you get to the asymmetric distance?

jisungyoon commented 4 years ago

> > I tried to define a direction because of another mobility model: the radiation model is asymmetric in its flows. Sorry, let's focus on the gravity model for now.
>
> But even if you create directional trajectories, it's totally unclear how you get to the asymmetric distance?

In the radiation model, we don't need an asymmetric distance. Asymmetric flows come from the topology of the neighborhood.

murrayds commented 4 years ago

> > I tried to define a direction because of another mobility model: the radiation model is asymmetric in its flows. Sorry, let's focus on the gravity model for now.
>
> But even if you create directional trajectories, it's totally unclear how you get to the asymmetric distance?

I think in this case, we are talking only about calculating flows between organizations, i.e., Fij / (Pi * Pj). Calculating embedding distance with symmetric trajectories will remain the same. However, we were concerned because the traditional gravity and radiation models assume a directional flow (i.e., a person moves -to- Boston -from- Bloomington).
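As a small illustration of that normalization, here is a hedged Python sketch. The name `normalized_flows`, the meaning of P as an organization "size", and the numbers are all hypothetical and only there to show the arithmetic. Note the parentheses: written literally as `F / P1 * P2`, Python would evaluate `(F / P1) * P2`.

```python
def normalized_flows(flows, sizes):
    """Divide each pairwise flow F_ij by the product of the two org sizes."""
    return {
        (i, j): f_ij / (sizes[i] * sizes[j])
        for (i, j), f_ij in flows.items()
    }


# Made-up co-occurrence counts and organization sizes, for illustration only.
flows = {("A", "B"): 6, ("A", "C"): 2}
sizes = {"A": 3, "B": 4, "C": 2}
result = normalized_flows(flows, sizes)
```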

I think focusing on co-occurrence is fine for the gravity model. While the data doesn't fit the standard format for the model, it still tells us something about co-affiliations within a time period.

I am not sure how this would impact the radiation model, as I haven't spent as much time looking into it, so Jisung is right that we should focus on the gravity model for the time being.

murrayds commented 4 years ago

This direction is no longer being pursued, so the issue will be closed.