twitter / GraphJet

GraphJet is a real-time graph processing library.
Apache License 2.0

Jjiang/key metadata pair map #85

Closed jerryjiangabc closed 7 years ago

jerryjiangabc commented 7 years ago

Originally, SmallArrayBasedLongToDoubleMap did not contain duplicate keys, because we only needed to record the engagement types in it: if one user replied to or quoted a tweet ten times, he/she would appear in the social proof only once. Now these ten reply edges have different metadata, and GraphJet needs to return all of them as separate social proofs.

This rb changes the dedup logic in SmallArrayBasedLongToDoubleMap from checking keys to checking (key, metadata) pairs. Along with this change, I add a new public function, uniqueKeysSize(), which returns the number of unique keys in SmallArrayBasedLongToDoubleMap. This is needed because clients use uniqueKeysSize() to check the social proof threshold in the algorithms.
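A minimal sketch of the pair-based dedup described above, not the actual GraphJet code. The class name PairDedupMapSketch, the put() signature, and the use of a long for metadata are assumptions made for illustration; only uniqueKeysSize() is a name taken from this PR.

```java
import java.util.Arrays;

// Hypothetical sketch; NOT the real SmallArrayBasedLongToDoubleMap.
final class PairDedupMapSketch {
  private long[] keys = new long[4];
  private double[] values = new double[4];
  private long[] metadata = new long[4]; // assumption: metadata stored as long
  private int size = 0;       // number of stored (key, metadata) entries
  private int uniqueKeys = 0; // number of distinct keys among those entries

  /** Stores the entry unless the same (key, metadata) pair already exists. */
  boolean put(long key, double value, long meta) {
    boolean keySeen = false;
    // Linear scan is acceptable because the map is expected to stay small.
    for (int i = 0; i < size; i++) {
      if (keys[i] == key) {
        if (metadata[i] == meta) {
          return false; // exact duplicate pair: drop it
        }
        keySeen = true; // same key, different metadata: keep scanning
      }
    }
    if (size == keys.length) { // grow all three arrays together
      keys = Arrays.copyOf(keys, size * 2);
      values = Arrays.copyOf(values, size * 2);
      metadata = Arrays.copyOf(metadata, size * 2);
    }
    keys[size] = key;
    values[size] = value;
    metadata[size] = meta;
    size++;
    if (!keySeen) {
      uniqueKeys++;
    }
    return true;
  }

  int size() { return size; }

  int uniqueKeysSize() { return uniqueKeys; }

  long[] keys() { return Arrays.copyOf(keys, size); }

  long[] metadata() { return Arrays.copyOf(metadata, size); }
}
```

In this sketch, one user replying twice with different metadata produces two entries but counts as a single unique key, which is the distinction the threshold check relies on.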

In summary, algorithms query uniqueKeysSize() to check against the social proof threshold, but return the whole array of keys and metadata when hydrating the social proof response.
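The split between "threshold on unique keys" and "return every pair" could look roughly like the following. This is an illustrative sketch only: SocialProofThresholdSketch, socialProofs(), and the minUniqueUsers parameter are hypothetical names, and the real algorithms read these values from the map rather than from plain arrays.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and signatures are not from GraphJet.
final class SocialProofThresholdSketch {

  /** Counts distinct user ids; stands in for uniqueKeysSize() in this sketch. */
  static int uniqueKeyCount(long[] users) {
    Set<Long> seen = new HashSet<>();
    for (long user : users) {
      seen.add(user);
    }
    return seen.size();
  }

  /**
   * Returns every (user, metadata) pair if enough distinct users engaged,
   * otherwise an empty array. users and metadata must have equal length.
   */
  static long[][] socialProofs(long[] users, long[] metadata, int minUniqueUsers) {
    if (uniqueKeyCount(users) < minUniqueUsers) {
      return new long[0][]; // threshold is checked on unique users only
    }
    long[][] proofs = new long[users.length][];
    for (int i = 0; i < users.length; i++) {
      proofs[i] = new long[] {users[i], metadata[i]}; // keep all pairs
    }
    return proofs;
  }
}
```

With users {1, 1, 2} and a threshold of 2, the check passes on two distinct users and all three pairs are returned; with a threshold of 3, nothing is returned even though three entries exist.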

Along with the above change, we will not apply the maxNumSocialProofs cutoff inside TopSecondDegreeByCountForTweet, because the cutoff only makes sense when operating on unique keys. As a result, clients might start to see more social proofs returned from GraphJet than before, which I think is a plus.

jerryjiangabc commented 7 years ago

"Just bringing it up in case we need to update or inform the clients."

Yes, that is an excellent point. I plan to address this in the Scala service layer, before returning the response to clients.