twitter / the-algorithm

Source code for Twitter's Recommendation Algorithm
https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
GNU Affero General Public License v3.0

Use K-means clustering instead of what you have currently #1618

Open ecoates-bc opened 1 year ago

ecoates-bc commented 1 year ago

So I've had a look around this repo and it looks like there are a LOT of files in it.

That's not necessarily a bad thing! But maybe if you trimmed your recommender system down a little, it would reduce your servers' overhead and save you folks a lot of money.

So, I think you should just use k-means clustering instead of whatever spaghetti you have in here currently.

Why K-means clustering?

There are a lot of reasons why you'd use k-means clustering in this case.

  1. It's unsupervised. You don't have to worry about labeling training data when there are no labels!
  2. It's fast. There's a lot to be said about a mean-and-lean approach to data science, especially on the web.
  3. You can choose your own k. That gives you a lot of flexibility in tuning the algorithm to your users' needs (a rough sketch of what this could look like follows below).
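
To make the proposal concrete, here is a minimal sketch of what this could look like, assuming hypothetical per-user engagement vectors and scikit-learn's KMeans. None of the names, numbers, or helpers below come from this repo.

```python
# Hypothetical sketch: cluster users by engagement vectors, then recommend
# tweets that are popular within each user's cluster. The data, k, and
# helper names are all made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake engagement matrix: one row per user, one column per topic/signal.
user_vectors = rng.random((10_000, 64))

k = 50  # "You can choose your own k."
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(user_vectors)

def recommend_for(user_index: int, top_tweets_by_cluster: dict[int, list[int]]) -> list[int]:
    """Look up precomputed top tweet ids for the user's cluster."""
    return top_tweets_by_cluster[int(cluster_ids[user_index])]
```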

Conclusion

I think you should use K-means clustering for your project "The Algorithm." Let me know how it goes!

dclipca commented 1 year ago

It's possible, but what if what they have is better than vanilla K-means in terms of recommendation outcomes? I like the idea of choosing your own recommendation algorithm/outcome.

amicus-veritatis commented 1 year ago

Adopting pure, vanilla K-means clustering means Elon Musk's tweets are not prioritized, so I don't think they can do it.

ecoates-bc commented 1 year ago

> Adopting pure, vanilla K-means clustering means Elon Musk's tweets are not prioritized, so I don't think they can do it.

That's a good point! Maybe he could get his own little cluster.
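
Taking the joke literally for a moment, a one-off override of the cluster assignments could look like the toy snippet below; the pinned index and cluster id are invented, and nothing of the sort is claimed to exist in this repo.

```python
import numpy as np

# Toy cluster assignments for 10,000 users from a k=50 clustering
# (as in the sketch further up); everything here is hypothetical.
cluster_ids = np.zeros(10_000, dtype=int)

PINNED_USER_INDEX = 1234   # made-up row for the account in question
PINNED_CLUSTER_ID = 50     # one id past the normal 0..49 range

cluster_ids[PINNED_USER_INDEX] = PINNED_CLUSTER_ID  # "his own little cluster"
```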

stealthpaladin commented 1 year ago

> It's possible, but what if what they have is better than vanilla K-means in terms of recommendation outcomes? I like the idea of choosing your own recommendation algorithm/outcome.

Hmmm, that's very true. You would not get away with an out-of-the-box, everyday K-means implementation and still keep the same degree of on-the-fly, per-user tuning. I'm sure there could be an implementation that preserves this, maybe client-side post-processing of clusters. But yeah, either way, you definitely would not want to sacrifice tunability just to simplify the implementation.
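
One way to read "post-processing of clusters" is a shared clustering stage for candidate retrieval plus a per-user, tunable re-ranking step. The sketch below uses invented candidate fields and weights; it is not based on anything in this repo.

```python
# Hypothetical re-ranking layer on top of cluster-based retrieval.
# Per-user weights keep the "tune it on-the-fly per user" property.
from dataclasses import dataclass

@dataclass
class Candidate:
    tweet_id: int
    recency: float           # 0..1, newer is higher
    cluster_affinity: float  # 0..1, similarity to the user's cluster centroid
    author_followed: float   # 1.0 if the user follows the author, else 0.0

def rerank(candidates: list[Candidate], weights: dict[str, float]) -> list[Candidate]:
    """Order candidates by a weighted score; the weights can differ per user."""
    def score(c: Candidate) -> float:
        return (weights["recency"] * c.recency
                + weights["affinity"] * c.cluster_affinity
                + weights["followed"] * c.author_followed)
    return sorted(candidates, key=score, reverse=True)

# One user might prefer fresh tweets, another tweets from accounts they follow.
fresh_first = {"recency": 0.7, "affinity": 0.2, "followed": 0.1}
follows_first = {"recency": 0.1, "affinity": 0.2, "followed": 0.7}
```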

Cap-ten commented 1 year ago

As it is right now, there is no point in talking about K-means or anything similar. That problem comes later.

A cook can only do so much with the ingredients at hand, no matter the process the cook chooses.

A like is an input. A like plus the set of other users who liked the same thing is another. The time the like happened is another. The difference in timing between various users liking the same thing matters. The time a user spends on the thing they liked matters. This probably sounds silly, but it goes deeper. The main problem at the moment is data. There are tons of possibilities to collect data even with the current architecture. Once the data is good enough, the algorithmic side becomes more exciting (and more difficult than it is now). Without that, not even the current generation of transformers will be able to give accurate outputs (in terms of users' well-being and advertisers' willingness to invest).
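
As a rough illustration of the point about inputs, the sketch below collapses a user's like history (when each like happened, how long they dwelled) into a small feature vector that any downstream clustering or model would consume. The event fields and features are hypothetical, not taken from this repo.

```python
# Hypothetical feature extraction from raw engagement events.
# Cross-user signals (who else liked the same thing, and when) would need
# a join across users and are omitted here for brevity.
from dataclasses import dataclass
import math

@dataclass
class LikeEvent:
    user_id: int
    tweet_id: int
    liked_at: float        # unix timestamp of the like
    dwell_seconds: float   # time spent on the tweet around the like

def user_features(events: list[LikeEvent], now: float) -> list[float]:
    """Collapse one user's like history into a small numeric vector."""
    if not events:
        return [0.0, 0.0, 0.0]
    # Exponentially decayed recency, averaged over likes (one-day decay scale).
    recency = sum(math.exp(-(now - e.liked_at) / 86_400) for e in events) / len(events)
    avg_dwell = sum(e.dwell_seconds for e in events) / len(events)
    return [float(len(events)), recency, avg_dwell]
```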

wiseaidev commented 1 year ago

I agree with the sentiments expressed here. But K-means presents major drawbacks: