I like the clustering approach, but I don't like that k-means makes you say up front how many clusters there are going to be (I'm discovering too, it's a new day, I don't know yet, right??). I want to experiment with other clustering algorithms that make different assumptions and trade-offs about the data.
DBSCAN seems interesting because it finds clusters based on density. So instead of a cluster count, you have to say what the expected density should be: how close together posts need to be, and how many of them it takes, before they count as a cluster.
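To make the "density threshold" idea concrete, here's a minimal from-scratch sketch of DBSCAN in plain Python. It's only an illustration, not what the PR should ship: the names `dbscan`, `cosine_distance`, and the toy 2-D "embeddings" are made up for this example, and cosine distance is an assumption about what suits the embedding model. scikit-learn's `DBSCAN` exposes the same two knobs as `eps` and `min_samples`.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(points, eps, min_samples, distance=cosine_distance):
    """Return one label per point: 0, 1, 2, ... for clusters, NOISE otherwise."""
    n = len(points)
    # A point's neighborhood is everything within eps, including itself.
    neighbors = [
        [j for j in range(n) if distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # not dense enough; may become a border point later
            continue
        # i is a core point: start a new cluster and grow it outward.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core, keep growing
    return labels


# Toy data: two tight groups of directions plus one stray point.
points = [
    (1.0, 0.0), (0.99, 0.05), (0.98, 0.1),
    (0.0, 1.0), (0.05, 0.99), (0.1, 0.98),
    (-1.0, -1.0),
]
labels = dbscan(points, eps=0.02, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Nobody told it there were two clusters; that fell out of `eps` and `min_samples`. The stray point gets `-1` instead of being forced into a group.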
I expect that there will be a lot of tweaking to make it work for a given embedding model, but once it's tuned I think it'll be a lot more dynamic and robust, since the number of clusters can change from day to day.
Note: DBSCAN doesn't assign all posts to a cluster, so you might not be able to use toot_clusters.html on its own. You'll probably need an offshoot of it. Feel free to skip this part on the first pass of the PR; we might even be able to get someone else to do this part.
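One possible shape for that offshoot, as a rough sketch: split the posts into real clusters plus an "unclustered" bucket before they reach the template. The function name `split_clusters_and_noise` and the idea of passing two separate values to the template are assumptions for illustration; `-1` is the label scikit-learn's DBSCAN gives to noise points.

```python
from collections import defaultdict

NOISE = -1  # label scikit-learn's DBSCAN assigns to unclustered points


def split_clusters_and_noise(posts, labels):
    """Group posts by cluster label, keeping noise posts in a separate list."""
    clusters = defaultdict(list)
    unclustered = []
    for post, label in zip(posts, labels):
        if label == NOISE:
            unclustered.append(post)
        else:
            clusters[label].append(post)
    return dict(clusters), unclustered


# Hypothetical posts and labels, just to show the output shape.
posts = ["post a", "post b", "post c", "post d"]
labels = [0, -1, 0, 1]
clusters, unclustered = split_clusters_and_noise(posts, labels)
print(clusters)     # {0: ['post a', 'post c'], 1: ['post d']}
print(unclustered)  # ['post b']
```

The template offshoot could then render `clusters` the way it does today and show `unclustered` in its own section, or leave it out entirely.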