nicholas-leonard / equanimity

Experimental research for distributed conditional computation
4 stars 0 forks source link

Clustering sentences into expert datasets #40

Closed nicholas-leonard closed 10 years ago

nicholas-leonard commented 10 years ago

We perform pre-clustering of sentences on a subset of the available data. Using our python/postgreSQL script, we can perform this hierarchical clustering by first generating a graph of similarity arrows between sentence nodes. We end up with a balanced tree where each leaf is a sentence, one of the levels is used to map experts to a cluster of sentences. The tree of nodes encompassed from the root to the expert level are used to fix the hierarchical representation of the experts in the gater. This hierarchy will not change its structure, yet the distribution of examples among its leafs will change during the mapping phase. This will allow the gater to reuse its non-leaf output weights from cycle to cycle.

nicholas-leonard commented 10 years ago

To do this we need a table mapping each sentence to its set of words. We use this table to generate the similarity arrows. We use the arrows to perform the clustering of sentences. We use the clusters to train the experts. We train a gater to optimize the use of the trained experts. We use the trained gater to distribute all examples to experts.

nicholas-leonard commented 10 years ago

Finished clustering 1M sentences into 100k clusters.

nicholas-leonard commented 10 years ago

Done everything.