tdunning / knn

Large scale k-nn experiments
http://mahout.mapr.com

Need generator for nominal data #3

Closed tdunning closed 12 years ago

tdunning commented 12 years ago

The input here would be a frequency distribution. The samples would be drawn from the discrete distribution specified as the input.

Some ways to describe the discrete distribution include:

a) using a multiset containing counts

b) using a map containing counts

c) using a specification of a long-tailed distribution of some kind. One option would be to specify a power law and vocabulary size.

d) using a Chinese restaurant process or Pitman-Yor process. See http://en.wikipedia.org/wiki/Chinese_restaurant_process and http://en.wikipedia.org/wiki/Pitman%E2%80%93Yor_process

Note that the Chinese restaurant process generates symbols from an infinite vocabulary, so the assumption of a finite output set should not be built into the system.
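To make the infinite-vocabulary point concrete, here is a minimal sketch of a Chinese restaurant process sampler. The class name `ChineseRestaurantProcess`, the `alpha` concentration parameter, and the choice of consecutive integers as symbols are all assumptions for illustration, not code from this repository. Each draw returns an existing symbol with probability proportional to its count so far, or a brand-new symbol with probability proportional to `alpha`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a Chinese restaurant process sampler; not the repository's code.
final class ChineseRestaurantProcess {
    private final double alpha;                              // concentration: larger values yield more new symbols
    private final List<Integer> counts = new ArrayList<>();  // counts.get(k) = times symbol k has been drawn
    private int total = 0;                                   // total number of draws so far
    private final Random random;

    ChineseRestaurantProcess(double alpha, Random random) {
        if (alpha <= 0) {
            throw new IllegalArgumentException("alpha must be positive");
        }
        this.alpha = alpha;
        this.random = random;
    }

    // Draws one symbol. Symbols are integers 0, 1, 2, ... assigned in order of first appearance.
    int sample() {
        double u = random.nextDouble() * (total + alpha);
        int symbol = counts.size();          // default: open a new symbol
        double cumulative = 0;
        for (int k = 0; k < counts.size(); k++) {
            cumulative += counts.get(k);
            if (u < cumulative) {            // landed in the mass of an existing symbol
                symbol = k;
                break;
            }
        }
        if (symbol == counts.size()) {
            counts.add(1);                   // the vocabulary grows by one
        } else {
            counts.set(symbol, counts.get(symbol) + 1);
        }
        total++;
        return symbol;
    }

    int distinctSymbols() {
        return counts.size();
    }
}
```

The vocabulary here is just a growing list, so nothing fixes the output set in advance. The linear scan costs O(number of distinct symbols) per draw, which is fine for a sketch but would need a smarter structure at scale.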

tdunning commented 12 years ago

generator.Multinomial does this now. You can pass in a Multiset and it will match the frequencies you give. You can also give real-valued probabilities. This handles (a) and (b).
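For readers who want to see the idea behind (a) and (b), the following is a rough sketch of sampling in proportion to given counts or real-valued weights. It is not the actual `generator.Multinomial` code; the class `CountSampler` and its methods are hypothetical names, and it assumes a plain `java.util.Map` of item to weight as input. It builds a cumulative weight table once and then binary-searches it for each draw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for illustration; not the repository's generator.Multinomial.
final class CountSampler<T> {
    private final List<T> items = new ArrayList<>();
    private final double[] cumulative;   // cumulative[i] = sum of weights of items 0..i
    private final Random random;

    // Weights may be integer counts or real-valued probabilities; they need not sum to 1.
    CountSampler(Map<T, ? extends Number> weights, Random random) {
        if (weights.isEmpty()) {
            throw new IllegalArgumentException("need at least one item");
        }
        this.random = random;
        this.cumulative = new double[weights.size()];
        double total = 0;
        int i = 0;
        for (Map.Entry<T, ? extends Number> e : weights.entrySet()) {
            double w = e.getValue().doubleValue();
            if (w < 0) {
                throw new IllegalArgumentException("negative weight for " + e.getKey());
            }
            total += w;
            items.add(e.getKey());
            cumulative[i++] = total;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
    }

    // Returns an item with probability proportional to its weight.
    T sample() {
        double u = random.nextDouble() * cumulative[cumulative.length - 1];
        // Binary search for the first index whose cumulative weight exceeds u.
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > u) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return items.get(lo);
    }
}
```

With counts such as `{a=7, b=2, c=1}`, roughly 70% of draws should come back as `a`. Each draw costs O(log n) after an O(n) setup, and items with zero weight are never returned.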

Item (d) is being replaced by an Indian Buffet process since that will give better document surrogates. The Indian Buffet process should be suitable for (c) as well.