probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0

Bigram distribution over strings #7

Closed alex-lew closed 1 month ago

alex-lew commented 1 month ago

Once we have addressed #3, we will want to add a simple distribution over string-valued data.

I suggest the following setup:

The state we track would be a matrix of observed transition counts -- how often did we transition from letter i to letter j, for each letter?

Actually, I think it should be possible to implement this internally in terms of a vector of Dirichlet-Categorical distributions.
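For concreteness, here is a rough Python sketch of that idea: one Dirichlet-Categorical per "previous symbol", each holding one row of the transition-count matrix. None of this is the repo's actual code. The class names, the symmetric `alpha` prior, and the use of a single shared boundary index for both start and end are assumptions made for illustration only.

```python
import math


class DirichletCategorical:
    """Collapsed Dirichlet-Categorical over k outcomes (hypothetical helper).

    Assumes a symmetric Dirichlet prior with concentration alpha.
    """

    def __init__(self, k, alpha=1.0):
        self.k = k
        self.alpha = alpha
        self.counts = [0] * k
        self.n = 0

    def incorporate(self, x):
        self.counts[x] += 1
        self.n += 1

    def unincorporate(self, x):
        self.counts[x] -= 1
        self.n -= 1

    def logp(self, x):
        # Posterior predictive probability of outcome x given current counts.
        return math.log((self.alpha + self.counts[x]) / (self.alpha * self.k + self.n))


class Bigram:
    """Bigram distribution over strings, as a vector of Dirichlet-Categoricals.

    rows[i] models the next symbol given previous symbol i, so the rows'
    counts together form the transition-count matrix.
    """

    def __init__(self, alphabet, alpha=1.0):
        self.index = {c: i for i, c in enumerate(alphabet)}
        self.boundary = len(alphabet)  # assumed shared start/end symbol
        size = len(alphabet) + 1
        self.rows = [DirichletCategorical(size, alpha) for _ in range(size)]

    def _transitions(self, s):
        ids = [self.boundary] + [self.index[c] for c in s] + [self.boundary]
        return list(zip(ids, ids[1:]))

    def incorporate(self, s):
        for i, j in self._transitions(s):
            self.rows[i].incorporate(j)

    def unincorporate(self, s):
        for i, j in self._transitions(s):
            self.rows[i].unincorporate(j)

    def logp(self, s):
        # Chain rule: score each transition, then temporarily incorporate it
        # so repeated transitions within s see updated counts.
        transitions = self._transitions(s)
        total = 0.0
        for i, j in transitions:
            total += self.rows[i].logp(j)
            self.rows[i].incorporate(j)
        # Restore the state so that logp has no side effects.
        for i, j in transitions:
            self.rows[i].unincorporate(j)
        return total
```

Taking the alphabet as a constructor argument keeps the sketch agnostic about whether symbols are ASCII characters, Unicode code points, or tokens.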

Joaoloula commented 1 month ago

This could be fixed, or it could be an input to the distribution's constructor, as k is in Emily's implementation of Dirichlet-Categorical.

I like the latter option: it would give us more flexibility to bridge between ASCII, Unicode, token vocabularies, etc.

> $i=V+1$ as a special 'start symbol' and $j=V+1$ as a special 'end symbol'.

maybe a typo and you mean i=0, or i=V?

> Actually, I think it should be possible to implement this internally in terms of a vector of Dirichlet-Categorical distributions.

sounds good to me!

alex-lew commented 1 month ago

> maybe a typo and you mean i=0, or i=V?

Ah, I was thinking with 1-indexed subscripts. So $i=1, \dots, V$ for the actual vocabulary, and $i=V+1$ for the special symbol. But maybe we should 0-index to keep closer to the code.
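To make the 0-indexed version concrete, here is a small Python sketch of the count matrix under that convention: indices $0, \dots, V-1$ are the vocabulary, and index $V$ is the special symbol, used as the row for "start" and the column for "end". The function name `transition_counts` and the list-of-lists layout are hypothetical, just for illustration.

```python
def transition_counts(strings, alphabet):
    """Build a (V+1) x (V+1) matrix of observed transition counts.

    Assumed convention: symbols are 0-indexed, and index V is the special
    boundary symbol (row V = transitions from start, column V = transitions
    to end).
    """
    V = len(alphabet)
    index = {c: i for i, c in enumerate(alphabet)}
    counts = [[0] * (V + 1) for _ in range(V + 1)]
    for s in strings:
        # Pad the string with the boundary symbol on both sides.
        ids = [V] + [index[c] for c in s] + [V]
        for i, j in zip(ids, ids[1:]):
            counts[i][j] += 1
    return counts


counts = transition_counts(["ab", "b"], "ab")
# counts == [[0, 1, 0], [0, 0, 2], [1, 1, 0]]
```

In the 1-indexed notation, row and column $V+1$ of the matrix correspond to row and column `V` here.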