Bigram distribution over strings #7

alex-lew commented 1 month ago

Once we have addressed #3, we will want to add a simple distribution over string-valued data.

I suggest the following setup:

- We assume an alphabet $\Sigma$ of $V$ allowed characters for the strings. This could be fixed, or an input to the distribution's constructor, as $k$ is in Emily's implementation of Dirichlet-Categorical (#1).
- The likelihood is parameterized by a transition matrix $T \in \mathbb{R}^{(V+1) \times (V+1)}$, where $T_{i,j}$ is the probability of transitioning from letter $i$ to letter $j$, and we interpret $i=V+1$ as a special 'start symbol' and $j=V+1$ as a special 'end symbol'. Note that each row of the matrix should sum to 1. Then the likelihood of a string $x_1, \dots, x_n$ is the product $T_{V+1, x_1} \cdot \left(\prod_{1 \leq i \leq n-1} T_{x_i, x_{i+1}}\right) \cdot T_{x_n, V+1}$.
- The prior over $T$ is given by a product of $V+1$ Dirichlet distributions, one for each row of the transition matrix. For now, they could all be symmetric Dirichlet distributions with the same $\alpha$ parameter.
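To make the likelihood concrete, here is a minimal Python sketch of the product above. The function name `bigram_log_likelihood` and the nested-list representation of $T$ are assumptions for illustration only, and the code uses 0-based indices, so the special start/end symbol sits at index `V` rather than $V+1$.

```python
import math

def bigram_log_likelihood(s, alphabet, T):
    """Log-probability of string s under a (V+1) x (V+1) transition matrix T.

    Hypothetical helper: T is a nested list, and index V (0-based) plays the
    role of both the start symbol (as a row) and the end symbol (as a column).
    """
    V = len(alphabet)
    index = {c: k for k, c in enumerate(alphabet)}
    # Pad with the special symbol: start -> x_1 -> ... -> x_n -> end.
    path = [V] + [index[c] for c in s] + [V]
    # Sum of log T[i][j] over consecutive pairs is the log of the product.
    return sum(math.log(T[i][j]) for i, j in zip(path, path[1:]))

# Uniform transitions over a two-letter alphabet: every entry is 1/3.
alphabet = "ab"
T = [[1.0 / 3.0] * 3 for _ in range(3)]
logp = bigram_log_likelihood("ab", alphabet, T)   # three transitions
logp_empty = bigram_log_likelihood("", alphabet, T)  # start -> end only
```

Working in log space avoids underflow on long strings; the empty string reduces to the single start-to-end transition.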

The state we track would be a matrix of observed transition counts: how often did we transition from letter $i$ to letter $j$, for each pair of letters (including the start and end symbols)?
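As a rough sketch of that state, the snippet below tallies transitions for a few strings and turns the counts into a posterior-mean estimate of $T$ under a symmetric Dirichlet prior, using the standard conjugate formula $(N_{i,j} + \alpha) / (\sum_j N_{i,j} + (V+1)\alpha)$. The names `count_transitions` and `posterior_mean` are made up for this example, and indices are 0-based with the special symbol at `V`.

```python
def count_transitions(strings, alphabet):
    """Return a (V+1) x (V+1) matrix of transition counts (hypothetical helper)."""
    V = len(alphabet)
    index = {c: k for k, c in enumerate(alphabet)}
    counts = [[0] * (V + 1) for _ in range(V + 1)]
    for s in strings:
        # Index V marks both the start and the end of each string.
        path = [V] + [index[c] for c in s] + [V]
        for i, j in zip(path, path[1:]):
            counts[i][j] += 1
    return counts

def posterior_mean(counts, alpha):
    """Posterior mean of each row under a symmetric Dirichlet(alpha) prior."""
    K = len(counts)  # K = V + 1 outcomes per row
    return [[(n + alpha) / (sum(row) + K * alpha) for n in row] for row in counts]

counts = count_transitions(["ab", "aa"], "ab")
T_hat = posterior_mean(counts, alpha=1.0)
```

Each row of `T_hat` sums to 1 by construction, so the estimate is itself a valid transition matrix.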

Actually, I think it should be possible to implement this internally in terms of a vector of Dirichlet-Categorical distributions.
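Here is one way that might look, as a sketch only. The `DirichletCategorical` class below is a hypothetical stand-in written for this example, not the implementation from #1, and the method names (`incorporate`, `unincorporate`, `logp`) are assumptions. Strings are given as lists of 0-based letter indices, with index `V` as the special symbol.

```python
import math

class DirichletCategorical:
    """Hypothetical stand-in: symmetric Dirichlet prior over k outcomes."""
    def __init__(self, k, alpha):
        self.k = k
        self.alpha = alpha
        self.counts = [0] * k
        self.n = 0

    def incorporate(self, x):
        self.counts[x] += 1
        self.n += 1

    def unincorporate(self, x):
        self.counts[x] -= 1
        self.n -= 1

    def logp(self, x):
        # Posterior predictive probability of outcome x given the counts so far.
        return math.log((self.counts[x] + self.alpha) / (self.n + self.k * self.alpha))

class Bigram:
    """One DirichletCategorical per row of the transition matrix."""
    def __init__(self, V, alpha):
        self.V = V
        self.rows = [DirichletCategorical(V + 1, alpha) for _ in range(V + 1)]

    def _transitions(self, xs):
        path = [self.V] + list(xs) + [self.V]
        return list(zip(path, path[1:]))

    def incorporate(self, xs):
        for i, j in self._transitions(xs):
            self.rows[i].incorporate(j)

    def unincorporate(self, xs):
        for i, j in self._transitions(xs):
            self.rows[i].unincorporate(j)

    def logp(self, xs):
        # Chain rule: score each transition, then incorporate it so that later
        # transitions in the same string condition on the earlier ones.
        total = 0.0
        for i, j in self._transitions(xs):
            total += self.rows[i].logp(j)
            self.rows[i].incorporate(j)
        self.unincorporate(xs)  # restore the state we had before scoring
        return total

model = Bigram(V=2, alpha=1.0)
before = model.logp([0, 1])
model.incorporate([0, 1])
after = model.logp([0, 1])
```

The transition-count matrix is then just the stacked `counts` vectors of the rows, so no separate state is needed.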

Joaoloula commented 1 month ago

> This could be fixed, or an input to the distribution's constructor, as $k$ is in Emily's implementation of Dirichlet-Categorical

I like the latter option: it would give us more flexibility to bridge between ASCII, Unicode, token vocabularies, etc.
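For instance, a constructor that takes the alphabet as input could accept any sequence of hashable symbols, so characters and tokens go through the same path. This is only a sketch; the `Alphabet` class and its `encode` method are hypothetical names, not anything in the repo.

```python
import string

class Alphabet:
    """Maps symbols (characters or tokens) to 0-based indices.

    Hypothetical helper: index V is left free for the special start/end symbol.
    """
    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.index = {s: k for k, s in enumerate(self.symbols)}
        if len(self.index) != len(self.symbols):
            raise ValueError("alphabet contains duplicate symbols")
        self.V = len(self.symbols)

    def encode(self, sequence):
        # Raises KeyError for any symbol outside the alphabet.
        return [self.index[s] for s in sequence]

ascii_lower = Alphabet(string.ascii_lowercase)   # character-level alphabet
tokens = Alphabet(["the", "cat", "sat"])         # token-level vocabulary
```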

> $i=V+1$ as a special 'start symbol' and $j=V+1$ as a special 'end symbol'.

maybe a typo and you mean i=0, or i=V?

> Actually, I think it should be possible to implement this internally in terms of a vector of Dirichlet-Categorical distributions.

sounds good to me!

alex-lew commented 1 month ago

> maybe a typo and you mean i=0, or i=V?

Ah, I was thinking with 1-indexed subscripts. So $i=1, \dots, V$ for the actual vocabulary, and $i=V+1$ for the special symbol. But maybe we should 0-index to keep closer to the code.
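If we did 0-index, one way the convention might read (a sketch, not something settled here): the letters are $0, \dots, V-1$ and the special start/end symbol is index $V$, so the likelihood of $x_1, \dots, x_n$ becomes

$$T_{V, x_1} \cdot \left(\prod_{1 \leq i \leq n-1} T_{x_i, x_{i+1}}\right) \cdot T_{x_n, V}$$

Only the index of the special symbol changes; $T$ is still $(V+1) \times (V+1)$ and each row still sums to 1.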
