sanagno / adaptively_sparse_attention


Entmax optimization problem #1

Open azhx opened 7 months ago

azhx commented 7 months ago

Hi, in your paper you state that $\alpha$-sigmoid is defined as

[image: the paper's definition of $\alpha$-sigmoid]

However, the entmax_bisect function you use solves the optimization problem

max_p <x, p> - H_a(p)

Can you clarify this discrepancy?

sanagno commented 7 months ago

Hi,

You can define $\alpha$-sigmoid$(x)$ (using parameters $p_x$) in terms of $\alpha$-entmax$(\mathbf{y})$ (using parameters $\mathbf{p_y}$) by setting $\mathbf{y} = [x, 0]$ and $\mathbf{p_y} = [p_x, 1 - p_x]$. Let me know if that does not answer your question!
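For readers following along, here is a minimal pure-Python sketch of this reduction. It is not the repository's code: `entmax_bisect_sketch` is a hypothetical stand-in for the library's entmax_bisect, assuming the threshold form $p_i = [(\alpha - 1) y_i - \tau]_+^{1/(\alpha - 1)}$ from Peters et al. with $\alpha > 1$, and `alpha_sigmoid` is an illustrative name.

```python
def entmax_bisect_sketch(y, alpha=1.5, n_iter=60):
    """Illustrative alpha-entmax of a list of scores via bisection on tau (alpha > 1)."""
    d = len(y)
    scaled = [(alpha - 1.0) * v for v in y]
    m = max(scaled)
    lo = m - 1.0                          # here the largest entry is 1, so sum(p) >= 1
    hi = m - (1.0 / d) ** (alpha - 1.0)   # here every entry is <= 1/d, so sum(p) <= 1

    def probs(tau):
        return [max(s - tau, 0.0) ** (1.0 / (alpha - 1.0)) for s in scaled]

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid

    p = probs(lo)
    total = sum(p)
    return [v / total for v in p]         # renormalize away the residual bisection error


def alpha_sigmoid(x, alpha=1.5):
    """alpha-sigmoid(x) as the first coordinate of alpha-entmax([x, 0])."""
    return entmax_bisect_sketch([x, 0.0], alpha)[0]


if __name__ == "__main__":
    for x in (-10.0, -0.7, 0.0, 0.7, 10.0):
        print(x, alpha_sigmoid(x))
```

Unlike the usual sigmoid, the output saturates at exactly 0 or 1 for sufficiently large $|x|$, which is the sparsity that $\alpha$-entmax provides.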

azhx commented 7 months ago

I understand that this is how you're defining $\alpha$-sigmoid. It's just that the docstring for the entmax_bisect function says the optimization being solved is max_p <x, p> - H_a(p). In the original paper by Peters et al., they also seem to say that the bisection algorithm solves the maximization problem with the addition rather than the subtraction. Maybe the docstring has a typo? Or have I missed something mathematically? I haven't gone deep into the bisection algorithm itself.

Anyways, I understand how your method works, so all good. Thanks!
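One way to probe the sign question numerically, as a rough sketch rather than a proof: for the two-class case with $\alpha = 2$, grid-search $p \in [0, 1]$ under both signs and compare against the known closed form of 2-d sparsemax, $p = \mathrm{clip}((x + 1)/2, 0, 1)$. The Tsallis entropy below assumes the definition $H_\alpha(p) = \frac{1}{\alpha(\alpha - 1)} \sum_j (p_j - p_j^\alpha)$ from Peters et al.; the helper names are hypothetical.

```python
def tsallis_entropy(p, alpha):
    """Tsallis alpha-entropy, sum_j (p_j - p_j**alpha) / (alpha * (alpha - 1)), alpha > 1."""
    return sum(pj - pj ** alpha for pj in p) / (alpha * (alpha - 1.0))


def grid_argmax(x, alpha, sign, steps=10000):
    """Grid-search p in [0, 1] maximizing p * x + sign * H_a([p, 1 - p])."""
    best_p, best_val = 0.0, float("-inf")
    for i in range(steps + 1):
        p = i / steps
        val = p * x + sign * tsallis_entropy([p, 1.0 - p], alpha)
        if val > best_val:
            best_p, best_val = p, val
    return best_p


x = 0.4
p_plus = grid_argmax(x, alpha=2.0, sign=+1.0)        # objective with "+ H_a(p)"
p_minus = grid_argmax(x, alpha=2.0, sign=-1.0)       # objective as the docstring writes it
sparsemax_2d = min(max((x + 1.0) / 2.0, 0.0), 1.0)   # closed form for alpha = 2

if __name__ == "__main__":
    print(p_plus, p_minus, sparsemax_2d)
```

In this check the "+" objective recovers the sparsemax value, while the "-" objective is convex in $p$ and is maximized at an endpoint of $[0, 1]$. That is consistent with the docstring's minus sign being a typo or a different sign convention for $H_a$, though only the library authors can confirm which.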