P(w) computation doesn't take in account weights if counts=True

fvalle1 commented 4 years ago

Hi.

I'm trying to reconstruct P(w) by multiplying P(word|topic)P(topic|sample)P(sample) namely applying matrix multiplication at p_w_tw times p_tw_d.

If I understood right everything P(w) should be the frequency of the word in the corpus. Considering the corpus in your paper, this didn't happen though...

p_w_topsbm_original.pdf

I noticed that in get_groups method the rows

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    n_db[int(v1), z1] += 1
    n_dbw[int(v1), z2] += 1
    n_wb[int(v2) - D, z2] += 1

don't take care of edges' weight. This is an issue if the graph was built with `counts=True'.

Modifying that rows to

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    weight = g.ep["count"][e]
    n_db[int(v1), z1] += weight
    n_dbw[int(v1), z2] += weight
    n_wb[int(v2) - D, z2] += weight

The P(w) obtained from P(w|tw)P(tw|d)P(d) is actually the frequency

p_w_topsbm.pdf

Is there something wrong with my assumptions? I mean multiplying P(w|tw) times P(tw|d) times P(d) should give the frequency of word w, right?

The probabilities should take into account the weight of the edges, or is there some other factor that I'm missing?

Thank you

Filippo

martingerlach commented 4 years ago

Filippo, you are completely right. This is an error when using counts=True. I write the initial code without that feature and didnt realize it would affect here. Thanks for catching this one. Do you want to do a pull-request? Otherwise I can add the fix later

fvalle1 commented 4 years ago

Thank you for your prompt reply.

I opened the PR that fixes this.

Have a nice day!

martingerlach / hSBM_Topicmodel

P(w) computation doesn't take in account weights if counts=True #11