martingerlach / hSBM_Topicmodel

Using stochastic block models for topic modeling
GNU General Public License v3.0
191 stars 40 forks source link

P(w) computation doesn't take in account weights if counts=True #11

Closed fvalle1 closed 4 years ago

fvalle1 commented 4 years ago

Hi.

I'm trying to reconstruct P(w) by multiplying P(word|topic)P(topic|sample)P(sample) namely applying matrix multiplication at p_w_tw times p_tw_d.

If I understood right everything P(w) should be the frequency of the word in the corpus. Considering the corpus in your paper, this didn't happen though...

p_w_topsbm_original.pdf

I noticed that in get_groups method the rows

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    n_db[int(v1), z1] += 1
    n_dbw[int(v1), z2] += 1
    n_wb[int(v2) - D, z2] += 1

don't take care of edges' weight. This is an issue if the graph was built with `counts=True'.

Modifying that rows to

for e in g.edges():
    z1, z2 = state_l_edges[e]
    v1 = e.source()
    v2 = e.target()
    weight = g.ep["count"][e]
    n_db[int(v1), z1] += weight
    n_dbw[int(v1), z2] += weight
    n_wb[int(v2) - D, z2] += weight

The P(w) obtained from P(w|tw)P(tw|d)P(d) is actually the frequency

p_w_topsbm.pdf

Is there something wrong with my assumptions? I mean multiplying P(w|tw) times P(tw|d) times P(d) should give the frequency of word w, right?

The probabilities should take into account the weight of the edges, or is there some other factor that I'm missing?

Thank you

Filippo

martingerlach commented 4 years ago

Filippo, you are completely right. This is an error when using counts=True. I write the initial code without that feature and didnt realize it would affect here. Thanks for catching this one. Do you want to do a pull-request? Otherwise I can add the fix later

fvalle1 commented 4 years ago

Thank you for your prompt reply.

I opened the PR that fixes this.

Have a nice day!