sronnqvist / doc2topic

Neural topic modeling

Generated topics for document only come from first document #4

Closed by teresaibarra 5 years ago

teresaibarra commented 5 years ago

Hi there,

I was able to run this on my own dataset, but I'm seeing some unexpected behavior. I can generate topics and weights for each document, but the topics returned are only words from the first document. I've attached some examples in case this isn't clear. My fork of the code can be found here.

Changing the number of topics seems to help get indices of unique topics but I'm unsure how that impacts the generated model.

Running print_topic_words gives me the below output, which shows there are some great topics within my dataset:

1: walk, may, away, beauti, look, voic, hi, memori, turn, everyon
2: fun, look, hi, great, buy, hey, big, doe, better, new
3: bar, answer, peopl, befor, day, help, look, good, moment, da
4: drink, dum, thousand, fli, well, rememb, away, la, look, feet
5: bye, lovin, honey, faith, may, bop, wheel, find, readi, gone
6: gonna, way, heart, look, feelin, sometim, need, goodby, happen, someon
7: ah, ya, need, oo, wanna, nobodi, togeth, miss, know, ever
8: doo, song, oh, sing, boy, ever, wa, would, chee, miss
9: ladi, dead, deep, littl, boy, guy, leav, look, insid, bit
10: summer, easi, lay, heart, look, tryin, fall, la, two, black
11: river, kill, aliv, hard, long, day, summer, look, busi, shout
12: christma, snow, happi, year, go, knock, like, wa, train, littl
13: la, give, sing, gotta, look, ah, two, hundr, ring, onli
14: run, sail, wind, befor, hard, must, look, without, fun, make

Here are the topics generated for each song:

  {
    "artist": "ABBA",
    "title": "Ahe's My Kind Of Girl",
    "topics": [
      "make",
      "pleas",
      "plan",
      "believ",
      "without",
      "go",
      "squeez",
      "park",
      "fine",
      "could",
      "someth",
      "face",
      "mean",
      "talk",
      "hand",
      "wonder",
      "gentli",
      "hold",
      "hour"
    ]
  },
  {
    "artist": "ABBA",
    "title": "Andante, Andante",
    "topics": [
      "park",
      "mine",
      "hour",
      "like",
      "squeez",
      "blue",
      "face",
      "gentli",
      "thing",
      "pleas",
      "walk",
      "make",
      "believ",
      "take",
      "feel",
      "look",
      "girl",
      "see",
      "easi",
      "smile",
      "lucki",
      "without",
      "plan",
      "wonder",
      "fellow",
      "go"
    ]
  },
  {
    "artist": "ABBA",
    "title": "As Good As New",
    "topics": [
      "lucki",
      "fine",
      "make",
      "ever",
      "feel",
      "kind"
    ]
  }...

Note that all the generated topics are words from the first song "She's My Kind of Girl", but are not relevant for any other documents in my dataset. Please let me know if what I'm saying is unclear.

sronnqvist commented 5 years ago

Hi!

The issue seems to be in your get_document_topics_json():

for it_index, it_val in sorted_list:
    topic_list.append(self.corpus.idx2token[it_index])

get_document_topics() returns topic IDs for a given document, so sorted_list is a list of topic IDs. You convert these using corpus.idx2token, which is a mapping from token ID to token, so each topic ID is looked up as if it were a token ID and comes back as an unrelated word.
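
A minimal sketch of that mismatch, using toy data rather than doc2topic itself. All names here (`docs`, `topic_words`, `doc_topics`) are hypothetical, and the sketch assumes token IDs are assigned in order of first appearance, which would explain why small topic IDs resolve to words from the first document:

```python
# Hypothetical toy data; none of these names come from doc2topic itself.
docs = [
    ["make", "pleas", "plan", "believ"],  # first document
    ["river", "kill", "aliv", "hard"],    # second document
]

# Assumption: token IDs are handed out in order of first appearance,
# so the lowest IDs all belong to words from the first document.
token2idx = {}
idx2token = {}
for doc in docs:
    for tok in doc:
        if tok not in token2idx:
            token2idx[tok] = len(token2idx)
            idx2token[token2idx[tok]] = tok

# Topic keywords keyed by topic ID: what a topic ID should resolve to.
topic_words = {
    0: ["walk", "may", "away"],
    1: ["fun", "look", "hi"],
    2: ["bar", "answer", "peopl"],
}

# Pretend these are the (topic_id, weight) pairs for the second document.
doc_topics = [(2, 0.6), (0, 0.3)]

# Bug: topic IDs are used as token IDs.
wrong = [idx2token[topic_id] for topic_id, _ in doc_topics]

# Fix: topic IDs are resolved against the topic keyword lists.
right = [topic_words[topic_id] for topic_id, _ in doc_topics]

print(wrong)
print(right)
```

In this sketch `wrong` only ever contains tokens from the first document, while `right` contains the keyword lists of the topics actually assigned to the document.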

If you really want to obtain words for each document/song, you would need to combine document-topic assignments and topic-word assignments somehow. That said, I don't think it makes much sense: it's easier and probably more meaningful to directly show the actual words of that document.
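
One possible way to do that combination, sketched with made-up numbers: score each word by summing its weight in every topic, scaled by that topic's weight in the document. This is only an illustration of the idea, not something doc2topic provides; `vocab`, `topic_word`, `doc_topic` and `doc_word_scores` are hypothetical names.

```python
# Hypothetical toy numbers; the names are illustrative, not doc2topic's API.
vocab = ["walk", "away", "fun", "buy", "bar"]

# topic_word[k][v]: weight of word v in topic k (each row sums to 1).
topic_word = [
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.9],
]

# doc_topic[k]: weight of topic k in a single document (sums to 1).
doc_topic = [0.75, 0.25, 0.0]


def doc_word_scores(doc_topic, topic_word):
    """Return one score per word: sum over topics of doc weight * word weight."""
    n_words = len(topic_word[0])
    return [
        sum(doc_topic[k] * topic_word[k][v] for k in range(len(doc_topic)))
        for v in range(n_words)
    ]


scores = doc_word_scores(doc_topic, topic_word)

# Rank the vocabulary by score, highest first, and keep the top three words.
ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
top_words = [word for word, _ in ranked[:3]]

print(top_words)
```

Words ranked this way describe the topics a document is assigned to, not necessarily words that occur in the document itself, which is one reason showing the document's own words may be more meaningful.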

Considering that your data set is rather small, I would suggest starting out with fewer topics and/or getting more data.

-Samuel

teresaibarra commented 5 years ago

I may have misunderstood how your code works. To clarify, this project returns topic clusters for each document (i.e., the topic cluster would be 1, and the words in that topic cluster would be 1: walk, may, away, beauti, look, voic, hi, memori, turn, everyon)?

sronnqvist commented 5 years ago

That's correct. The structure is the same as in Latent Dirichlet Allocation, i.e., a topic is a (fuzzy) cluster of documents, whose meaning is described by keywords.