inpho / topic-explorer

System for building, visualizing, and working with LDA topic models
https://www.hypershelf.org/

`topicexplorer.from_config` #310

Open JaimieMurdock opened 6 years ago

JaimieMurdock commented 6 years ago

Originally raised in #150

Below is a mockup of the interface we're aiming for:

import topicexplorer
te = topicexplorer.from_config('ap.ini')

# access the corpus with .corpus
te.corpus

# access the individual models with dictionary-style keys (k = number of topics)
from vsm.viewer.ldacgsviewer import LdaCgsViewer
k = 20
assert isinstance(te[k], LdaCgsViewer)
te[k].theta
te[k].phi

# comparing two models using the interface
import topicexplorer.analysis
topicexplorer.analysis.model_dist(te[20], te[40])

# integrated past_to_text analysis
ordered_ids = ['some', 'labels', 'by', 'date']
p2t = topicexplorer.analysis.past_to_text(te[20], ordered_ids)
# returns raw numbers

# possible plot library?
import topicexplorer.analysis.plot
topicexplorer.analysis.plot.past_to_text(p2t)
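
One way the container behind this mockup could look, as a minimal sketch rather than the actual implementation. It assumes the `.ini` file has a `[main]` section with `corpus_file`, `model_pattern`, and `topics` keys. The `TopicExplorer` class and the injected `load_corpus` / `load_model` callables are hypothetical stand-ins for whatever loading routines end up being used.

```python
import ast
import configparser


class TopicExplorer:
    """Hypothetical container: one corpus plus lazily loaded models keyed by k."""

    def __init__(self, corpus_file, model_pattern, topics, load_corpus, load_model):
        self.corpus = load_corpus(corpus_file)
        self.topics = list(topics)
        self._model_pattern = model_pattern
        self._load_model = load_model
        self._models = {}  # cache: k -> loaded model

    def __getitem__(self, k):
        if k not in self.topics:
            raise KeyError("no model trained with k=%r" % (k,))
        if k not in self._models:
            # Load on first access only, then reuse the cached object.
            self._models[k] = self._load_model(self._model_pattern.format(k))
        return self._models[k]


def from_config(path, load_corpus, load_model):
    """Build a TopicExplorer from an .ini file (assumed layout, see above)."""
    # interpolation=None keeps '%' and '{0}' in paths untouched.
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(path):
        raise IOError("could not read config file: %s" % path)
    main = config["main"]
    topics = ast.literal_eval(main["topics"])  # e.g. "[20, 40]"
    return TopicExplorer(main["corpus_file"], main["model_pattern"],
                         topics, load_corpus, load_model)
```

Loading models lazily keeps `from_config` cheap in a notebook, since only the models that are actually indexed get read from disk.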

Some other thoughts:

# accessing doc-topic distributions
te[20].doc_topics('some-document') == te[20]['some-document']
# getting specific topic proportion:
te[20]['some-document'][2]

# accessing word-topic distributions
te[20].topics(2) == te[20][2]
topic = te[20].topics(2)
topic[topic['word'] == 'something'] == te[20][2]['something']
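
The equivalences above amount to dispatching `__getitem__` on the key type: a string is treated as a document label, an integer as a topic index. A minimal sketch with plain lists and dicts follows. `ModelView` and its constructor arguments are hypothetical and only stand in for the real viewer, whose `topics()` and `doc_topics()` return richer objects.

```python
class ModelView:
    """Hypothetical model wrapper; not the LdaCgsViewer API."""

    def __init__(self, vocab, labels, phi, theta):
        self.vocab = list(vocab)  # word for each column of phi
        self.phi = phi            # phi[k] = word probabilities for topic k
        self.theta = theta        # theta[d] = topic proportions for document d
        self._doc_index = {label: i for i, label in enumerate(labels)}

    def doc_topics(self, label):
        """Topic proportions for one document label."""
        return self.theta[self._doc_index[label]]

    def topics(self, k):
        """Word -> probability mapping for topic k."""
        return dict(zip(self.vocab, self.phi[k]))

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.doc_topics(key)
        if isinstance(key, int):
            return self.topics(key)
        raise TypeError("key must be a document label or a topic index")


view = ModelView(
    vocab=["something", "else"],
    labels=["some-document"],
    phi=[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    theta=[[0.1, 0.2, 0.7]],
)
```

This dispatch breaks down if document labels can themselves be integers, so a real implementation would need a rule for that case.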

This is too much for a single ticket and is closer to what I have in mind for a 2.0 release, but I want to at least reach the point where models can be loaded with topicexplorer.from_config() in notebooks.

JaimieMurdock commented 6 years ago

@colinallen Regarding your original comments on what happens when models are incommensurate: the methods I have reduce the vocabulary to the intersection of the two corpora's vocabularies and compare topic distance only on the remaining terms, without re-normalizing the distributions. This at least preserves the interpretation of a probabilistic source signal yielding tokens, while the non-assigned portions of each distribution (that is, the parts of the vocabulary in the difference) do not contribute to the model distance.
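
To make the comparison concrete, here is a rough sketch of restricting two models to the vocabulary they share and comparing topics without re-normalizing. The function names are hypothetical, and Jensen-Shannon divergence is an assumption used for illustration, since the comment does not say which distance is used.

```python
import math


def shared_vocabulary(vocab_a, vocab_b):
    """Words present in both vocabularies, in vocab_a order."""
    in_b = set(vocab_b)
    return [w for w in vocab_a if w in in_b]


def restrict(phi, vocab, shared):
    """Keep only the columns of phi for words in `shared`; no re-normalization."""
    index = {w: i for i, w in enumerate(vocab)}
    cols = [index[w] for w in shared]
    return [[row[c] for c in cols] for row in phi]


def js_divergence(p, q):
    """Jensen-Shannon divergence, summed term by term over the shared words."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log(pi / m)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / m)
    return total


def model_dist(phi_a, vocab_a, phi_b, vocab_b):
    """Pairwise topic divergences between two models on their shared words."""
    shared = shared_vocabulary(vocab_a, vocab_b)
    a = restrict(phi_a, vocab_a, shared)
    b = restrict(phi_b, vocab_b, shared)
    return [[js_divergence(p, q) for q in b] for p in a]
```

Because the rows are not re-normalized, each restricted topic may sum to less than 1, and the mass on words outside the shared vocabulary simply drops out of the sum.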

JaimieMurdock commented 6 years ago

Some desired improvements for the Corpus objects:

import topicexplorer
te = topicexplorer.from_config('sep.ini')

# use dictionary access to get the tokens
assert te.corpus['neo-kantianism'] == [25, 37, 141312, 12, ...] 

# assert a document label is in the Corpus object
assert 'neo-kantianism' in te.corpus
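
Both behaviours come down to implementing `__getitem__` and `__contains__` on the corpus. A minimal sketch, assuming documents are stored as lists of integer token ids keyed by label. `LabeledCorpus` is a hypothetical wrapper, not the existing Corpus class, and the token ids below are made up for the example.

```python
class LabeledCorpus:
    """Hypothetical wrapper giving label-based access to tokenized documents."""

    def __init__(self, labels, documents):
        if len(labels) != len(documents):
            raise ValueError("need exactly one label per document")
        self._docs = dict(zip(labels, documents))

    def __getitem__(self, label):
        # Raises KeyError for an unknown label, like a dict.
        return self._docs[label]

    def __contains__(self, label):
        return label in self._docs

    def __len__(self):
        return len(self._docs)


corpus = LabeledCorpus(
    labels=["neo-kantianism", "abduction"],
    documents=[[25, 37, 12], [4, 8]],
)
```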