Open JaimieMurdock opened 6 years ago
@colinallen Referring to your original comments on what happens when models are incommensurate: the methods I have reduce the vocabulary to the intersection of the two corpora and compare topic distance only on the remaining terms, without re-normalizing the distributions. This at least preserves the assumption that a probabilistic source signal is yielding tokens, while the non-assigned portions of each distribution (that is, the parts of the vocabulary in the difference) do not contribute to the model distance.
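To make the comparison concrete, here is a minimal sketch of a topic distance computed only over the terms both models share, with no re-normalization. The function name, the dict-of-probabilities representation, and the choice of Jensen-Shannon divergence are all assumptions for illustration; none of this is topicexplorer's actual implementation.

```python
import math


def shared_vocab_js_divergence(topic_a, topic_b):
    """Jensen-Shannon divergence over shared terms only (hypothetical helper).

    topic_a and topic_b map term -> probability under each model's topic.
    Terms found in only one vocabulary are skipped, and the remaining
    probabilities are NOT re-normalized, so the dropped mass simply
    contributes nothing to the result.
    """
    shared = topic_a.keys() & topic_b.keys()
    total = 0.0
    for term in shared:
        p, q = topic_a[term], topic_b[term]
        m = 0.5 * (p + q)
        # Guard against log(0); a zero-probability term adds nothing.
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


# Toy topics from two models with partially overlapping vocabularies.
topic_sep = {"kant": 0.5, "reason": 0.3, "duty": 0.2}
topic_other = {"kant": 0.5, "reason": 0.3, "gene": 0.2}
topic_third = {"kant": 0.1, "reason": 0.7, "gene": 0.2}

same = shared_vocab_js_divergence(topic_sep, topic_other)
different = shared_vocab_js_divergence(topic_sep, topic_third)
```

One consequence of skipping re-normalization is visible here: two topics that agree on every shared term have zero distance even when each puts some mass on terms the other model never saw.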
Some desired improvements for the Corpus objects:
import topicexplorer
te = topicexplorer.from_config('sep.ini')
# use dictionary access to get the tokens
assert te.corpus['neo-kantianism'] == [25, 37, 141312, 12, ...]
# assert a document label is in the Corpus object
assert 'neo-kantianism' in te.corpus
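As a rough sketch of how label-based access like this could be supported, a corpus wrapper only needs `__getitem__` and `__contains__` keyed by document label. The `LabeledCorpus` class, its constructor arguments, and the token IDs below are hypothetical and made up for illustration; they do not reflect the existing Corpus class.

```python
class LabeledCorpus:
    """Hypothetical stand-in for a corpus addressable by document label."""

    def __init__(self, labels, token_ids):
        if len(labels) != len(token_ids):
            raise ValueError("labels and token_ids must have the same length")
        # Map each document label to its list of integer token IDs.
        self._docs = dict(zip(labels, token_ids))

    def __getitem__(self, label):
        # Dictionary-style access: corpus['some-label'] -> token IDs.
        return self._docs[label]

    def __contains__(self, label):
        # Membership test: 'some-label' in corpus.
        return label in self._docs


# Illustrative data only; these token IDs are invented.
corpus = LabeledCorpus(
    labels=["neo-kantianism", "vitalism"],
    token_ids=[[25, 37, 12], [4, 8]],
)
```

With this shape, both asserts from the mockup translate directly: indexing by label returns the token list, and `in` checks whether a label exists.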
Originally raised in #150
The topicexplorer.from_config() snippet above is a mockup of the interface we're aiming for.
Some other thoughts:
This is too much for a single ticket, and definitely more of what I'm thinking for a 2.0, but I want to get at least to the point where the models are loaded with topicexplorer.from_config() in notebooks.