JaimieMurdock closed this issue 9 years ago
Implementation note: the API will change so that seeds are passed to `Model()` rather than to `Model.train()`. We will need a version bump to 0.3 soon, and I will include this change in the first alpha.
After a conversation with Robert about the implementation for `LdaCgsMulti`, I've come to the following set of refactors:
```python
class LdaCgsMulti(LdaCgsSeq):
    # move both `n_proc` and `seeds` to `__init__`
    def __init__(self, ..., n_proc=2, seeds=None):
        self.n_proc = n_proc
        if seeds is None:
            # randomly init a seed for each of the n_proc processes
            seeds = [np.random.randint(0, 2**31 - 1) for _ in range(n_proc)]
        self.seeds = seeds
        self._mtrand_states = [np.random.RandomState(seed).get_state()
                               for seed in self.seeds]

    # remove both `n_proc` and `seeds` from `train()`
    def train(self, ...):
        if multiprocessing.cpu_count() < self.n_proc:
            raise RuntimeError("Model seeded with more cores than available."
                               " Requires {0} cores.".format(self.n_proc))
        # do similar self._mtrand_states things as in the above LdaCgsSeq
        # implementation
```
Notice that this is in line with the central design goal of vsm: fast performance in single-workstation compute environments. If you are placing models on different machines, then random state preservation is left as an exercise to the user.
This has been implemented, but test cases are not yet developed for either `LdaCgsSeq` or `LdaCgsMulti`.
The test cases should be: initialize an `LDA` with a single seed and run for 5 iterations; initialize a second `LDA` with the same seed and run for 5 iterations; check whether `m1._mtrand_state == m2._mtrand_state`. Additionally, the `LdaCgsMulti` tester will need to follow the patterns shown in this StackOverflow answer in order to run properly.
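The state-equality check can be sketched independently of vsm, using numpy's `RandomState` (whose `get_state()`/`set_state()` pair the `_mtrand_state` attributes wrap) and a `simulate_train` stand-in for the actual sampler:

```python
import numpy as np

def simulate_train(state, n_iterations=5):
    """Stand-in for train(): consume randomness starting from a saved state."""
    rng = np.random.RandomState()
    rng.set_state(state)
    for _ in range(n_iterations):
        rng.random_sample(10)  # one 'iteration' of sampling
    return rng.get_state()

seed = 42
s1 = simulate_train(np.random.RandomState(seed).get_state())
s2 = simulate_train(np.random.RandomState(seed).get_state())

# get_state() returns a tuple containing an ndarray, so compare element-wise
assert all(np.array_equal(a, b) for a, b in zip(s1, s2))
```

The real test would construct two models instead of two bare states, but the assertion at the end is the same comparison the issue describes.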
closed in v0.3 release
Still needs save+load+continue functionality and cleanup of the `seed_or_seeds` semantics in `LDA`.
Added save+load+continue functionality; now cleaning up the `seed_or_seeds` semantics in `vsm.model.LDA`.
For the model objects, generate and store a random seed at construction, and save the RNG state after each completion of `train()`. This ensures that results are fully replicable.
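That seed-and-checkpoint pattern can be sketched as follows; `SeededModel` is a hypothetical class standing in for the vsm model objects, with `random_sample()` as a placeholder for the actual Gibbs sampling:

```python
import numpy as np

class SeededModel:
    """Minimal sketch: generate and store a seed at construction, and
    checkpoint the RNG state after each completion of train() so runs
    are replicable and resumable."""

    def __init__(self, seed=None):
        if seed is None:
            seed = np.random.randint(0, 2**31 - 1)
        self.seed = seed
        self._mtrand_state = np.random.RandomState(self.seed).get_state()

    def train(self, n_iterations=5):
        rng = np.random.RandomState()
        rng.set_state(self._mtrand_state)     # resume from the stored state
        for _ in range(n_iterations):
            rng.random_sample()               # placeholder for a Gibbs sweep
        self._mtrand_state = rng.get_state()  # checkpoint after completion
```

Two models built with the same seed and trained for the same number of iterations end with identical `_mtrand_state`, which is exactly the equality the test plan above checks.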
The complete implementation will need to use
et voila! Replication!