inpho / vsm

Vector Space Model Framework developed for InPhO
http://inpho.github.io/vsm

Store random seed information and state in Model objects #107

Closed JaimieMurdock closed 9 years ago

JaimieMurdock commented 9 years ago

For each model object, generate and store a random seed, and store the random state after each completed call to train. This makes results fully replicable.

The complete implementation will need to use something along these lines:

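import numpy as np

def __init__(self, ..., seed=None):
    self.seed = seed
    if self.seed is None:
        # no seed supplied: draw one that fits in 32 bits
        maxint = np.iinfo(np.uint32).max
        self.seed = np.random.randint(0, maxint)
    # no saved state until train() has completed at least once
    self._mtrand_state = None

def train(self, ...):
    random_state = np.random.RandomState(self.seed)
    if self._mtrand_state is not None:
        # resume from the state saved by the previous call to train()
        random_state.set_state(self._mtrand_state)

    # train model, drawing all random numbers from random_state

    self._mtrand_state = random_state.get_state()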

et voila! Replication!
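The pattern above can be exercised end to end with a toy class. This is only a minimal sketch: `ToyModel`, its `samples` attribute, and the `n_iterations` parameter are hypothetical stand-ins and not part of the vsm API. It checks two things: identical seeds give identical draws, and training in two steps gives the same draws as training in one.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in for a vsm model; not part of the vsm API."""

    def __init__(self, seed=None):
        self.seed = seed
        if self.seed is None:
            # Draw a seed that fits in 32 bits, as RandomState requires.
            self.seed = int(np.random.randint(0, 2**32 - 1, dtype=np.int64))
        self._mtrand_state = None
        self.samples = []

    def train(self, n_iterations):
        random_state = np.random.RandomState(self.seed)
        if self._mtrand_state is not None:
            # Continue the stream where the previous call stopped.
            random_state.set_state(self._mtrand_state)

        # Stand-in for real training: record one draw per iteration.
        for _ in range(n_iterations):
            self.samples.append(float(random_state.random_sample()))

        self._mtrand_state = random_state.get_state()


# Two models with the same seed produce the same draws.
m1 = ToyModel(seed=42)
m2 = ToyModel(seed=42)
m1.train(5)
m2.train(5)
same_seed_matches = m1.samples == m2.samples

# Training in two steps of 5 matches training once for 10.
m3 = ToyModel(seed=42)
m3.train(10)
m1.train(5)
resumed_matches = m1.samples == m3.samples
```

Building a fresh `RandomState` in each `train()` call and then restoring the saved state keeps the generator local to the model, so other code that uses the global `np.random` functions cannot disturb the stream.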

JaimieMurdock commented 9 years ago

Implementation note: the API will change so that seeds are passed to Model() instead of Model.train(). We will need to do a version bump to 0.3 soon, and I will include this as the first alpha.

JaimieMurdock commented 9 years ago

After a conversation with Robert about the implementation for LdaCgsMulti, I think I've arrived at the following set of refactors:

import multiprocessing
import numpy as np

class LdaCgsMulti(LdaCgsSeq):
    # move both `n_proc` and `seeds` to `__init__`
    def __init__(self, ..., n_proc=2, seeds=None):
        self.n_proc = n_proc
        self.seeds = seeds
        if self.seeds is None:
            # randomly init a seed for each n_proc
            maxint = np.iinfo(np.uint32).max
            self.seeds = [np.random.randint(0, maxint)
                          for _ in range(self.n_proc)]
        # one saved generator state per process
        self._mtrand_states = [np.random.RandomState(seed).get_state()
                               for seed in self.seeds]

    # remove both `n_proc` and `seeds` from `train()`
    def train(self, ...):
        if multiprocessing.cpu_count() < self.n_proc:
            raise RuntimeError("Model seeded with more cores than available." +
                               " Requires {0} cores.".format(self.n_proc))
        # do similar self._mtrand_states things as in the above LdaCgsSeq implementation

Notice that this is in line with the central design goal of vsm: fast performance in single-workstation compute environments. If you are placing models on different machines, then random state preservation is an exercise left to the user.
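As a rough illustration of the per-process bookkeeping, the sketch below keeps one seed and one saved state per worker and hands each worker its own state on every pass. The helpers `init_states` and `run_worker` are hypothetical, and the workers run sequentially here for brevity rather than through `multiprocessing`, so this shows only the state handling, not the parallel sampler.

```python
import numpy as np


def init_states(n_proc, seeds=None):
    """Return (seeds, states) with one entry per worker (hypothetical helper)."""
    if seeds is None:
        seeds = [int(np.random.randint(0, 2**32 - 1, dtype=np.int64))
                 for _ in range(n_proc)]
    if len(seeds) != n_proc:
        raise ValueError("Expected one seed per process.")
    states = [np.random.RandomState(seed).get_state() for seed in seeds]
    return seeds, states


def run_worker(state, n_draws):
    """Stand-in for one worker's share of a training pass."""
    random_state = np.random.RandomState()
    random_state.set_state(state)
    draws = random_state.random_sample(n_draws).tolist()
    return draws, random_state.get_state()


seeds, states = init_states(n_proc=2, seeds=[1, 2])
first = [run_worker(state, 3) for state in states]
# Keep each worker's final state so the next pass continues its stream.
states = [state for _, state in first]
second = [run_worker(state, 3) for state in states]

# Reference: each worker's stream drawn in a single pass.
reference = [np.random.RandomState(seed).random_sample(6).tolist()
             for seed in seeds]
combined = [a[0] + b[0] for a, b in zip(first, second)]
```

Because each worker owns its stream, results depend on `n_proc` and on which seed goes to which worker, which is one reason to fix both at construction time.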

JaimieMurdock commented 9 years ago

This has been implemented, but test cases are not yet developed for either LdaCgsSeq or LdaCgsMulti.

The test cases should be:

  1. Initialize LDA with a single seed. Run for 5 iterations. Initialize second LDA with the same seed. Run for 5 iterations. Check if m1._mtrand_state == m2._mtrand_state.
  2. Same as above, but with random seed generation on the first model, pass that random seed to the second model.
  3. Create 2 models, ensure random seeds are different. Train and ensure states are different.
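The three cases above might look roughly like the following. This is a sketch under assumptions: `train_state` is a hypothetical stand-in for "build a model with this seed and train it", since the real constructor arguments are not shown in this thread, and the third case uses two fixed, distinct seeds to keep the test deterministic. Note that `get_state()` returns a tuple containing a NumPy array, so comparing two states with a plain `==` raises a `ValueError`; the helper compares the pieces explicitly.

```python
import numpy as np


def states_equal(a, b):
    """Compare two tuples returned by RandomState.get_state()."""
    # Element 1 is a NumPy array and needs np.array_equal; the remaining
    # elements are plain Python values.
    return a[0] == b[0] and np.array_equal(a[1], b[1]) and a[2:] == b[2:]


def new_seed():
    """Generate a seed the way a model would when none is passed."""
    return int(np.random.randint(0, 2**32 - 1, dtype=np.int64))


def train_state(seed, n_iterations=5):
    """Hypothetical stand-in for training a model and reading its state."""
    random_state = np.random.RandomState(seed)
    for _ in range(n_iterations):
        random_state.random_sample()
    return random_state.get_state()


def test_same_seed():
    assert states_equal(train_state(7), train_state(7))


def test_generated_seed_reused():
    seed = new_seed()
    assert states_equal(train_state(seed), train_state(seed))


def test_different_seeds():
    seed_a, seed_b = 7, 8
    assert seed_a != seed_b
    assert not states_equal(train_state(seed_a), train_state(seed_b))
```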

Additionally, the LdaCgsMulti tester will need to follow the patterns shown in this StackOverflow answer in order to properly run.

JaimieMurdock commented 9 years ago

closed in v0.3 release

JaimieMurdock commented 9 years ago

Still needs save+load+continue functionality and a cleanup of the seed_or_seeds semantics in LDA.

JaimieMurdock commented 9 years ago

Added save+load+continue functionality; now cleaning up the seed_or_seeds semantics in the vsm.model.LDA usability class.
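For readers wondering what save+load+continue amounts to for the random state, here is a minimal sketch. It assumes the seed and state are written with `pickle` to a temporary file; vsm's actual save format is not shown in this thread, and the file name and dictionary keys are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

random_state = np.random.RandomState(123)
first_half = random_state.random_sample(4).tolist()

# Save the seed and the generator state after the first training run.
path = os.path.join(tempfile.mkdtemp(), "model_state.pkl")
with open(path, "wb") as f:
    pickle.dump({"seed": 123, "mtrand_state": random_state.get_state()}, f)

# Later: load the state and continue the same stream.
with open(path, "rb") as f:
    saved = pickle.load(f)
resumed = np.random.RandomState(saved["seed"])
resumed.set_state(saved["mtrand_state"])
second_half = resumed.random_sample(4).tolist()

# Reference: the same stream drawn without interruption.
uninterrupted = np.random.RandomState(123).random_sample(8).tolist()
```

If the saved state is restored correctly, `first_half + second_half` equals `uninterrupted`, which is the property a save+load+continue test would want to check.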