sherpa-ai / sherpa

Hyperparameter optimization that enables researchers to experiment, visualize, and scale quickly.
http://parameter-sherpa.readthedocs.io/
GNU General Public License v3.0

Running separate studies on separate but NFS-linked machines delivers the same parameters to all trials on all machines over and over #80

Closed tpanza closed 4 years ago

tpanza commented 4 years ago

Hello, I am trying to optimize the hyperparameters of 3 separate models, with each one being done on a completely separate machine. While they are completely separate models, they are made from the same codebase and are using the same hyperparameters and the same search space.

Each one is trying a RandomSearch, but there appears to be neither a search nor any randomness.

Here are screenshots from the dashboards of each machine. You can see that the parameters delivered to each trial are the same across trials and across machines (other than the first trial)!

Machine 1: [dashboard screenshot "mach1"]

Machine 2: [dashboard screenshot "mach2"]

Machine 3: [dashboard screenshot "mach3"]

All 3 machines happen to be "linked" to one another by virtue of mounting multiple common NFS shares, one of which hosts the $HOME directory. I have been sleuthing around to see how these processes might be intercommunicating, such as by caching data somewhere under $HOME, but I have yet to find anything.

Note that I am not trying to do anything related to the "parallel" feature; I just want to run 3 separate studies at the same time.

Any insight into how I can fix this would be greatly appreciated.

Environment (on all 3 machines):

LarsHH commented 4 years ago

Hi @tpanza ,

Just for me to get a better understanding, would you mind posting your code? If there is anything confidential in it, feel free to remove that part; I just want to see the Sherpa parts.

Thanks, Lars

tpanza commented 4 years ago

Hi @LarsHH ,

Thanks for your help with this. Also, I have an important data point: After noticing the above problem, I killed the studies on two of the machines, then restarted the study on the third. I thought that I could at least do these serially. However, even with just the one study running on the one machine, the parameters were still stuck on the same values, starting with the second trial.

So maybe there is not some strange problem with the simultaneous studies stomping on each other after all.

Anyway, here is my code snippet:

First, I have the list of parameter dicts in a JSON file that gets read in:

{
  "hyperparameter_search": [
      {"type": "continuous", "name": "learning_rate", "range": [0.005, 0.05], "scale": "log"},
      {"type": "continuous", "name": "dropout_rate", "range": [0.0, 0.4]},
      {"type": "discrete", "name": "conv_filter_len_list", "range": [3, 5]},
      {"type": "discrete", "name": "conv_filter_size_list", "range": [30, 100]},
      {"type": "continuous", "name": "clip", "range": [5.0, 10.0]},
      {"type": "choice", "name": "rnn_unit", "range": ["gru", "lstm"]}
  ]
}

(The JSON is read in and converted to a Bunch object called config.)
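For context, the loading step looks roughly like this (simplified sketch; the Bunch class and the config.json file name here are stand-ins, not my exact code):

# Simplified sketch of the config loading (stand-in code, not my exact implementation).
import json


class Bunch(dict):
    """Tiny dict subclass that allows attribute access, e.g. config.hyperparameter_search."""
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__


with open("config.json") as f:
    config = Bunch(json.load(f))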

Here is the Sherpa-related code snippet:

parameters = [sherpa.Parameter.from_dict(param_dict) for param_dict in config.hyperparameter_search]
algorithm = sherpa.algorithms.RandomSearch(max_num_trials=40)
stopping_rule = sherpa.algorithms.MedianStoppingRule()
study = sherpa.Study(parameters=parameters,
                     algorithm=algorithm,
                     stopping_rule=stopping_rule,
                     lower_is_better=False)  # higher is better, using avg f1
for trial in study:
    self.log.info(f"Trial {trial.id}:, {trial.parameters}")
    config.learning_rate = trial.parameters["learning_rate"]
    config.dropout_rate = trial.parameters["dropout_rate"]
    config.conv_filter_len_list = [trial.parameters["conv_filter_len_list"]]
    config.conv_filter_size_list = [trial.parameters["conv_filter_size_list"]]
    config.clip = trial.parameters["clip"]
    config.rnn_unit = trial.parameters["rnn_unit"]
    num_iterations = 3
    for iteration in range(num_iterations):
        tf.reset_default_graph()
        sess = tf.Session()
        # call functions that train the TensorFlow model
        # reports back the max_dev_avg_f_score
        study.add_observation(trial=trial,
                              iteration=iteration+1,
                              objective=max_dev_avg_f_score)
        if study.should_trial_stop(trial):
            self.log.warn(f"Stopping trial {trial.id} early. Params: {trial.parameters}")
            break
    study.finalize(trial=trial)
self.log.info(f"Completed study! Best result: {study.get_best_result()}")

tpanza commented 4 years ago

Trying again, this time adding repeat=0 to the RandomSearch.

So now the algorithm looks like this:

algorithm = sherpa.algorithms.RandomSearch(max_num_trials=40, repeat=0)

So far, this looks promising. On Trial 2, I finally got something other than this

Trial 2:, {'learning_rate': 0.010199332883165988, 'dropout_rate': 0.05147012228027723, 'conv_filter_len_list': 3, 'conv_filter_size_list': 82, 'clip': 8.655417529521165, 'rnn_unit': 'lstm'},

which is what Trial 2, and every subsequent trial, were stuck on, even across multiple separate studies!

tpanza commented 4 years ago

I think I found the cause!

First, another data point: using repeat=0 does not "fix" this. It just causes the parameter values selected for Trial 1 to get "locked in", rather than the Trial 2 values. I was so fixated on getting different values for Trial 2 that I did not initially notice that Trial 1 and Trial 2 now have the same values. So using repeat=0 just "moves" the problem to Trial 1.

Trial 1:, {'learning_rate': 0.008407094167551032, 'dropout_rate': 0.1044248974356922, 'conv_filter_len_list': 4, 'conv_filter_size_list': 94, 'clip': 7.068790142935714, 'rnn_unit': 'lstm'}
Trial 2:, {'learning_rate': 0.008407094167551032, 'dropout_rate': 0.1044248974356922, 'conv_filter_len_list': 4, 'conv_filter_size_list': 94, 'clip': 7.068790142935714, 'rnn_unit': 'lstm'}

Proposed fix:

This line, in RandomSearch.get_suggestion(), needs to change from:

if self.j == self.m:

to:

if self.j >= self.m:

Basically, with each trial self.j continues to increment to values above self.m. When we check whether self.j == self.m and self.j is already greater than self.m, we never reset self.j back to 0, so we never take another sample of the parameters.
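Here is a toy illustration of what I think is happening (stand-in code, not the actual sherpa internals), assuming self.j has already overshot self.m the way the logs suggest:

# Toy illustration of the failure mode (not the actual sherpa source): if the
# counter j is ever sitting above m when the reset check runs, an `==` test
# never fires again, while `>=` recovers and resamples.
import numpy as np


def run_trials(reset_check, num_trials=5):
    j, m, sample = 2, 1, None        # start with j already past m, as observed
    draws = []
    for _ in range(num_trials):
        if reset_check(j, m):
            sample = np.random.uniform(low=0.005, high=0.05)  # fresh draw
            j = 0
        j += 1
        draws.append(sample)
    return draws


print(run_trials(lambda j, m: j == m))  # [None, None, None, None, None] -> stuck forever
print(run_trials(lambda j, m: j >= m))  # a fresh draw on every trial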

Thoughts?

tpanza commented 4 years ago

I think I also have a good theory for the other issue: why I was seeing the same "randomly" selected parameter values across studies and even across machines.

In my code, I have a data generator class that first seeds the global RNG via np.random.seed(seed), with the default value of that seed param set to a fixed integer. Then I use np.random.shuffle() to shuffle the training data.

So when sherpa calls its sample() functions, which do, for example, numpy.random.uniform(low=self.range[0], high=self.range[1]), the pseudo-RNG has already been seeded elsewhere in my code.
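Here is a standalone demonstration of the effect (this is not sherpa code; the seed value, the array, and the range are made up):

# Standalone demo (not sherpa code): once the global NumPy RNG is re-seeded
# with a fixed value, every subsequent "random" draw is identical in every
# process, which matches what the dashboards were showing.
import numpy as np


def simulate_one_study():
    np.random.seed(42)                  # my data generator does this with a fixed default seed
    np.random.shuffle(np.arange(1000))  # shuffle the training data
    # later, the search samples a hyperparameter from the same global RNG:
    return np.random.uniform(low=0.005, high=0.05)


print(simulate_one_study())  # same value...
print(simulate_one_study())  # ...every time, on every machine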

Perhaps sherpa needs to accept a seed parameter that defaults to None and then call np.random.seed() before taking a sample?

LarsHH commented 4 years ago

Hi @tpanza ,

Regarding the seed: that sounds right, and your proposal sounds good. I think the key would be to have Sherpa create its own random number generator with a specified (or None) seed (as in https://stackoverflow.com/questions/43927333/python-multiple-random-seeds) so that it doesn't interfere with the user's random number generators.
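Something along these lines, just as a rough sketch (the class here is made up for illustration, not the actual Sherpa parameter class):

# Rough sketch (illustration only): a sampler that owns its RNG, so it is
# reproducible when a seed is given and unaffected by np.random.seed() calls
# in user code.
import numpy as np


class IsolatedUniformSampler:
    def __init__(self, low, high, seed=None):
        self.low, self.high = low, high
        self.rng = np.random.RandomState(seed)  # isolated generator, not the global one

    def sample(self):
        return self.rng.uniform(low=self.low, high=self.high)


np.random.seed(0)  # user code seeding the global RNG...
sampler = IsolatedUniformSampler(0.005, 0.05)
print(sampler.sample())  # ...does not pin this draw the way it pins np.random.uniform()

That way a user could pass a seed to get reproducible studies, while leaving it as None keeps Sherpa independent of whatever the training code does with the global RNG.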

Regarding your RandomSearch fix: that looks right. However, I'll actually remove the repeat option entirely now; I've been wanting to do this for a while since there is a dedicated Repeat wrapper anyway.

Thanks for your help on this. I'll make the corresponding PR and link it to this issue.

Best, Lars

tpanza commented 4 years ago

Thanks, @LarsHH. I have confirmed that the changes you merged in #81 have fixed this! I'll go ahead and close this issue.