Running this code twice returns different AUC scores, even though it passes a fixed rng parameter. As I understand it, this should guarantee consistent behaviour, but it does not.
Ah, good question. This stems at least in part from a misunderstanding about what the rng parameter does, which is not documented well enough.
The problem is that the rng parameter is only used to generate random numbers within the model, e.g., for creating dropout masks or adding Gaussian noise. The rng is not used to shuffle or select the training data -- theanets uses numpy for that, in the Dataset class! So in the two runs your model could be trained on a different ordering of the data.
One way you might be able to fix this is to seed the numpy random number generator (np.random.seed(42)). Could you give this a try and report back whether it gives you consistent model runs?
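For example, something like this (a minimal sketch -- the data, layer sizes, and training settings are placeholders, not from your report, and exact API details vary a bit between theanets versions):

```python
import numpy as np
import theanets

# Seed numpy's global RNG -- this is what the Dataset class uses to
# shuffle and select training batches.
np.random.seed(42)

# Placeholder data and architecture, not from the original report.
X = np.random.randn(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100).astype('int32')

net = theanets.Classifier(layers=[20, 10, 2])
net.train([X, y], algo='sgd')  # batch ordering should now be reproducible across runs
```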
Sorry, I believed I'd already answered you. Yes, setting np.random.seed worked well.
Could you add a random_state parameter to the Dataset class, used for selecting and shuffling the data, to make the code reproducible without setting the global numpy.random.seed? I would really like to avoid changing the global seed.
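Something along these lines is what I mean (a hypothetical sketch only -- these names are illustrative, not the real Dataset API):

```python
import numpy as np

class SeededDataset(object):
    """Illustrative stand-in for a Dataset that owns its random state."""

    def __init__(self, samples, labels, batch_size=32, random_state=None):
        self.samples = samples
        self.labels = labels
        self.batch_size = batch_size
        # A private RandomState avoids touching numpy's global seed.
        self.rng = np.random.RandomState(random_state)

    def __iter__(self):
        # Shuffle with the private RNG, then yield fixed-size minibatches.
        order = self.rng.permutation(len(self.samples))
        for start in range(0, len(order), self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.samples[idx], self.labels[idx]
```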
The Dataset object has actually moved to the downhill package -- could you file an issue for this over there? https://github.com/lmjohns3/downhill
The issue has been moved: https://github.com/lmjohns3/downhill/issues/3
Hi! I saw that you updated downhill. Did you forget to pass rng to the Dataset constructor (in theanets)?
I left it out to preserve simplicity for the default case. I figure if anyone is using theanets and really wants to make their runs repeatable, then they can create a downhill.Dataset instance themselves and pass it to the theanets code.
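For example (a sketch under the assumptions above -- the data is a placeholder, and exact argument names may differ between downhill/theanets versions):

```python
import numpy as np
import downhill
import theanets

# Placeholder data, not from this thread.
X = np.random.randn(100, 20).astype('float32')
y = np.random.randint(0, 2, size=100).astype('int32')

# Build the dataset yourself with an explicit rng, then hand it to theanets.
train = downhill.Dataset([X, y], batch_size=16, rng=42)

net = theanets.Classifier(layers=[20, 10, 2])
net.train(train, algo='sgd', rng=42)  # this rng covers only in-model randomness
```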