We're currently using adaptive SGD (mostly RMSprop) and hoping that the default parameters work for us. Many recent (and not-so-recent) deep learning papers have a schedule for decreasing the learning rate over time, often phrased something like "halve the learning rate every 50 epochs". An extreme case is Densely Connected Convolutional Networks, where each 10x learning rate drop is accompanied by a major decrease in loss.
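To make the "halve the learning rate every 50 epochs" style of schedule concrete, here's a minimal step-decay sketch in plain Python (the names are illustrative and not tied to our code or any particular framework):

```python
def step_decay(initial_lr, drop_factor, drop_every, epoch):
    """Step-decay schedule: multiply the learning rate by
    `drop_factor` once every `drop_every` epochs."""
    return initial_lr * (drop_factor ** (epoch // drop_every))

# "Halve the learning rate every 50 epochs", starting from 0.01:
# epochs 0-49 -> 0.01, 50-99 -> 0.005, 100-149 -> 0.0025, ...
lr = step_decay(0.01, 0.5, 50, epoch=120)  # 0.0025
```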
I imagine the following three parameters would be a useful addition to the cross-validation loop used to determine optimal hyperparameters:
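For example, assuming the three parameters are the initial learning rate, the decay factor, and the decay interval in epochs (my guess at the natural parameterisation; the names and values below are purely illustrative), they could be enumerated alongside the existing search roughly like this:

```python
import itertools

# Illustrative grid over the three schedule hyperparameters:
grid = {
    "initial_lr":  [1e-2, 1e-3, 1e-4],  # starting learning rate
    "drop_factor": [0.5, 0.1],          # e.g. halve, or drop by 10x
    "drop_every":  [25, 50],            # epochs between drops
}

for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    # Plug `params` into the existing cross-validation loop here;
    # `train_and_validate` is a stand-in, not an existing function.
    # score = train_and_validate(**params)
    print(params)
```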
I played around with the learning rate informally and did not see opportunities for easy wins, although I didn't explore the full space of decay schedules, etc. Closing for now.