Open Jugaris89 opened 4 years ago
I think this is because, for every single update of the architecture parameters alpha, you would have to train the model weights to convergence, and only then perform one architecture update. Instead, to reduce the cost, you can alternate the two: one weight update, one architecture update, one weight update, one architecture update, and so on until convergence. This greatly reduces the time needed. Think of it as something like stochastic gradient descent, where you do not always step in the direction of the exact gradient, but after training for long enough you still end up near a minimum.
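A rough sketch of that alternating, single-step scheme (not the authors' code; a toy illustration with made-up loss functions, tensors and hyperparameters, using the first-order variant where the current w is plugged in directly rather than the one-step unrolled w'):

```python
import torch

# Toy stand-ins: in DARTS, w would be the network weights and alpha the
# architecture (operation-mixing) parameters. Names, losses and learning
# rates here are invented purely for illustration.
w = torch.randn(10, requires_grad=True)
alpha = torch.zeros(3, requires_grad=True)

w_opt = torch.optim.SGD([w], lr=0.025, momentum=0.9)
alpha_opt = torch.optim.Adam([alpha], lr=3e-4)

def mixed_output(w, alpha):
    # Softmax-weighted mixture of candidate "operations", loosely mimicking a DARTS cell.
    mix = torch.softmax(alpha, dim=0)                    # (3,)
    candidates = torch.stack([w, w ** 2, torch.sin(w)])  # (3, 10)
    return (mix[:, None] * candidates).sum(dim=0)        # (10,)

def train_loss(w, alpha):
    # Placeholder for L_train(w, alpha).
    return ((mixed_output(w, alpha) - 0.5) ** 2).mean()

def val_loss(w, alpha):
    # Placeholder for L_val(w, alpha); in practice computed on held-out data.
    return ((mixed_output(w, alpha) - 0.5) ** 2).mean()

for step in range(200):
    # 1) Architecture step: one update of alpha on the validation loss, using the
    #    current w instead of the fully trained w*(alpha) (first-order approximation).
    alpha_opt.zero_grad()
    val_loss(w.detach(), alpha).backward()
    alpha_opt.step()

    # 2) Weight step: one update of w on the training loss with alpha held fixed.
    w_opt.zero_grad()
    train_loss(w, alpha.detach()).backward()
    w_opt.step()
```

The second-order version described in the paper instead evaluates the validation loss at w' = w - xi * grad_w L_train(w, alpha), i.e. one virtual training step ahead of the current weights, which is what the quoted sentence about a "single training step" refers to.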
Hi,
great work!! I have a short question about a couple of sentences in the paper, where it says: "The inner optimization argmin_w L_train(w, alpha) can be expensive" and "The idea is to approximate w*(alpha) by adapting w using only a single training step, without solving the inner optimization completely by training until convergence".
Why is a single training step enough? And do you have an estimate of how costly it would be to solve the inner optimization by training until convergence, which motivates the approximation proposed in the paper?
Thanks in advance, and congratulations again on this work!