Open guptakartik opened 7 years ago
Hi,
I was running the optnet code for MNIST classification with the default configuration for only 10 epochs. In the first couple of epochs I get the warning "qpth warning: Returning an inaccurate and potentially incorrect solution", and in subsequent iterations the loss becomes nan. Is there something obviously wrong with my configuration?
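For anyone debugging the same failure, here is a minimal sketch of how to stop at the first bad batch rather than letting the nan propagate into the weights. It assumes a standard PyTorch training loop; the `model`, `optimizer`, and `loader` below are stand-ins, not the repo's actual objects:

```python
import torch
import torch.nn.functional as F

# Stand-ins: the real experiment uses the OptNet model and the MNIST loader.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(32, 784), torch.randint(0, 10, (32,)))]

# Guard: stop at the first batch whose loss is non-finite, so the offending
# batch can be inspected before nans propagate into the parameters.
for x, y in loader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    if not torch.isfinite(loss):  # catches nan and +/-inf
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    loss.backward()
    optimizer.step()
```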
Hi, I just tried running the MNIST experiment and am hitting nans there too. It's been a while since I've run that example, and I've changed the qpth library since the MNIST experiment was last working. It looks like the solver is hitting some nans internally, causing the precision issue and bad gradients. For now you can try reverting to an older commit of qpth, one from around the time I last updated the MNIST example. I'll try to look into the internal solver issues soon.
-Brandon.
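For concreteness, one hypothetical way to pin qpth to an older commit and check what is actually installed; `<commit>` is a placeholder, since no specific hash is named above:

```python
# Outside Python, reinstall qpth at a specific older commit, e.g.:
#   pip install --force-reinstall "git+https://github.com/locuslab/qpth.git@<commit>"
# Then confirm which qpth the interpreter actually picks up:
import qpth
print(qpth.__file__)  # filesystem path of the imported package
```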
Thanks for the quick reply! I will try working with the older commit of qpth.
Hi, I tried most of the early versions of qpth, but none of them work. They fail in various ways, mostly inside qpth. Could you check which version works?
Hi Brandon, it would be really helpful if you could point us to the right version of qpth, since we have been unable to get it to work.
Hi, the nans were coming up in the backward pass in qpth, and I've pushed a fix to it here: https://github.com/locuslab/qpth/commit/e2cac495909159aae12461262d0ee540ddf9abd6
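For localizing backward-pass nans like these, PyTorch's built-in anomaly detection is handy. A minimal sketch follows; the tiny sqrt reproducer is illustrative only, not the qpth bug itself:

```python
import torch

# Debug aid: with anomaly detection enabled, backward() raises at the first
# operation whose gradient contains nan, with a traceback pointing at the
# forward op responsible. It slows training, so turn it off afterwards.
torch.autograd.set_detect_anomaly(True)

# Tiny illustrative reproducer: sqrt'(0) is inf, and inf * 0 in the chain
# rule yields nan, so this backward() raises an error naming SqrtBackward
# instead of silently producing nan gradients.
x = torch.zeros(3, requires_grad=True)
(torch.sqrt(x) * 0).sum().backward()
```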
Here's the convergence of one of my new runs (I modified `z0` and `s0` to be fixed; pull this from the latest version of this repo). Since the loss is so jumpy, the LR should probably be bumped down:
[convergence plot]
Can you try running the training again with the latest versions of this repo and qpth?
-Brandon.
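As a sketch of the lower-LR suggestion above: the optimizer class and the 1e-4 value below are assumptions, not this repo's actual defaults.

```python
import torch

model = torch.nn.Linear(784, 10)  # stand-in for the actual OptNet model

# Illustrative: drop the learning rate (here to 1e-4) to damp the jumpy
# loss, and optionally decay it further on a fixed schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... one epoch of training here ...
    scheduler.step()  # multiplies the LR by 0.1 every 10 epochs
```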