Examples are supposed to cover common use cases. If you do expect your experiments to overflow a 32-bit integer, then yes, please explicitly set it to int64.
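A minimal sketch of what that looks like, assuming the `batch * batch_size` pattern from the examples (the initial rate and decay parameters here are placeholder values):

```python
import tensorflow as tf

# Minimal sketch (assumed TF 1.x API): declare the step counter as int64
# so that batch * batch_size cannot wrap a 32-bit integer on long runs.
batch = tf.Variable(0, dtype=tf.int64, trainable=False, name="batch")

learning_rate = tf.train.exponential_decay(
    0.01,          # initial learning rate (placeholder value)
    batch * 128,   # step counter; int64, so the product cannot overflow
    100000,        # decay_steps (placeholder value)
    0.95,          # decay_rate (placeholder value)
    staircase=True)
```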
Hmm, this might be an indication that what I'm doing is unreasonable :-) But my understanding is that "more data is better", and that 12-hour training runs are not super uncommon in the world of DNNs. (My inexperience may show through here!) I had actually (obviously, incorrectly) assumed that the type of that Variable would have been some kind of floating-point datatype, given that it wasn't specified in the documentation.
Given how much detail the rest of the (wonderful!) documentation goes into about gotchas and issues one might hit, I respectfully disagree: I tripped over this within only a week or so of using the tool, it cost me about 10 hours of training time and a day's worth of results, and highlighting it could help new users think through what exactly is going on in the examples.
But, it's your project! Either way, thanks for the consideration.
As far as I can tell, `tf.train.exponential_decay` examples seem to use a 32-bit signed number for `global_step`, because the batch number is also a `tf.int32`. This means that long runs can result in unpleasant surprises with learning rates, which makes for a frustrating experience for new users. [1] Examples should be updated to initialize `batch` with `dtype=tf.int64`.

I originally believed this to be a `tf.train.exponential_decay` bug, and wrote some code to minimize the issue. I now understand that it comes from the variable's dtype, but you can have the repro case below anyway, because I think it makes it a little more obvious what kind of thing can happen. When you run it, you'll note a discontinuity in the learning rate between epoch 85 and 86 that comes from `batch * batch_size` overflowing (and results in losing an evening's training...).

[1] No, I did not checkpoint midway through. Yes, I have learned my lesson.