tsaastam opened this issue 4 years ago (status: Open)
Thanks for the details, & especially the observation that the stagnant loss-total is exactly 128*1024*1024
(2^27) – that is odd, and likely relevant. (It's also good to be able to see that your corpus is an in-memory list – so it's not some iteration-from-elsewhere that's malfunctioning – and your model parameters aren't peculiar.)
Unfortunately the loss-calculation feature was only half-thought-out, & incompletely implemented, with pending necessary fixes & improvements (per #2617).
It'd be interesting to know if in your setup that reproduces the issue:
actual training is still occurring – for example, do raw vectors-in-training change from epoch-to-epoch, even after loss stagnates?
if you manually reset the cumulative tally to zero in your handler – either after you notice it stagnating, or even after each epoch – does that restore the expected growth-per-epoch? (A minimal sketch of such a reset is just below.)
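For reference, a minimal sketch of that kind of reset as an epoch-end callback (this assumes the tally lives in the model's running_training_loss attribute, as discussed further down this thread):

from gensim.models.callbacks import CallbackAny2Vec

class ResetLossCallback(CallbackAny2Vec):
    """Report the tally at each epoch's end, then zero it so the next epoch starts fresh."""
    def __init__(self):
        self.epoch = 1
    def on_epoch_end(self, model):
        print(f"Loss tallied during epoch {self.epoch}: {model.get_latest_training_loss()}")
        model.running_training_loss = 0.0   # the model's internal tally (see discussion below)
        self.epoch += 1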
OK, looks like the original implementation of loss tracking chose to use a 32-bit float. Limited-precision floating-point numbers of course become 'coarser' as they get further from 0.0, and by the time they reach 2^27, the immediate-next-representable number is already far more than 1.0 away. As a result, tallying more small numbers into this total won't have any effect.
Representative weirdness:
In [1]: import numpy as np
In [2]: a = np.ndarray(1, dtype=np.float32)
In [3]: a[0] = 134217728.0
In [4]: a[0] # it's already an unexpected displayed value
Out[4]: 134217730.0
In [5]: a[0] = a[0] + 2.0
In [6]: a[0] # adding a small value did nothing
Out[6]: 134217730.0
In [7]: np.nextafter(a[0], np.finfo(np.float32).max) # next possible larger float32 is 10 higher
Out[7]: 134217740.0
In [8]: np.nextafter(a[0], 0) # next possible smaller float32
Out[8]: 134217727.99999999
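(A side note on the session above: float32 values are displayed with only about eight significant digits, so the printed "10 higher" understates the true gap. A quick check of the actual spacing, assuming NumPy:)

import numpy as np

x = np.float32(2 ** 27)             # 134217728.0 is exactly representable in float32
print(np.spacing(x))                # 16.0 – the gap up to the next larger float32
print(x + np.float32(2.0) == x)     # True – an increment of 2.0 is rounded away entirely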
Fixing the existing implementation to be a per-epoch tally would make the problem far less likely to occur (but a sufficiently large epoch might still trigger it). Using a highest-precision type – why not np.float128? – for the loss tally might make a recurrence unthinkable. But the whole flaky feature needs a re-look to match real user needs.
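(For comparison, a wider accumulator keeps small increments representable far past 2^27 – a quick illustration; note that np.float128 availability and precision are platform-dependent:)

import numpy as np

for dt in (np.float32, np.float64):
    acc = dt(2 ** 27)
    print(dt.__name__, acc + dt(2.0) - acc)   # float32: 0.0 (increment lost), float64: 2.0 (kept)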
@tsaastam, given this, I'd expect the answers to my "interesting to know" bullet-points are...
yes, further training is still happening
manually resetting the tally each epoch-end should serve as a workaround in most cases
...but it'd be good to hear if that's the case for you.
I've updated the code to this to better see what's going on:
import time                                     # assumed imports (not shown in the original snippet)
from numpy import linalg
from gensim.models.callbacks import CallbackAny2Vec

class MyLossCalculator(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_losses = []
        self.previous_epoch_time = time.time()

    def on_epoch_end(self, model):
        cumu_loss = model.get_latest_training_loss()
        loss = cumu_loss if self.epoch <= 1 else cumu_loss - self.cumu_losses[-1]
        norms = [linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        print(f"Loss after epoch {self.epoch}: {loss} / cumulative loss: {cumu_loss} " +
              f" -> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: " +
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms) / len(norms)), 2)}, {round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(loss)
        self.cumu_losses.append(cumu_loss)
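(For completeness, a sketch of how a callback like this gets wired into training – the corpus name and hyperparameter values below are placeholders, not the exact setup used here:)

from gensim.models import Word2Vec

loss_calc = MyLossCalculator()
model = Word2Vec(size=100, window=5, min_count=5, workers=6, sg=1, compute_loss=True)  # gensim 3.x keyword names
model.build_vocab(sentences)   # `sentences` is a placeholder: a list of token lists
model.train(sentences, total_examples=model.corpus_count, epochs=60,
            compute_loss=True, callbacks=[loss_calc])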
Annoyingly, on this laptop I'm using right now, the loss stays at a very high level for the first 40 epochs, at around 1.8 million, whereas on the PC it had gone down to about 680k by epoch 40. The only differences are the updated loss calculator as above and the number of workers.
Anyway, assuming that's all benign, the good news is that training is still occurring after the reported loss goes to zero:
Loss after epoch 38: 1806704.0 / cumulative loss: 132444312.0 -> epoch took 51.58 s - vector norms min/avg/max: 0.02, 5.17, 13.53
Loss after epoch 39: 1773416.0 / cumulative loss: 134217728.0 -> epoch took 52.38 s - vector norms min/avg/max: 0.02, 5.2, 13.59
Loss after epoch 40: 0.0 / cumulative loss: 134217728.0 -> epoch took 52.39 s - vector norms min/avg/max: 0.02, 5.22, 13.64
Loss after epoch 41: 0.0 / cumulative loss: 134217728.0 -> epoch took 52.34 s - vector norms min/avg/max: 0.02, 5.25, 13.68
Loss after epoch 42: 0.0 / cumulative loss: 134217728.0 -> epoch took 52.27 s - vector norms min/avg/max: 0.02, 5.28, 13.71
Loss after epoch 43: 0.0 / cumulative loss: 134217728.0 -> epoch took 52.21 s - vector norms min/avg/max: 0.02, 5.3, 13.76
Your explanation makes sense - it didn't look like an overflow at first glance, as the loss changes pretty massively between epochs; but of course the cumulative loss is computed with lots of tiny updates, each of which is getting ignored as you've detailed. So that seems to be the cause.
I'm not sure why the training is much slower on the laptop (i.e. the loss is much higher after 40 full epochs), but that probably isn't related to this. In case it matters, here are the versions of things on the laptop:
Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
NumPy 1.17.5
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 1
Not sure why the NumPy and SciPy versions are slightly different (on Windows they were 1.17.3 and 1.3.1).
@tsaastam If you try setting model.running_training_loss to 0.0 on each epoch end, does tallying continue to work into your later epochs?
(Also, separate from this bug: that's a lot of epochs! The last few epoch-deltas before the problem already show epoch loss jittering up-and-down; you may already be past the point, possibly far past the point, where more epochs are doing any good.)
You're probably right about additional epochs not helping; my concern though is that on the Windows machine the loss drops fairly quickly from about 6 million to around 700k - here between epochs 8 and 9:
[chart from the original issue: per-epoch loss on the Windows PC, dropping sharply between epochs 8 and 9, with a spurious drop to zero at the end]
(That image is with the bug, so the drop at the end is due to the cumulative loss suddenly dropping to zero - ignore that part.)
Anyway, on the laptop, with the same data, the loss is still staying at around 1.8 million by epoch 39. Of course I now realise it's not quite the same code, since on the laptop I was running the version that measures the magnitude of the word vectors after each epoch... but that shouldn't interfere with the training? I might need to investigate this seeming training discrepancy a bit more, then maybe open a separate issue about it if it seems like a real thing.
On the loss issue: taking your advice and adding model.running_training_loss = 0.0 to the end of the loss-calculating method (and removing the cumulative loss stuff as it's no longer needed), the problem indeed seems to be resolved. Here's my loss calc now (the custom cumulative loss isn't really needed except for some debugging convenience):
class MyLossCalculator(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_loss = 0.0
        self.previous_epoch_time = time.time()

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        norms = [linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        self.cumu_loss += float(loss)
        print(f"Loss after epoch {self.epoch}: {loss} (cumulative loss so far: {self.cumu_loss}) " +
              f"-> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: " +
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms) / len(norms)), 2)}, {round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(float(loss))
        model.running_training_loss = 0.0
And (running on the laptop again), the loss updates now seem to work fine:
Building vocab...
Vocab done. Training model for 100 epochs, with 6 workers...
Loss after epoch 1: 35497984.0 (cumulative loss so far: 35497984.0) -> epoch took 62.76 s - vector norms min/avg/max: 0.02, 1.69, 9.03
Loss after epoch 2: 34189408.0 (cumulative loss so far: 69687392.0) -> epoch took 57.72 s - vector norms min/avg/max: 0.02, 2.17, 9.79
Loss after epoch 3: 33939008.0 (cumulative loss so far: 103626400.0) -> epoch took 59.45 s - vector norms min/avg/max: 0.02, 2.5, 10.18
Loss after epoch 4: 33727968.0 (cumulative loss so far: 137354368.0) -> epoch took 63.11 s - vector norms min/avg/max: 0.02, 2.76, 10.22
Loss after epoch 5: 33630772.0 (cumulative loss so far: 170985140.0) -> epoch took 53.8 s - vector norms min/avg/max: 0.02, 2.97, 10.2
Loss after epoch 6: 33635360.0 (cumulative loss so far: 204620500.0) -> epoch took 61.26 s - vector norms min/avg/max: 0.02, 3.15, 10.08
After epoch 5 there, the accumulated loss is around 170 million, which is of course more than the 134ish million where the problem occurred before. So your workaround is good.
(The loss still goes down much more slowly here than on the Windows PC earlier, as I said I need to investigate that a bit more.)
It's good to know that running model.running_training_loss = 0.0 at each epoch-end, and thus essentially modifying the model's tally to be per-epoch rather than multi-epoch, improves the issue for you. (It may not be a total fix, as larger datasets may still suffer from imprecise or stalled within-epoch loss tallies from this same limit.)
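(A toy illustration of that within-epoch risk – this is plain sequential float32 accumulation, not the actual gensim tallying code:)

import numpy as np

# 30 million 'per-example losses' of 1.0 each, accumulated sequentially in float32
losses = np.ones(30_000_000, dtype=np.float32)
running = np.cumsum(losses)          # cumsum keeps the float32 dtype and accumulates sequentially
print(running[-1])                   # 16777216.0 (2**24), not 30000000.0 – the tally stalled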
Not sure what could be causing your other possibly-anomalous behavior – that very slight loss improvement in your latest "on the laptop again" figures seems fishy, especially for early epochs, where the effective alpha is high and the improvement from the original random initialization of each example/epoch should be large – but I'll leave that to you to dig into; if you have lingering issues, you might want to post an update or new message to the discussion list.
@tsaastam Maybe the dtype of model.running_training_loss changes while model.running_training_loss is being calculated. Can you check its type?
@Cartman0 What do you mean? Are you encountering an error or expecting some efficiency problem? (Behind the scenes, I believe the value in the Python object is being copied into a C structure for the Cython code, then copied back out after that code tallies all the tiny errors. So that C type, still just a 32-bit float, will be most relevant for the tallying behavior. That's probably still too coarse overall for accurate & robust loss-reporting – so this per-epoch reset to 0.0 isn't a general fix for the underlying problem – but it helps a bit in this case.)
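(For reference – this is my reading of the gensim 3.x source rather than anything guaranteed – get_latest_training_loss() is just an accessor for that same attribute, which is why assigning 0.0 to it works as a reset:)

# Roughly what gensim 3.x does:
def get_latest_training_loss(self):
    return self.running_training_loss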
@gojomo I thought there might be a type problem because of the nested use of np.sum, np.log, and scipy.special.expit in calculating model.running_training_loss. Is there no type problem there?
@Cartman0 I'm not sure what you mean here by 'type problem'. What chain-of-operations could lead to a bad result? (It's possible there's a problem – as the use of float32 for a large long-running tally demonstrates, this code wasn't written with the deepest analysis – but I don't know what error/mismatch is your concern.)
@gojomo Sorry for writing that confusingly. I also suspect some cause such as information loss from repeatedly accumulating into a 32-bit float, given the max at 2^27. But as I understand it, a Python float object has 64-bit precision, so model.running_training_loss could be 64-bit. Could an unintended implicit type conversion be happening in the calculation of model.running_training_loss and its chain of operations?
@Cartman0 The actual loss computation, and tallying, occurs in the Cython code – which is compiled to C/C++ and has a fixed float32 type, no matter what Python type is used for the Python model.running_training_loss object. For example, it's copied into a C structure and then copied out (and my guess is that the problem in #2743 is caused by multiple threads doing this without regard to each other's interim updates).
@gojomo OK, thanks. I'm not familiar with Cython.
Are the functions that actually compute the loss w2v_fast_sentence_sg_hs and w2v_fast_sentence_cbow_hs in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec_inner.pyx ?
@Cartman0 Yes, those 2 functions, and also the negative-sampling versions of those functions, w2v_fast_sentence_sg_neg & w2v_fast_sentence_cbow_neg.
@gojomo Setting model.running_training_loss to 0.0 at an epoch's end seems to affect the training.
Sample callback:
import logging                                          # assumed imports (not shown in the original snippet)
from gensim.models.callbacks import CallbackAny2Vec

class LossReportCallback(CallbackAny2Vec):
    def __init__(self, reset_loss=False):
        self.epoch = 1
        self.previous_cumulative_loss = 0
        self.reset_loss = reset_loss

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.reset_loss:
            model.running_training_loss = 0.0
        else:
            loss_now = loss - self.previous_cumulative_loss
            self.previous_cumulative_loss = loss
            loss = loss_now
        if self.epoch % 5 == 0:
            logging.info(f'loss after epoch {self.epoch}: {loss}')
        self.epoch += 1
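(A hedged sketch of how such a callback would be wired in – the corpus name, logging setup, and hyperparameters here are placeholders, not the actual configuration behind the outputs below:)

import logging
from gensim.models import Word2Vec

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
model = Word2Vec(sentences,                     # `sentences`: placeholder list of token lists
                 size=100, window=5, min_count=5, workers=4,
                 iter=50, compute_loss=True,
                 callbacks=[LossReportCallback(reset_loss=True)])   # or reset_loss=False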
The output if executed with reset_loss=False
INFO loss after epoch 5: 482677.875
INFO loss after epoch 10: 439727.5
INFO loss after epoch 15: 439261.0
INFO loss after epoch 20: 406250.0
INFO loss after epoch 25: 397587.0
INFO loss after epoch 30: 395241.0
INFO loss after epoch 35: 392499.0
INFO loss after epoch 40: 338352.0
INFO loss after epoch 45: 322680.0
INFO loss after epoch 50: 322566.0
The output if executed with reset_loss=True
INFO loss after epoch 5: 499439.3125
INFO loss after epoch 10: 498720.34375
INFO loss after epoch 15: 497191.375
INFO loss after epoch 20: 495352.28125
INFO loss after epoch 25: 501985.4375
INFO loss after epoch 30: 495859.59375
INFO loss after epoch 35: 493470.78125
INFO loss after epoch 40: 493170.25
INFO loss after epoch 45: 492609.6875
INFO loss after epoch 50: 496730.125
> @gojomo Setting model.running_training_loss to 0.0 at an epoch's end seems to affect the training.
Yes, that workaround is absolutely expected to change the reported loss numbers, as precision will no longer be lost due to the tally reaching representational extremes. Improving the loss numbers is the whole point of the workaround. You are reporting suggestive evidence that the workaround works.
It shouldn't have any effect on the quality of training results, as this tally (whether for all-epochs or one-epoch) isn't consulted for any model-adjustment steps.
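(One way to check that claim empirically – a sketch assuming a small placeholder corpus named sentences; exact reproducibility also needs workers=1, a fixed seed, and a fixed PYTHONHASHSEED:)

import numpy as np
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class ResetLoss(CallbackAny2Vec):
    def on_epoch_end(self, model):
        model.running_training_loss = 0.0   # only touches the reporting tally

# `sentences` is a placeholder corpus (list of token lists)
kwargs = dict(size=50, min_count=1, workers=1, seed=1, iter=5, compute_loss=True)
m_plain = Word2Vec(sentences, **kwargs)
m_reset = Word2Vec(sentences, callbacks=[ResetLoss()], **kwargs)
print(np.allclose(m_plain.wv.vectors, m_reset.wv.vectors))   # expected True (with PYTHONHASHSEED fixed)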
pnezis reports that the epoch-wise loss changes when model.running_training_loss is reset to 0 at the end of each epoch, and I get a similar result.
If we do not reset it, the model loss continues to decrease and it looks like successful training. But if we reset it, the model loss stagnates at around 490,000–500,000 in his example.
Doesn't this mean that model training (or parameter updating) is affected by model.running_training_loss?
Or can we still get a good model in the latter case (model.running_training_loss = 0) even if the loss stops decreasing beyond a certain value?
@DaikiTanak - The running loss tally is very buggy without a per-epoch reset. For large enough training sets, it might also suffer precision issues in a single epoch.
But the actual training that happens, on individual (context->word) examples, is the same either way. That's not affected by this running loss tally in any way. Only the reporting-out is changing. And any reported-out tally of aggregate loss is not a measure of model quality, only model 'convergence' (reaching a point where it can't, given its structure/state, be optimized any more). A model with a higher loss tally might be better on real world problems; an embedding model with a 0.0 loss is likely broken (severely overfit).
@gojomo Thank you for the kind explanation. I understand that the reported loss can be used to judge model convergence.
So as a best practice, we should reset the running loss with model.running_training_loss = 0 at the end of each epoch, and watch the epoch-wise losses to judge whether model training has converged.
For example, from the figure above (epoch vs. epoch-wise loss, with model.running_training_loss reset to 0), we can say that the model may converge at epoch 25 or so.
All I'd say for sure is that resetting each epoch is better than not. As mentioned, large-enough epochs might show the same bug within a single epoch. Not knowing what code/data generated that graph, it's hard for me to endorse any idea of what it means. (The gradual trend 'up' from x=50 to x=200 is suspicious.)
Thanks @gojomo, the graph above was generated by the following code. I also think the gradual upward trend is suspicious and may be caused by some bug.
callback code
import copy                                             # assumed imports (not shown in the original snippet)
import time
import numpy as np
import matplotlib.pyplot as plt
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback for Word2vec with resetting loss on the end of each epoch.'''
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_loss = 0.0
        self.previous_epoch_time = time.time()
        self.best_model = None
        self.best_loss = 1e+30

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        norms = [np.linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        self.cumu_loss += float(loss)
        print(f"Loss after epoch {self.epoch}: {loss} (cumulative loss so far: {self.cumu_loss}) " +
              f"-> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: " +
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms) / len(norms)), 2)}, {round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(float(loss))
        # reset loss inside model
        model.running_training_loss = 0.0
        if loss < self.best_loss:
            self.best_model = copy.deepcopy(model)
            self.best_loss = loss
        if self.epoch % 50 == 0:
            self.plot(path="../model/word2vec/w2v_training_loss.png")

    def plot(self, path):
        fig, (ax1) = plt.subplots(ncols=1, figsize=(6, 6))
        ax1.plot(self.losses, label="loss per epoch")
        plt.legend()
        plt.savefig(path)
        plt.close()
        print("Plotted loss.")
training code
from gensim.models import word2vec   # assumed import (not shown in the original snippet)

model = word2vec.Word2Vec(
    size=100,
    min_count=1,
    window=5,
    workers=4,
    sg=1,
    seed=46,
    iter=200,
    compute_loss=True,
)
print("building vocabulary...")
# sentence corpus is like : [["this", "is", "a", "dog"], ["he", "is", "a", "student"], ... ]
model.build_vocab(sentence_corpus)
print("training Word2Vec...")
callbacker = callback()
model.train(
    sentence_corpus,
    epochs=model.iter,
    total_examples=model.corpus_count,
    compute_loss=True,
    callbacks=[callbacker],
)
I don't see any specific reason the per-epoch loss might be trending up in your code, but a few other notes: (1) for reasons previously alluded to, and the fact that the tally doesn't include the effects of late-in-epoch adjustments on early-in-epoch examples, the model with the lowest end-of-epoch loss tally is not necessarily 'best'; (2) I've never actually used .deepcopy() on a Word2Vec model, so there's some small chance that might not yield a truly independent copy; (3) min_count=1 is usually a bad idea with the word2vec algorithm, as low-frequency words don't get good word-vectors themselves and do interfere with other words' improvements.
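(If the deepcopy concern in (2) is a worry, a lighter-weight alternative is to checkpoint to disk inside the callback – a sketch with a placeholder path, keeping in mind point (1) that the lowest-loss epoch isn't necessarily the best model:)

from gensim.models.callbacks import CallbackAny2Vec

class CheckpointBestCallback(CallbackAny2Vec):
    """Reset the loss tally each epoch and checkpoint the model whenever the per-epoch loss improves."""
    def __init__(self):
        self.best_loss = float("inf")
    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        model.running_training_loss = 0.0          # per-epoch reset, as discussed above
        if loss < self.best_loss:
            self.best_loss = loss
            model.save("w2v_best_epoch.model")     # placeholder path; reload with Word2Vec.load()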
@gojomo @tsaastam @DaikiTanak @pnezis Hi! I have the same question: right from the start of training I get a loss of 134217728.0 after each epoch (constantly). What solution was found in the end?
@loveis98 The bugs limiting the usefulness/interpretability of the Word2Vec loss-reporting remain until the matters described in #2617 (and related to this and other bugs) are truly addressed. The initial work in #2922 might be a starting basis for some real fixes, but there's no one prioritizing or working on this at the moment, afaik.
Manually resetting the tally to 0.0 before each epoch (which you've not mentioned whether you're yet doing) may help the per-epoch readout work better – but larger runs will still face imprecision/maxing-out risks within a single epoch.
Cumulative loss of word2vec maxes out at 134217728.0
I'm training a word2vec model with 2,793,404 sentences / 33,499,912 words, vocabulary size 162,253 (words with at least 5 occurrences).
Expected behaviour: with compute_loss=True, gensim's word2vec should compute the loss in the expected way.
Actual behaviour: the cumulative loss seems to be maxing out at 134217728.0:
And it stays at 134217728.0 thereafter. The value 134217728.0 is of course exactly 128*1024*1024, which does not seem like a coincidence.
Steps to reproduce
My code is as follows:
The data is a news article corpus in Finnish; I'm not at liberty to share all of it (and anyway it's a bit big), but it looks like one would expect:
Versions
The output of:
is:
Finally, I'm not the only one who has encountered this issue. I found the following related links:
https://groups.google.com/forum/#!topic/gensim/IH5-nWoR_ZI
https://stackoverflow.com/questions/59823688/gensim-word2vec-model-loss-becomes-0-after-few-epochs
I'm not sure if this is only a display issue and the training continues normally even after the cumulative loss reaches its "maximum", or if the training in fact stops at that point. The trained word vectors seem reasonably ok, judging by my_model.wv.evaluate_word_analogies(), though they do need more training than this.
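(For reference, a sketch of how that evaluation is typically invoked – the bundled questions-words set is English, so for a Finnish corpus it's only a rough sanity check:)

from gensim.test.utils import datapath

# English analogy set shipped with gensim; a Finnish corpus would really need its own analogy file
score, sections = my_model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(score)   # overall accuracy on the answerable analogy questions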