piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Word2vec: loss tally maxes at 134217728.0 due to float32 limited-precision #2735

Open tsaastam opened 4 years ago

tsaastam commented 4 years ago

Cumulative loss of word2vec maxes out at 134217728.0

I'm training a word2vec model with 2,793,404 sentences / 33,499,912 words, vocabulary size 162,253 (words with at least 5 occurrences).

Expected behaviour: with compute_loss=True, gensim's word2vec should report a training loss that keeps accumulating correctly across epochs.

Actual behaviour: the cumulative loss seems to be maxing out at 134217728.0:

Building vocab...
Vocab done. Training model for 120 epochs, with 16 workers...
Loss after epoch 1: 16162246.0 / cumulative loss: 16162246.0
Loss after epoch 2: 11594642.0 / cumulative loss: 27756888.0

[ - snip - ]

Loss after epoch 110: 570688.0 / cumulative loss: 133002056.0
Loss after epoch 111: 564448.0 / cumulative loss: 133566504.0
Loss after epoch 112: 557848.0 / cumulative loss: 134124352.0
Loss after epoch 113: 93376.0 / cumulative loss: 134217728.0
Loss after epoch 114: 0.0 / cumulative loss: 134217728.0
Loss after epoch 115: 0.0 / cumulative loss: 134217728.0

And it stays at 134217728.0 thereafter. The value 134217728.0 is of course exactly 128*1024*1024, i.e. 2^27, which does not seem like a coincidence.

Steps to reproduce

My code is as follows:

from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec
import pandas as pd

class MyLossCalculator(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_losses = []

    def on_epoch_end(self, model):
        cumu_loss = model.get_latest_training_loss()
        loss = cumu_loss if self.epoch <= 1 else cumu_loss - self.cumu_losses[-1]
        print(f"Loss after epoch {self.epoch}: {loss} / cumulative loss: {cumu_loss}")
        self.epoch += 1
        self.losses.append(loss)
        self.cumu_losses.append(cumu_loss)

def train_and_check(my_sentences, my_epochs, my_workers=8):
    print(f"Building vocab...")
    my_model: Word2Vec = Word2Vec(sg=1, compute_loss=True, workers=my_workers)
    my_model.build_vocab(my_sentences)
    print(f"Vocab done. Training model for {my_epochs} epochs, with {my_workers} workers...")
    loss_calc = MyLossCalculator()
    trained_word_count, raw_word_count = my_model.train(my_sentences, total_examples=my_model.corpus_count, compute_loss=True,
                                                        epochs=my_epochs, callbacks=[loss_calc])
    loss = loss_calc.losses[-1]
    print(trained_word_count, raw_word_count, loss)
    loss_df = pd.DataFrame({"training loss": loss_calc.losses})
    loss_df.plot(color="blue")
#    print(f"Calculating accuracy...")
#    acc, details = my_model.wv.evaluate_word_analogies(questions_file, case_insensitive=True)
#    print(acc)
    return loss_calc, my_model

The data is a news article corpus in Finnish; I'm not at liberty to share all of it (and anyway it's a bit big), but it looks like one would expect:

[7]: df.head(2)
[7]: [Row(file_and_id='data_in_json/2018/04/0001.json.gz%%3-10169118', index_in_file='853', headline='Parainen pyristelee pois lastensuojelun kriisistä: irtisanoutuneiden tilalle houkutellaan uusia sosiaalityöntekijöitä paremmilla työeduilla', publication_date='2018-04-20 11:59:35+03:00', publication_year='2018', publication_month='04', sentence='hän tiesi minkälaiseen tilanteeseen tulee', lemmatised_sentence='hän tietää minkälainen tilanne tulla', source='yle', rnd=8.436637410902392e-08),
     Row(file_and_id='data_in_xml/arkistosiirto2018.zip%%arkistosiirto2018/102054668.xml', index_in_file=None, headline='*** Tiedote/SDP: Medialle tiedoksi: SDP:n puheenjohtaja Antti Rinteen puhe puoluevaltuuston kokouksessa ***', publication_date='2018-04-21T12:51:44', publication_year='2018', publication_month='04', sentence='me haluamme jättää hallitukselle välikysymyksen siitä miksi nuorten ihmisten tulevaisuuden uskoa halutaan horjuttaa miksi epävarmuutta ja näköalattomuutta sekä pelkoa tulevaisuuden suhteen halutaan lisätä', lemmatised_sentence='me haluta jättää hallitus välikysymys se miksi nuori ihminen tulevaisuus usko haluta horjuttaa miksi epävarmuus ja näköalattomuus sekä pelko tulevaisuus suhteen haluta lisätä', source='stt', rnd=8.547760445010155e-07)]

sentences = list(map(lambda r: r["lemmatised_sentence"].split(" "), df.select("lemmatised_sentence").collect()))

[18]: sentences[0]
[18]: ['hän', 'tietää', 'minkälainen', 'tilanne', 'tulla']

Versions

The output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

is:

Windows-10-10.0.18362-SP0
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
NumPy 1.17.3
SciPy 1.3.1
gensim 3.8.1
FAST_VERSION 1

Finally, I'm not the only one who has encountered this issue. I found the following related links:

https://groups.google.com/forum/#!topic/gensim/IH5-nWoR_ZI

https://stackoverflow.com/questions/59823688/gensim-word2vec-model-loss-becomes-0-after-few-epochs

I'm not sure if this is only a display issue and the training continues normally even after the cumulative loss reaches its "maximum", or if the training in fact stops at that point. The trained word vectors seem reasonably ok, judging by my_model.wv.evaluate_word_analogies(), though they do need more training than this.
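
One way to check this directly would be to track how much the word vectors move between epochs (an untested sketch, along the lines of the callback above); if they keep drifting after the reported loss hits zero, training is still happening:

import numpy as np
from gensim.models.callbacks import CallbackAny2Vec

class VectorDriftChecker(CallbackAny2Vec):
    """Reports the mean absolute change of the word vectors per epoch."""
    def __init__(self):
        self.prev_vectors = None
        self.epoch = 1

    def on_epoch_end(self, model):
        current = model.wv.vectors.copy()
        if self.prev_vectors is not None:
            drift = np.abs(current - self.prev_vectors).mean()
            print(f"Epoch {self.epoch}: mean absolute vector change {drift:.6f}")
        self.prev_vectors = current
        self.epoch += 1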

gojomo commented 4 years ago

Thanks for the details, & especially the observation that the stagnant loss-total is exactly 128*1024*1024 (2^27) – that is odd, and likely relevant. (It's also good to be able to see that your corpus is an in-memory list – so it's not some iteration-from-elsewhere that's malfunctioning – and your model parameters aren't peculiar.)

Unfortunately the loss-calculation feature was only half-thought-out, & incompletely implemented, with pending necessary fixes & improvements (per #2617).

It'd be interesting to know if in your setup that reproduces the issue:

gojomo commented 4 years ago

OK, looks like the original implementation of loss tracking chose to use a 32-bit float. Limited-precision floating-point numbers of course become 'coarser' as they get further from 0.0, and by the time they reach 2^27, the immediate-next-representable number is already far more than 1.0 away. As a result, tallying more small numbers into this total won't have any effect.

Representative weirdness:

In [1]: import numpy as np                                                                         

In [2]: a = np.ndarray(1, dtype=np.float32)                                                        

In [3]: a[0] = 134217728.0                                                                         

In [4]: a[0]  # it's already an unexpected displayed value                                                                 
Out[4]: 134217730.0

In [5]: a[0] = a[0] + 2.0                                        

In [6]: a[0]  # adding a small value did nothing                                                                             
Out[6]: 134217730.0

In [7]: np.nextafter(a[0], np.finfo(np.float32).max)  # next possible larger float32 is 10 higher
Out[7]: 134217740.0

In [8]: np.nextafter(a[0], 0)  # next possible smaller float32
Out[8]: 134217727.99999999

Fixing the existing implementation to be a per-epoch tally would make the problem far less likely to occur (but a sufficiently large epoch might still trigger it). Using a highest-precision type – why not np.float128? – for the loss tally might make a recurrence unthinkable. But the whole flaky feature needs a re-look to match real user needs.
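
For comparison, here's a small sketch (separate from the session above, not gensim code) of how the spacing between representable values differs between float32 and float64 at this magnitude, and why a higher-precision tally would keep registering small additions:

import numpy as np

tally32 = np.float32(134217728.0)   # 2**27, where the reported tally stalls
tally64 = np.float64(134217728.0)

print(np.spacing(tally32))   # 16.0 -> additions below about 8 are rounded away
print(np.spacing(tally64))   # ~2.98e-08 -> tiny additions still register

print(tally32 + np.float32(2.0) == tally32)   # True: the +2.0 is lost
print(tally64 + 2.0 == tally64)               # False: float64 keeps it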

@tsaastam, given this, I'd expect the answers to my "interesting to know" bullet-points are...

...but it'd be good to hear if that's the case for you.

tsaastam commented 4 years ago

I've updated the code to this to better see what's going on:

import time
from numpy import linalg   # assuming numpy's linalg for the norms below

class MyLossCalculator(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_losses = []
        self.previous_epoch_time = time.time()

    def on_epoch_end(self, model):
        cumu_loss = model.get_latest_training_loss()
        loss = cumu_loss if self.epoch <= 1 else cumu_loss - self.cumu_losses[-1]
        norms = [linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        print(f"Loss after epoch {self.epoch}: {loss} / cumulative loss: {cumu_loss} "+\
              f" -> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: "+\
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms)/len(norms)), 2)}, {round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(loss)
        self.cumu_losses.append(cumu_loss)

Annoyingly, on this laptop I'm using right now, the loss stays at a very high level for the first 40 epochs, at around 1.8 million, whereas on the PC it had gone down to about 680k by epoch 40. The only difference is the updated loss calculator as above and the number of workers.

Anyway, assuming that's all benign, the good news is that training is still occurring after the reported loss goes to zero:

Loss after epoch 38: 1806704.0 / cumulative loss: 132444312.0  -> epoch took 51.58 s - vector norms min/avg/max: 0.02, 5.17, 13.53
Loss after epoch 39: 1773416.0 / cumulative loss: 134217728.0  -> epoch took 52.38 s - vector norms min/avg/max: 0.02, 5.2, 13.59
Loss after epoch 40: 0.0 / cumulative loss: 134217728.0  -> epoch took 52.39 s - vector norms min/avg/max: 0.02, 5.22, 13.64
Loss after epoch 41: 0.0 / cumulative loss: 134217728.0  -> epoch took 52.34 s - vector norms min/avg/max: 0.02, 5.25, 13.68
Loss after epoch 42: 0.0 / cumulative loss: 134217728.0  -> epoch took 52.27 s - vector norms min/avg/max: 0.02, 5.28, 13.71
Loss after epoch 43: 0.0 / cumulative loss: 134217728.0  -> epoch took 52.21 s - vector norms min/avg/max: 0.02, 5.3, 13.76

Your explanation makes sense - it didn't look like an overflow at first glance, as the loss changes pretty massively between epochs; but of course the cumulative loss is computed with lots of tiny updates, each of which is getting ignored as you've detailed. So that seems to be the cause.

I'm not sure why the training is much slower on the laptop (i.e. the loss is much higher after 40 full epochs), but that probably isn't related to this. In case it matters, here are the versions of things on the laptop:

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.3 | packaged by conda-forge | (default, Dec  6 2019, 08:36:57) 
[Clang 9.0.0 (tags/RELEASE_900/final)]
NumPy 1.17.5
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 1

Not sure why the NumPy and SciPy versions are slightly different (on Windows they were 1.17.3 and 1.3.1).

gojomo commented 4 years ago

@tsaastam If you try setting model.running_training_loss to 0.0 on each epoch end, does tallying continue to work into your later epochs?

(Also, separate from this bug: that's a lot of epochs! The last few epoch-deltas before the problem already show epoch loss jittering up-and-down; you may already be past the point, possibly far past the point, where more epochs are doing any good.)

tsaastam commented 4 years ago

You're probably right about additional epochs not helping; my concern though is that on the Windows machine the loss drops fairly quickly from about 6 million to around 700k - here between epochs 8 and 9:

[Screenshot 2020-01-30 at 13 43 27: per-epoch loss on the Windows machine]

(That image is with the bug, so the drop at the end is due to the cumulative loss suddenly dropping to zero - ignore that part.)

Anyway, on the laptop, with the same data, the loss is still at around 1.8 million by epoch 39. I now realise it's not quite the same code, since on the laptop I was running the version that measures the magnitude of the word vectors after each epoch... but that shouldn't interfere with the training? I might need to investigate this apparent training discrepancy a bit more, and maybe open a separate issue if it turns out to be a real thing.

On the loss issue: taking your advice and adding model.running_training_loss = 0.0 to the end of the loss calculating method (and removing the cumulative loss stuff as it's no longer needed), the problem indeed seems to be resolved. Here's my loss calc now (the custom cumulative loss isn't really needed except for some debugging convenience):

class MyLossCalculator(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_loss = 0.0
        self.previous_epoch_time = time.time()

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        norms = [linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        self.cumu_loss += float(loss)
        print(f"Loss after epoch {self.epoch}: {loss} (cumulative loss so far: {self.cumu_loss}) "+\
              f"-> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: "+\
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms)/len(norms)), 2)}, {round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(float(loss))
        model.running_training_loss = 0.0

And (running on the laptop again), the loss updates now seem to work fine:

Building vocab...
Vocab done. Training model for 100 epochs, with 6 workers...
Loss after epoch 1: 35497984.0 (cumulative loss so far: 35497984.0) -> epoch took 62.76 s - vector norms min/avg/max: 0.02, 1.69, 9.03
Loss after epoch 2: 34189408.0 (cumulative loss so far: 69687392.0) -> epoch took 57.72 s - vector norms min/avg/max: 0.02, 2.17, 9.79
Loss after epoch 3: 33939008.0 (cumulative loss so far: 103626400.0) -> epoch took 59.45 s - vector norms min/avg/max: 0.02, 2.5, 10.18
Loss after epoch 4: 33727968.0 (cumulative loss so far: 137354368.0) -> epoch took 63.11 s - vector norms min/avg/max: 0.02, 2.76, 10.22
Loss after epoch 5: 33630772.0 (cumulative loss so far: 170985140.0) -> epoch took 53.8 s - vector norms min/avg/max: 0.02, 2.97, 10.2
Loss after epoch 6: 33635360.0 (cumulative loss so far: 204620500.0) -> epoch took 61.26 s - vector norms min/avg/max: 0.02, 3.15, 10.08

After epoch 5 there, the accumulated loss is around 170 million, which is of course more than the 134ish million where the problem occurred before. So your workaround is good.

(The loss still goes down much more slowly here than on the Windows PC earlier, as I said I need to investigate that a bit more.)

gojomo commented 4 years ago

It's good to know that running model.running_training_loss = 0.0 at each epoch-end, and thus essentially modifying the model's tally to be per-epoch rather than multi-epoch, improves the issue for you. (It may not be a total fix, as larger datasets may still suffer from imprecise or stalled within-epoch loss tallies from this same limit.)
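
As a rough illustration of where that within-epoch limit sits (a sketch, not gensim code), a float32 tally stops registering increments of a given size once it grows large enough:

import numpy as np

def float32_stall_point(increment):
    """Smallest power-of-two tally at which adding `increment` has no effect."""
    inc = np.float32(increment)
    tally = np.float32(1.0)
    while np.float32(tally + inc) != tally:
        tally = np.float32(tally * 2.0)
    return tally

print(float32_stall_point(1.0))   # 2**24: increments of ~1.0 stop counting here
print(float32_stall_point(8.0))   # 2**27: the value this issue's tally is stuck at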

Not sure what could be causing your other possibly-anomalous behavior – that very-slight loss-improvement in your last "on the laptop again" figures seems fishy, especially for early epochs where the effective alpha is high and the improvement from the original random-initialization of each example/epoch should be large – but I'll let you dig into that. If you have lingering issues, you might want to post an update or new message to the discussion list.

Cartman0 commented 4 years ago

@tsaastam The dtype of model.running_training_loss may change as model.running_training_loss is recalculated. Can you check its type?

gojomo commented 4 years ago

@Cartman0 What do you mean? Are you encountering an error or expecting some efficiency problem? (Behind the scenes, I believe the value in the Python object is being copied-into a C-structure for the Cython code, then copied back out after that code tallies all the tiny errors. So that C-type, still just a 32-bit float, will be most relevant for the tallying behavior. That's probably still too coarse overall for accurate & robust loss-reporting – so this per-epoch reset to 0.0 isn't a general fix for the underlying problem – but helps a bit in this case.)

Cartman0 commented 4 years ago

@gojomo I thought there might be a type problem because of the nested np.sum, np.log, and scipy.special.expit calls used in calculating model.running_training_loss. Is there no type problem there?

gojomo commented 4 years ago

@Cartman0 I’m not sure what you mean here by ‘type problem’. What chain-of-operations could lead to a bad result? (It’s possible there’s a problem – as the use of float32 for a large long-running tally demonstrates, this code wasn’t written with the deepest analysis – but I don’t know what error/mismatch is your concern.)

Cartman0 commented 4 years ago

@gojomo Sorry, I wrote that confusingly.

I also suspect the cause is something like information loss from repeatedly accumulating into a 32-bit float, given the maximum at 2^27. But as I understand it, a Python float object has 64-bit precision, so model.running_training_loss could be kept at 64-bit precision. Does an unintended implicit type conversion happen somewhere in the calculation of model.running_training_loss and its chain of operations?

gojomo commented 4 years ago

@Cartman0 The actual loss computation, and tallying, occurs in the Cython code – which is compiled to C/C++ & has a fixed float32 type, no matter what Python type is used in the Python model.running_training_loss object. For example, it's copied-to a c-structure and then copied-out (& my guess is that the problem in #2743 is caused by multiple threads doing this without regard to each others' interim updates).
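
Schematically (an illustration only, not the actual gensim/Cython code), the round-trip looks like this, which is why the Python-side type doesn't help:

import numpy as np

def train_batch_schematic(py_running_loss, per_example_losses):
    c_running_loss = np.float32(py_running_loss)   # copy into the C-side 32-bit float
    for loss in per_example_losses:
        c_running_loss += np.float32(loss)         # tally there in 32-bit precision
    return float(c_running_loss)                   # copy back to the Python attribute

# Once the tally is large, small per-example losses vanish inside the 32-bit copy,
# regardless of the Python attribute's own type.
print(train_batch_schematic(134217728.0, [0.5] * 1000))   # still 134217728.0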

Cartman0 commented 4 years ago

@gojomo OK, thanks. I'm not familiar with Cython.

Are the functions that actually compute the loss w2v_fast_sentence_sg_hs and w2v_fast_sentence_cbow_hs in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec_inner.pyx?

gojomo commented 4 years ago

@Cartman0 Yes, those 2 functions, and also the negative-sampling versions of those functions, w2v_fast_sentence_sg_neg & w2v_fast_sentence_cbow_neg.

pnezis commented 4 years ago

@gojomo setting model.running_training_loss to 0.0 at an epoch's end seems to affect the training.

Sample callback:

import logging
from gensim.models.callbacks import CallbackAny2Vec

class LossReportCallback(CallbackAny2Vec):
    def __init__(self, reset_loss=False):
        self.epoch = 1
        self.previous_cumulative_loss = 0
        self.reset_loss = reset_loss

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.reset_loss:
            model.running_training_loss = 0.0
        else:
            loss_now = loss - self.previous_cumulative_loss
            self.previous_cumulative_loss = loss
            loss = loss_now

        if self.epoch % 5 == 0:
            logging.info(f'loss after epoch {self.epoch}: {loss}')

        self.epoch += 1

The output if executed with reset_loss=False

INFO loss after epoch 5: 482677.875
INFO loss after epoch 10: 439727.5
INFO loss after epoch 15: 439261.0
INFO loss after epoch 20: 406250.0
INFO loss after epoch 25: 397587.0
INFO loss after epoch 30: 395241.0
INFO loss after epoch 35: 392499.0
INFO loss after epoch 40: 338352.0
INFO loss after epoch 45: 322680.0
INFO loss after epoch 50: 322566.0

The output if executed with reset_loss=True

INFO loss after epoch 5: 499439.3125
INFO loss after epoch 10: 498720.34375
INFO loss after epoch 15: 497191.375
INFO loss after epoch 20: 495352.28125
INFO loss after epoch 25: 501985.4375
INFO loss after epoch 30: 495859.59375
INFO loss after epoch 35: 493470.78125
INFO loss after epoch 40: 493170.25
INFO loss after epoch 45: 492609.6875
INFO loss after epoch 50: 496730.125

gojomo commented 4 years ago

@gojomo setting model.running_training_loss to 0.0 at an epoch's end seems to affect the training.

Yes, that workaround is absolutely expected to change the reported loss numbers, as precision will no longer be lost due to the tally reaching representational extremes. Improving the loss numbers is the whole point of the workaround. You are reporting suggestive evidence that the workaround works.

It shouldn't have any effect on the quality of training results, as this tally (whether for all-epochs or one-epoch) isn't consulted for any model-adjustment steps.

DaikiTanak commented 4 years ago

pnezis reports that the epoch-wise loss changes when model.running_training_loss is reset to 0 at the end of each epoch, and I get a similar result.

If we do not reset it, the model loss continues to decrease and training looks successful. But if we reset it, the loss stagnates at around 490,000 to 500,000 in his example.

Doesn't this mean that model training (or parameter updating) is affected by model.running_training_loss? Or can we still get a good model in the latter case (resetting model.running_training_loss to 0) even if the loss does not decrease below a certain value?

gojomo commented 4 years ago

@DaikiTanak - The running loss tally is very buggy without a per-epoch reset. For large enough training sets, it might also suffer precision issues in a single epoch.

But the actual training that happens, on individual (context->word) examples, is the same either way. That's not affected by this running loss tally in any way. Only the reporting-out is changing. And any reported-out tally of aggregate loss is not a measure of model quality, only model 'convergence' (reaching a point where it can't, given its structure/state, be optimized any more). A model with a higher loss tally might be better on real world problems; an embedding model with a 0.0 loss is likely broken (severely overfit).

DaikiTanak commented 4 years ago

@gojomo Thank you for the kind explanation. I understand that the reported loss can be used to judge model convergence.

As a best practice, we should reset the running loss (model.running_training_loss = 0) at the end of each epoch, and look at the epoch-wise losses to judge whether model training has converged.

[Figure: word2vec_loss_with_reset_model_running_loss (epoch vs. epoch-wise loss, with model.running_training_loss reset to 0)]

For example, from the above figure (epoch vs. epoch-wise loss, with model.running_training_loss reset to 0), we can say that the model may have converged at around epoch 25.
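
One simple way to turn those epoch-wise numbers into a convergence check (a sketch; the window and tolerance are arbitrary choices, not gensim defaults):

def has_converged(epoch_losses, window=5, rel_tol=0.02):
    """True once the last `window` per-epoch losses vary by less than rel_tol."""
    if len(epoch_losses) < window:
        return False
    recent = epoch_losses[-window:]
    return (max(recent) - min(recent)) <= rel_tol * max(recent)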

gojomo commented 4 years ago

All I'd say for sure is that resetting each epoch is better than not. As mentioned, large-enough epochs might show the same bug within a single epoch. Not knowing what code/data generated that graph, it's hard for me to endorse any idea of what it means. (The gradual trend up from x=50 to x=200 is suspicious.)

DaikiTanak commented 4 years ago

Thanks @gojomo, the above graph was generated by the following code. I also think the gradual upward trend is suspicious and may be caused by some bug.

print("training Word2Vec...") callbacker = callback() model.train( sentence_corpus, epochs=model.iter, total_examples=model.corpus_count, compute_loss=True, callbacks=[callbacker], )

gojomo commented 4 years ago

I don't see any specific reason the per-epoch loss might be trending up in your code, but a few other notes: (1) for reasons previously alluded to, and because the tally doesn't include the effects of late-in-epoch adjustments on early-in-epoch examples, the model with the lowest end-of-epoch loss tally is not necessarily 'best'; (2) I've never actually used .deepcopy() on a Word2Vec model, so there's some small chance it might not yield a truly independent copy; (3) min_count=1 is usually a bad idea with the word2vec algorithm, as low-frequency words don't get good word-vectors themselves and do interfere with other words' improvements.

loveis98 commented 1 year ago

@gojomo @tsaastam @DaikiTanak @pnezis Hi! I have the same issue: right from the start of training I get a loss of 134217728.0 after every epoch (constantly). Has a solution been found in the end?

gojomo commented 1 year ago

@loveis98 The bugs limiting the usefulness/interpretability of the Word2Vec loss-reporting will remain until the matters described in #2617 (related to this and other bugs) are truly addressed. The initial work in #2922 might be a starting basis for some real fixes, but there's no one prioritizing/working on this at the moment, afaik.

Manually resetting the tally to 0.0 before each epoch (you haven't mentioned whether you're already doing this) may help the per-epoch readout work better – but larger runs will still face imprecision/maxing-out risks within a single epoch.