learn_bpe.py code question

For efficiency reasons, we keep two dictionaries:

big_stats, which is the full collection of symbol pairs
stats, which is a pruned version of big_stats that initially only contains the most frequent symbol pairs, but is regularly synced with big_stats. Most operations (updating symbol pair frequencies; finding the most frequent one) are done on stats.

If a symbol pair is not among the most frequent pairs, it may be pruned from stats, where a missing pair has a default frequency of 0. However, the frequency of a symbol pair can still decrease as neighbouring symbols are merged: if you have "a b c" and "b c" is merged, then the frequency of "a b" decreases. That's how frequencies in stats can become negative, and why the code you're quoting makes sure to update big_stats correspondingly.
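To make this concrete, here is a minimal sketch of the idea. It is not the actual learn_bpe.py code: the function name prune, the threshold value, and the toy counts are assumptions chosen for illustration. It treats a negative value in stats as a delta accumulated since the pair was pruned, and a non-negative value as the full, current count.

```python
from collections import defaultdict

def prune(stats, big_stats, threshold):
    """Drop infrequent pairs from stats and fold their counts into big_stats.

    Hypothetical helper for illustration only.
    """
    for pair, freq in list(stats.items()):
        if freq < threshold:
            del stats[pair]
            if freq < 0:
                # The pair was pruned earlier, so freq is only the change
                # accumulated since then: apply it as a delta.
                big_stats[pair] += freq
            else:
                # freq is the full, up-to-date count: overwrite.
                big_stats[pair] = freq

# Toy counts (made up): "a b" is rare, "b c" is frequent.
big_stats = defaultdict(int, {("a", "b"): 3, ("b", "c"): 10})
stats = defaultdict(int, big_stats)

# First prune: ("a", "b") falls below the threshold and leaves stats.
prune(stats, big_stats, threshold=5)

# Merging "b c" inside one occurrence of "a b c" removes one "a b".
# The pair is no longer in stats, so the default 0 becomes -1.
stats[("a", "b")] -= 1

# Second prune: the negative value is applied to big_stats as a delta.
prune(stats, big_stats, threshold=5)
```

After the second call, big_stats holds the corrected count for ("a", "b"), and stats again contains only the frequent pair.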

rsennrich / subword-nmt

learn_bpe.py code question #116