lzp-man closed this issue 2 years ago
For efficiency reasons, we keep two dictionaries:
- big_stats, which is the full collection of symbol pairs
- stats, which is a pruned version of big_stats that initially only contains the most frequent symbol pairs, but is regularly synced with big_stats

Most operations (updating symbol pair frequencies; finding the most frequent one) are done on stats.

If a symbol pair is not among the most frequent pairs, it may be pruned from stats (after which its default frequency there is 0), but the frequency of a symbol pair will actually decrease as neighbouring symbols are merged (if you have "a b c", and "b c" are merged, then the frequency of "a b" decreases). That's how frequencies in stats can become negative, and why the code you're quoting makes sure to update big_stats correspondingly.
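To make the negative case concrete, here is a minimal sketch, not the actual training loop. It assumes stats behaves like a defaultdict(int), so a pruned pair reappears with frequency 0 the next time it is touched. The toy counts and the threshold are made up for illustration, and prune_stats simply mirrors the snippet quoted from learn_bpe.py.

```python
from collections import defaultdict


def prune_stats(stats, big_stats, threshold):
    """Mirror of the quoted snippet: drop rare pairs from stats, keep big_stats in sync."""
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                # The pair was pruned earlier and recreated at 0, so freq is a
                # delta relative to the count already stored in big_stats.
                big_stats[item] += freq
            else:
                # The pair is pruned with a non-negative count: store it as is.
                big_stats[item] = freq


# Toy counts, invented for this example.
stats = defaultdict(int, {("b", "c"): 10, ("a", "b"): 3})
big_stats = defaultdict(int)

# First pruning: ("a", "b") is below the threshold and leaves stats.
prune_stats(stats, big_stats, threshold=5)

# Merging "b c" inside "a b c" removes one occurrence of ("a", "b").
# The pair is no longer in stats, so the defaultdict starts it at 0.
stats[("a", "b")] -= 1  # now -1

# Second pruning: the negative value is added to big_stats as a correction.
prune_stats(stats, big_stats, threshold=5)

print(big_stats[("a", "b")])
```

Without the freq < 0 branch, the second pruning would overwrite the stored count with -1 instead of subtracting one occurrence from it.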
In learn_bpe.py, the function prune_stats has the following code:

    for item,freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq

I want to ask why freq can fall below zero. What is this conditional check for?