rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

learn_bpe.py code question #116

Closed lzp-man closed 2 years ago

lzp-man commented 2 years ago

In learn_bpe.py, the function prune_stats contains the following code:

    for item,freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                big_stats[item] += freq
            else:
                big_stats[item] = freq

Why can freq be below zero? What is this condition for?

rsennrich commented 2 years ago

For efficiency reasons, we keep two dictionaries: stats, which is pruned to the most frequent pairs, and big_stats, which keeps the full statistics.

If a symbol pair is not among the most frequent pairs, it may be pruned from stats, after which its default frequency there is 0. However, the frequency of a symbol pair can still decrease as neighbouring symbols are merged: if you have "a b c" and "b c" is merged, the frequency of "a b" decreases. That is how frequencies in stats can become negative, and why the code you're quoting makes sure to update big_stats correspondingly.
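To make this concrete, here is a minimal sketch of the effect. The prune_stats body is the code quoted in the question; the toy pairs, their counts, the threshold of 5, and the hand-written decrement that stands in for a merge are assumptions chosen for illustration, not values from learn_bpe.py.

```python
from collections import defaultdict
import copy


def prune_stats(stats, big_stats, threshold):
    """Remove pairs below threshold from stats and record them in big_stats."""
    for item, freq in list(stats.items()):
        if freq < threshold:
            del stats[item]
            if freq < 0:
                # The pair was pruned before, so stats restarted from 0 and
                # now holds only the change since then: apply it as a delta.
                big_stats[item] += freq
            else:
                # stats still holds an absolute count: store it as is.
                big_stats[item] = freq


# Toy pair counts (assumed for illustration).
stats = defaultdict(int, {("b", "c"): 10, ("a", "b"): 3})
big_stats = copy.deepcopy(stats)

# First prune: ("a", "b") is below the threshold and leaves stats.
prune_stats(stats, big_stats, threshold=5)
after_first_prune = big_stats[("a", "b")]   # full count, still 3

# Stand-in for merging "b c": two occurrences of "a b c" lose their "a b".
# The pair is no longer in stats, so the defaultdict starts it at 0.
stats[("a", "b")] -= 2
negative_entry = stats[("a", "b")]          # 0 - 2 = -2

# Second prune: the negative entry is folded into big_stats as a delta.
prune_stats(stats, big_stats, threshold=5)
after_second_prune = big_stats[("a", "b")]  # 3 + (-2) = 1

print(after_first_prune, negative_entry, after_second_prune)
```

After the second prune, big_stats holds the true remaining count for ("a", "b"), while stats again contains only the frequent pair ("b", "c").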