rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Subtract characters #54

Closed phikoehn closed 6 years ago

phikoehn commented 6 years ago

Below is a feature suggestion with a patch.

The number of operations specified with -s leads to very different vocabulary sizes, because the number of unique characters the vocabulary starts with varies by language. A value of 49500 creates small vocabularies for languages that use the Latin alphabet, but easily 70000 or so for Chinese. So it would be good to subtract the number of unique characters from the number of symbols being generated.
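To illustrate the arithmetic behind this suggestion, here is a minimal sketch. It is not the patch itself; `merges_for_budget` and its parameters are hypothetical names, and it assumes the final vocabulary is roughly the unique characters plus one new symbol per merge operation.

```python
from collections import Counter


def merges_for_budget(word_counts, total_symbols):
    """Return how many merge operations fit into a total symbol budget.

    Hypothetical helper: the budget is first spent on the unique
    characters of the corpus; what remains is available for merges.
    """
    unique_chars = {ch for word in word_counts for ch in word}
    return max(total_symbols - len(unique_chars), 0)


# Toy word-count dictionary; a real corpus would be far larger.
latin = Counter("the cat sat on the mat".split())
print(merges_for_budget(latin, 100))  # 9 unique characters, so 91 merges
```

With a character inventory of several thousand, as in Chinese, the same budget would leave correspondingly fewer merges, which is the effect the suggestion aims for.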

+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
     parser.add_argument('--dict-input', action="store_true", help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
     parser.add_argument(
@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
             big_stats[item] = freq

-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
     """Learn num_symbols BPE operations from vocabulary, and write to outfile.
     """
@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_d
     stats, indices = get_pair_statistics(sorted_vocab)
     big_stats = copy.deepcopy(stats)

+

rsennrich commented 6 years ago

Hi Philipp, thanks for the suggestion. Any reason why you ignore word[0] for the calculation?

Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.
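The two points above can be sketched as follows. This is an illustrative assumption, not the code that was eventually committed; `count_initial_symbols` is a hypothetical name, and words are treated as plain strings. It counts word-internal and word-final characters separately and includes word[0] among the internal ones.

```python
def count_initial_symbols(words):
    """Count the distinct word-internal and word-final characters.

    A character in final position yields a different subword symbol
    ("a") than the same character in internal position ("a@@"), so the
    two sets are tracked separately.
    """
    internal = set()
    final = set()
    for word in words:
        if not word:
            continue  # skip empty entries
        internal.update(word[:-1])  # every character but the last, word[0] included
        final.add(word[-1])
    return len(internal), len(final)


n_internal, n_final = count_initial_symbols(["aba", "ab", "b"])
# Only two distinct characters occur, but they account for four symbols.
print(n_internal + n_final)
```

Under this sketch, the number of merge operations would be reduced by `n_internal + n_final` rather than by the number of distinct characters alone.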

phikoehn commented 6 years ago

Hi,

> thanks for the suggestion. Any reason why you ignore word[0] for the calculation?

no, I just copied the code from the pairwise statistics, so this should be updated.

> Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.

OK. I did not think of that.

-phi

rsennrich commented 6 years ago

Thanks; this feature has been added in commit 61ad855cf0d7043ad9fb3e448be0865d94d06e4f.