Hi Philipp, thanks for the suggestion. Any reason why you ignore word[0] for the calculation?
Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.
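For illustration, a minimal sketch of a count that includes word[0] and distinguishes the two variants (a hypothetical helper, not code from the repository):

```python
def count_char_symbols(vocab):
    """Count unique character symbols in a vocabulary, distinguishing
    word-internal ("a@@") from word-final ("a") occurrences."""
    internal = set()
    final = set()
    for word in vocab:
        internal.update(word[:-1])  # every non-final character, including word[0]
        final.add(word[-1])         # the last character yields the word-final symbol
    return len(internal) + len(final)
```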
> Hi, thanks for the suggestion. Any reason why you ignore word[0] for the calculation?

No, I just copied the code from the pairwise statistics, so this should be updated.

> Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.
OK. I did not think of that.
-phi
Thanks; this feature has been added in commit 61ad855cf0d7043ad9fb3e448be0865d94d06e4f.
Below is a feature suggestion patch.
The number of merge operations specified with `-s` leads to very different vocabulary sizes, because the final vocabulary consists of the initial character set plus one new symbol per merge operation. A value of 49500 yields a small vocabulary for languages that use the Latin alphabet, but easily 70000 or so symbols for Chinese. It would therefore be good to subtract the number of unique characters from the number of symbols being generated.
```diff
--- a/subword_nmt/learn_bpe.py
+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
     parser.add_argument(
         '--dict-input', action="store_true",
         help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
+    parser.add_argument(
+        '--total-symbols', action="store_true",
+        help="subtract number of unique characters from the number of symbols to be generated")
@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
             big_stats[item] = freq
 
-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
     """Learn num_symbols BPE operations from vocabulary, and write to outfile.
     """
@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
 
+
+    if total_symbols:
+        uniq_char = defaultdict(int)
+        for word in vocab:
+            prev_char = word[0]
+            for char in word[1:]:
+                uniq_char[char] += 1
+        print('Number of characters: {0}'.format(len(uniq_char)))
+        num_symbols -= len(uniq_char)
+
     # threshold is inspired by Zipfian assumption, but should only affect speed
     threshold = max(stats.values()) / 10
     for i in range(num_symbols):
@@ -270,4 +283,4 @@ if __name__ == '__main__':
     if args.output.name != '<stdout>':
         args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
-    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
+    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input, total_symbols=args.total_symbols)
```
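For reference, a usage sketch of the proposed option, assuming the patched `learn_bpe` signature above (file names here are hypothetical):

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe

# Learn 49500 merge operations, minus the number of unique characters,
# so the resulting vocabulary size is comparable across scripts.
with codecs.open('train.zh', encoding='utf-8') as infile, \
     codecs.open('codes.zh', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, 49500, total_symbols=True)
```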