Hi Philipp, thanks for the suggestion. Any reason why you ignore word[0] for the calculation?
Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.
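For illustration, a minimal sketch of a count that includes word[0] and distinguishes the two variants (a hypothetical helper, not code from the repository):

```python
def count_char_symbols(vocab):
    """Count unique character symbols in a vocabulary, distinguishing
    word-internal ("a@@") from word-final ("a") occurrences."""
    internal = set()
    final = set()
    for word in vocab:
        internal.update(word[:-1])  # every non-final character, including word[0]
        final.add(word[-1])         # the last character yields the word-final symbol
    return len(internal) + len(final)
```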
> Hi, thanks for the suggestion. Any reason why you ignore word[0] for the calculation?

No, I just copied the code from the pairwise statistics, so this should be updated.

> Also, every character can be word-internal or word-final, resulting in two possible subword symbols: "a@@" and "a". I'd take this into account in a patch.
OK. I did not think of that.
-phi
Thanks; this feature has been added in commit 61ad855cf0d7043ad9fb3e448be0865d94d06e4f.
Below is a feature suggestion patch.
The number of merge operations specified with `-s` leads to very different vocabulary sizes, because the final vocabulary consists of the initial character set plus one new symbol per merge operation. A value of 49500 yields a small vocabulary for languages that use the Latin alphabet, but easily 70000 or so symbols for Chinese. It would therefore be good to subtract the number of unique characters from the number of symbols being generated.
```diff
--- a/subword_nmt/learn_bpe.py
+++ b/subword_nmt/learn_bpe.py
@@ -56,6 +56,9 @@ def create_parser(subparsers=None):
     parser.add_argument(
         '--dict-input', action="store_true",
         help="If set, input file is interpreted as a dictionary where each line contains a word-count pair")
+    parser.add_argument(
+        '--total-symbols', action="store_true",
+        help="subtract number of unique characters from the number of symbols to be generated")
@@ -197,7 +200,7 @@ def prune_stats(stats, big_stats, threshold):
             big_stats[item] = freq
 
-def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
+def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False, total_symbols=False):
     """Learn num_symbols BPE operations from vocabulary, and write to outfile.
     """
@@ -211,6 +214,16 @@ def learn_bpe(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):
 
+
+    if total_symbols:
+        uniq_char = defaultdict(int)
+        for word in vocab:
+            prev_char = word[0]
+            for char in word[1:]:
+                uniq_char[char] += 1
+        print('Number of characters: {0}'.format(len(uniq_char)))
+        num_symbols -= len(uniq_char)
+
     # threshold is inspired by Zipfian assumption, but should only affect speed
     threshold = max(stats.values()) / 10
     for i in range(num_symbols):
@@ -270,4 +283,4 @@ if __name__ == '__main__':
     if args.output.name != '<stdout>':
         args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
-    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)
+    learn_bpe(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input, total_symbols=args.total_symbols)
```
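For reference, a usage sketch of the proposed option, assuming the patched `learn_bpe` signature above (file names here are hypothetical):

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe

# Learn 49500 merge operations, minus the number of unique characters,
# so the resulting vocabulary size is comparable across scripts.
with codecs.open('train.zh', encoding='utf-8') as infile, \
     codecs.open('codes.zh', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, 49500, total_symbols=True)
```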