rsennrich / subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
MIT License

Improve library usability #52

Closed lfurrer closed 6 years ago

lfurrer commented 6 years ago

When using subword-nmt as a Python library rather than a script, calling BPE.segment() might result in unnecessary string operations.

Consider the following situation, where the user already has a list of tokens and needs a list of segments:

sentence = ' '.join(tokens)
segments = bpe.segment(sentence)
segments = segments.split(' ')

... and inside BPE.segment(), the reverse of these operations happens at the boundaries, i.e. the sentence is first split on whitespace, and the list of segments is joined back into a string before returning.

This pull request adds a new method, BPE.segment_tokens(), which accepts an iterable of tokens and returns a list of segments, while leaving the current API unchanged. This avoids the superfluous string operations in the scenario described above.
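To illustrate the design, here is a minimal, self-contained sketch of the pattern. It is not the subword-nmt implementation: `ToySegmenter`, its `max_len` parameter, and the fixed-length splitting rule are hypothetical stand-ins for real BPE merges. The only point is the layering, where the string-based `segment()` becomes a thin wrapper around a list-based `segment_tokens()`.

```python
from typing import Iterable, List


class ToySegmenter:
    """Hypothetical stand-in for a BPE-style segmenter (not subword-nmt code)."""

    def __init__(self, max_len: int = 3) -> None:
        # Assumption for illustration: words are cut into fixed-size chunks
        # instead of applying learned BPE merge operations.
        self.max_len = max_len

    def _split_word(self, word: str) -> List[str]:
        # Cut the word into chunks of at most max_len characters and mark
        # every non-final chunk with the "@@" continuation suffix.
        chunks = [word[i:i + self.max_len]
                  for i in range(0, len(word), self.max_len)]
        return [chunk + "@@" for chunk in chunks[:-1]] + chunks[-1:]

    def segment_tokens(self, tokens: Iterable[str]) -> List[str]:
        # List in, list out: no joining or splitting of whole sentences.
        return [seg for token in tokens for seg in self._split_word(token)]

    def segment(self, sentence: str) -> str:
        # The existing string API stays available as a thin wrapper.
        return " ".join(self.segment_tokens(sentence.split()))


tokens = ["lower", "new"]
segmenter = ToySegmenter()

# Token-based call: the caller never builds an intermediate string.
segments = segmenter.segment_tokens(tokens)

# String-based round trip, as in the snippet above; same result, extra work.
round_trip = segmenter.segment(" ".join(tokens)).split(" ")
```

Both calls produce the same list of segments; the token-based one simply skips the join and split that the caller and the method would otherwise undo for each other.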

rsennrich commented 6 years ago

thanks! pulled.