... and inside BPE.segments(), the reverse of these operations happens on the edges, ie. sentence is first split on whitespace, and the segments list is joined to a string before returning.
This pull request adds a new method, BPE.segment_tokens(), which accepts an iterable of tokens and returns a list of segments, while leaving the current API unchanged. This allows avoiding superfluous string operations in the described secenario.
When using
subword-nmt
as a Python library rather than a script, callingBPE.segment()
might result in unnecessary string operations.Consider the following situation, where the user already has a list of tokens and needs a list of segments:
... and inside
BPE.segments()
, the reverse of these operations happens on the edges, ie.sentence
is first split on whitespace, and the segments list is joined to a string before returning.This pull request adds a new method,
BPE.segment_tokens()
, which accepts an iterable of tokens and returns a list of segments, while leaving the current API unchanged. This allows avoiding superfluous string operations in the described secenario.