google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Convert SentencePiece .vocab format to OpenNMT-py .onmt_vocab format #1016

Closed: HURIMOZ closed this issue 3 months ago

HURIMOZ commented 3 months ago

Hi, I'm looking for a Python script to convert SentencePiece .vocab files to the OpenNMT-py .onmt_vocab format. SentencePiece .vocab files store a negative value for each token, but OpenNMT-py needs positive values for the token frequencies.

taku910 commented 3 months ago

The numerical values in the vocab file are the scores of the tokens. How a score is defined depends on the model type.

When the score is a log probability, exp(score) will be roughly equivalent to the token's relative frequency, i.e. its occurrence probability.
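
A minimal sketch of such a conversion, assuming the .vocab file has one `token<TAB>score` pair per line, that the scores are log probabilities (as with a unigram model), and that the target file takes one `token<TAB>count` pair per line. The function names, the `scale` factor of 1,000,000, and the list of skipped special tokens are illustrative assumptions, not part of either tool:

```python
import math

# Assumption: special pieces that the downstream toolkit defines itself.
# Adjust this set to match how your SentencePiece model was trained.
SKIP = frozenset({"<unk>", "<s>", "</s>"})


def spm_vocab_to_counts(lines, scale=1_000_000, skip=SKIP):
    """Yield (token, pseudo_count) pairs from SentencePiece .vocab lines.

    Each score is treated as a log probability, so exp(score) is a
    probability in (0, 1]. Multiplying by `scale` and adding 1 gives a
    positive integer that preserves the ranking of the tokens.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the token itself is left untouched.
        token, score = line.rsplit("\t", 1)
        if token in skip:
            continue
        count = int(math.exp(float(score)) * scale) + 1
        yield token, count


def convert_file(src_path, dst_path, scale=1_000_000):
    """Write token<TAB>count lines to dst_path (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for token, count in spm_vocab_to_counts(src, scale):
            dst.write(f"{token}\t{count}\n")
```

The counts are pseudo-frequencies rather than true corpus counts, which is usually enough when the consumer only needs positive integers for thresholding or ordering. If the model type stores something other than log probabilities as the score, the exp() step will not give meaningful frequencies.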