google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

What does the second column (numeric column) mean? #213

Closed guokeda closed 5 years ago

guokeda commented 5 years ago

Hi, I used SentencePiece with the unigram algorithm to segment protein sequences. The result is a file with two columns. I know the first column is the subword piece, but what does the second (numeric) column mean? Partial results are shown below. I sincerely appreciate your help.

TN -6.12931
GE -6.13264
LD -6.14611
TS -6.15723
SG -6.16908
SQ -6.17167
DD -6.17356
VA -6.17699
ID -6.17975
PL -6.18626
FK -6.19728
KQ -6.20093
LA -6.20492
SE -6.20776
NS -6.20806
TV -6.20955
NF -6.21059
KI -6.23107
VP -6.23211
KE -6.23277

Best regards, YB

taku910 commented 5 years ago

Sorry for the late response.

I'm not sure what this "result" refers to, but assuming it is the *.vocab file, the first column is the piece and the second column is its log probability. See Eq. 6 in http://aclweb.org/anthology/P18-1007
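
To make the two columns concrete, here is a minimal Python sketch that parses a few of the lines quoted above and converts each score back to a probability. It assumes the scores are natural-log probabilities and that each line holds a piece and a score separated by whitespace. The helper names `parse_vocab_lines` and `to_probabilities` are made up for illustration and are not part of SentencePiece.

```python
import math


def parse_vocab_lines(lines):
    """Parse 'piece<whitespace>score' lines into (piece, log_prob) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so tabs and spaces both work.
        piece, score = line.rsplit(None, 1)
        entries.append((piece, float(score)))
    return entries


def to_probabilities(entries):
    """Map each piece to exp(log_prob), assuming natural-log scores."""
    return {piece: math.exp(log_prob) for piece, log_prob in entries}


# A few lines copied from the question above.
sample = ["TN\t-6.12931", "GE\t-6.13264", "LD\t-6.14611"]
entries = parse_vocab_lines(sample)
probs = to_probabilities(entries)

for piece, log_prob in entries:
    print(f"{piece}\tlog_prob={log_prob}\tprob={probs[piece]:.6f}")
```

A score closer to zero means a more probable piece, which is why the lines in the question appear in descending order of score.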

However, the vocab file is not used in the actual segmentation. You need to use spm_encode with the trained model to segment arbitrary text.
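
As a rough sketch of that step, the Python snippet below calls the `spm_encode` command-line tool through `subprocess` and returns the pieces for each input line. It assumes `spm_encode` is installed and on the PATH. The model path `protein.model` and the helper names `build_spm_encode_cmd` and `segment_text` are hypothetical and used only for illustration.

```python
import subprocess


def build_spm_encode_cmd(model_path, output_format="piece"):
    """Build the spm_encode argument list for a trained model file."""
    return [
        "spm_encode",
        f"--model={model_path}",
        f"--output_format={output_format}",
    ]


def segment_text(model_path, text):
    """Pipe text to spm_encode on stdin; return one list of pieces per line.

    Requires the spm_encode binary to be available on the PATH.
    """
    result = subprocess.run(
        build_spm_encode_cmd(model_path),
        input=text,
        capture_output=True,
        encoding="utf-8",  # pieces may contain non-ASCII marker characters
        check=True,
    )
    return [line.split() for line in result.stdout.splitlines()]


# Hypothetical model path; replace with the .model file produced by training.
cmd = build_spm_encode_cmd("protein.model")
print(" ".join(cmd))
```

Calling `segment_text("protein.model", "one sequence per line\n")` would then run the tool on that input, provided the model file exists.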

taku910 commented 5 years ago

Let me close this issue as we have no update. Please feel free to reopen it if necessary.

guokeda commented 5 years ago

Thank you very much for your reply.