guolinke / TUPE

Transformer with Untied Positional Encoding (TUPE). Code for the paper "Rethinking Positional Encoding in Language Pre-training". It improves existing models such as BERT.
MIT License

Is the first column of `dict.txt` equivalent to the vocabulary learned from fastBPE? #16

Closed · huberemanuel closed this issue 3 years ago

huberemanuel commented 3 years ago

In other words, can I use the first column of `dict.txt` as the vocabulary and each subword's line number as its id when projecting input tokens into the embedding layer?

guolinke commented 3 years ago

Yeah, the first column is the subword. However, you first need to split the words into subwords; this step is done by fastBPE. The text data can then be mapped to subwords, and you can use the line numbers as ids for the embedding layer.
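For illustration, here is a minimal sketch (not code from this repo) of the mapping described above. It assumes `dict.txt` follows the usual fairseq format of one `<subword> <count>` pair per line, and that the input text has already been split into subwords by fastBPE; the embedding dimension and the example token list are made up.

```python
import torch
import torch.nn as nn

def load_vocab(dict_path):
    """Map each subword (first column of dict.txt) to its line number."""
    vocab = {}
    with open(dict_path, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            subword = line.split()[0]   # first column is the subword
            vocab[subword] = idx        # line number becomes the token id
    return vocab

vocab = load_vocab("dict.txt")
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

# "bpe_tokens" stands in for the output of fastBPE on one sentence.
bpe_tokens = "re think ing posit ional en coding".split()
ids = torch.tensor([vocab[t] for t in bpe_tokens if t in vocab])
token_embeddings = embedding(ids)   # shape: (num_tokens, 768)
```

Note that in a full fairseq pipeline the loaded dictionary typically prepends special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`), so the effective embedding ids may be offset relative to the raw line numbers in `dict.txt`.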

huberemanuel commented 3 years ago

Perfect, thank you!