Closed huberemanuel closed 3 years ago
yeah, the first column is the subword. However, you need to split words into subwords, this process is done by fastBPE. Then the text data can map to subwords, and then you can use line numbers for embedding layer.
Perfect, thank you!
In other words, can I use
dict.txt
first column as the vocabulary and their respective line number as the id for projecting the input tokens to the embedding layer?