Closed: changsha2999 closed this issue 9 months ago.
Hi @changsha2999,
The code has gone through many changes. In the past, we used word2vec embeddings to get a baseline, so the word2index dictionary was the centerpiece of the word-to-index mapping. In the current version available to you, however, it is only used in the true-mask generation, where a mask with a one-to-one correspondence to the output tokens is generated (see ln[9], self.x_mask and self._sin).
So in short, yes, it can be removed, although you will have to change this section to avoid errors.
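For anyone reading later, here is a minimal sketch of what such a word2index-based true-mask step can look like. The names (`make_true_mask`, `"<PAD>"`, `"<UNK>"`, `max_len`) are hypothetical, and this is not the repository's exact implementation:

```python
# Minimal sketch, assuming a hypothetical word2index dict with
# "<PAD>" and "<UNK>" entries; not the repository's exact code.
import torch

def make_true_mask(sentence, word2index, max_len):
    pad_id = word2index["<PAD>"]
    # map each word to its index, falling back to the unknown token
    ids = [word2index.get(w, word2index["<UNK>"]) for w in sentence[:max_len]]
    ids += [pad_id] * (max_len - len(ids))  # pad to a fixed length
    ids = torch.tensor(ids)
    # 1.0 for real tokens, 0.0 for padding: one mask entry per output token
    return (ids != pad_id).float()
```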
Hi Rafiepour, thank you very much for your answer. That's a great help to me!
Another small question: the first slot label in the dataset corresponds to the "BOS" token:

BOS does us air fly from ...
O O B-airline_name I-airline_name O O ...

That is why the code does

dataset = [[t[0][1:-1], t[1][1:], t[2]] for t in dataset]

where t[1][1:] removes that first slot label. Am I right?
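For concreteness, here is that stripping step run on a tiny made-up sample in the same (tokens, slot labels, intent) format; this is hypothetical data, not the actual ATIS file:

```python
# Hypothetical mini-sample; note the slot-label list has a leading "O"
# for BOS but no label for EOS, matching the slicing below.
dataset = [
    (["BOS", "does", "us", "air", "fly", "from", "EOS"],
     ["O", "O", "B-airline_name", "I-airline_name", "O", "O"],
     "atis_flight"),
]

# t[0][1:-1] drops the BOS/EOS tokens; t[1][1:] drops the first slot label
dataset = [[t[0][1:-1], t[1][1:], t[2]] for t in dataset]

print(dataset[0][0])  # ['does', 'us', 'air', 'fly', 'from']
print(dataset[0][1])  # ['O', 'B-airline_name', 'I-airline_name', 'O', 'O']
```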
Thanks again.
Yes, as I commented above that line:
#removes BOS, EOS from array of tokens and tags
These tokens should not be counted in the F1 score, because they would artificially inflate it.
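To illustrate the inflation, here is a hedged sketch with made-up predictions, using scikit-learn's micro-averaged token-level F1 as a stand-in rather than the repository's exact metric code:

```python
from sklearn.metrics import f1_score

true_tags = ["O", "B-airline_name", "I-airline_name", "O", "O"]
pred_tags = ["O", "B-airline_name", "O", "O", "O"]  # one wrong tag

# BOS/EOS positions are trivially tagged "O", so keeping them adds
# guaranteed-correct tags and pushes the score up.
padded = f1_score(["O"] + true_tags + ["O"],
                  ["O"] + pred_tags + ["O"], average="micro")
plain = f1_score(true_tags, pred_tags, average="micro")
print(plain, padded)  # 0.80 vs ~0.86
```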
To keep things organized, if you have any other questions, feel free to create a new issue.
Hi Rafiepour, thanks for your great work! I have a small question about the use of word2index: since the real ids (input_ids) are produced by the bert_tokenizer, can I remove word2index?
Thanks!
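A small sketch of the point in the question, using the Hugging Face transformers API; the sentence and settings are made up:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("does us air fly from boston",
                return_tensors="pt", padding="max_length", max_length=12)
# The tokenizer already yields the real ids and a padding mask,
# so no separate word2index lookup is needed on the input side.
print(enc["input_ids"])
print(enc["attention_mask"])
```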