google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Some Questions about vocab? #194

Open delldu opened 5 years ago

delldu commented 5 years ago

Dear friends,

I checked the vocab.txt files under uncased_L-12_H-768_A-12, chinese_L-12_H-768_A-12 and multilingual_L-12_H-768_A-12, and I have some questions:

1. There are many ##xxx entries in vocab.txt, for example:

  $ cat uncased_L-12_H-768_A-12/vocab.txt | grep "##" | sort | more
  ##at
  ##ata
  ...
  $ cat uncased_L-12_H-768_A-12/vocab.txt | grep -w "at"
  at
  ##at

My question is: "at" is already in vocab.txt, so why is "##at" needed? What does "##at" mean here?

2. There are many numbers in vocab.txt, for example:

  $ cat uncased_L-12_H-768_A-12/vocab.txt | grep 0
  1609
  690
  1910s
  840
  1086
  ...

My question is: there are infinitely many numbers, so is it reasonable to put some of them into vocab.txt?

3. Some common words are missing from the vocab. For example, the word "fax" does not exist in uncased_L-12_H-768_A-12/vocab.txt:

  $ cat uncased_L-12_H-768_A-12/vocab.txt | grep fax
  halifax
  fairfax

Thanks in advance for your answers.

zheolong commented 5 years ago

@delldu

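  1. '##xxx' means that 'xxx' is a piece that continues a word rather than starting one, so a single word that is not in the vocabulary can still be built from pieces. For example, the name 'Hypatia' is split into the pieces 'h', '##yp', '##ati', '##a'; you can check that '##yp', '##ati' and '##a' are in the vocabulary file. That is why both 'at' (a whole word, or the start of one) and '##at' (the tail of a longer word) are present.

  2. The same applies to numbers: only some numbers are in the vocabulary, and any other number is split into pieces in the same way.
  3. I think this vocabulary does not cover every word, and I do not know of one that does.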
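
To make point 1 concrete, here is a minimal sketch of greedy longest-match-first splitting, which is how I understand the WordPiece step to work. It is not the repository's code: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for illustration, and the real vocab.txt is of course much larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into pieces, longest match first.

    Illustrative sketch only; `vocab` is any set of piece strings.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: give up on the whole word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the contents of the real vocab.txt.
toy_vocab = {"at", "##at", "c", "h", "##yp", "##ati", "##a", "19", "##84"}

print(wordpiece_tokenize("hypatia", toy_vocab))  # ['h', '##yp', '##ati', '##a']
print(wordpiece_tokenize("at", toy_vocab))       # ['at']
print(wordpiece_tokenize("cat", toy_vocab))      # ['c', '##at']
print(wordpiece_tokenize("1984", toy_vocab))     # ['19', '##84']
```

With this toy vocabulary, 'at' on its own matches the plain entry, while the 'at' inside 'cat' can only match '##at'. The number '1984' is not listed but is still covered by the pieces '19' and '##84', so the vocabulary does not have to contain every number.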