stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

Vocab Generation #16

Closed IssaIssa1 closed 4 years ago

IssaIssa1 commented 4 years ago

Hello, I am trying to generate a vocab to train an ELECTRA model. I am using the following code:

from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
  clean_text=False,
  handle_chinese_chars=False,
  strip_accents=False,
  lowercase=True,
)
# prepare text files to train vocab on them
files = ['data.txt']

tokenizer.train(
  files,
  vocab_size=100000,
  min_frequency=2,
  show_progress=True,
  #special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
  limit_alphabet=1000,
  wordpieces_prefix="##"
)
tokenizer.save('vocabs.txt')

When I use tokenizer.save('./') I get Exception: Is a directory (os error 21). When I save it as in the code above and then run build_pretraining_dataset.py, I get the error below. I suspect that there is something wrong with the vocab format.

    output.append(vocab[item])
KeyError: '[UNK]'
What do you think is missing?

stefan-it commented 4 years ago

Hi @IssaIssa1 ,

In your example you need to un-comment the special_tokens line (because these tokens are really needed), and the last line should be tokenizer.save("./vocab.txt"). I tested it with the latest 0.8.0 version of tokenizers :)
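
Roughly, the changed part of your snippet would then look like this (same as your code, just with those two edits applied):

tokenizer.train(
  files,
  vocab_size=100000,
  min_frequency=2,
  show_progress=True,
  special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],  # un-commented
  limit_alphabet=1000,
  wordpieces_prefix="##"
)
# pass a file name instead of a directory
tokenizer.save("./vocab.txt")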

IssaIssa1 commented 4 years ago

Thank you, I appreciate your help. I generated the vocab.txt as recommended, but I still get the same error. I am running the following command, where the corpus folder contains a txt file with 300k+ data points, one per line.

python3 build_pretraining_dataset.py --corpus-dir=./data/corpus --vocab-file=./data/vocabs/vocab.txt --output-dir=./output_data

I am getting this error.

Job 0: Creating example writer
Job 0: Writing tf examples
Traceback (most recent call last):
  File "build_pretraining_dataset.py", line 230, in <module>
    main()
  File "build_pretraining_dataset.py", line 218, in main
    write_examples(0, args)
  File "build_pretraining_dataset.py", line 190, in write_examples
    example_writer.write_examples(os.path.join(args.corpus_dir, fname))
  File "build_pretraining_dataset.py", line 143, in write_examples
    example = self._example_builder.add_line(line)
  File "build_pretraining_dataset.py", line 50, in add_line
    bert_tokids = self._tokenizer.convert_tokens_to_ids(bert_tokens)
  File "/tf/ar_nlp/ar_albert/electra/model/tokenization.py", line 130, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/tf/ar_nlp/ar_albert/electra/model/tokenization.py", line 91, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'

stefan-it commented 4 years ago

Could you manually check for the unknown-token via:

$ grep "\[UNK\]" ./data/vocabs/vocab.txt

IssaIssa1 commented 4 years ago

I got this (see the screenshot of the grep output below). It is noticeable that it creates pretrain_data.tfrecord-0-of-1000 files in the output directory, but with 0 B size. One note: I am working on Arabic text.

[screenshot of the grep output]

stefan-it commented 4 years ago

But this is definitely the wrong vocab format (it seems to be byte-level BPE instead of WordPiece) -> could you verify that you're using the correct vocab?

Depending on the number of processes and the amount of training data, some of the tfrecord files can be empty. But you should check the directory size (e.g. with du -sh ./output_data) to see if any data was written :)

IssaIssa1 commented 4 years ago

I am using BertWordPieceTokenizer, so shouldn't it be word-piece based? I got this for the output: 72K ./output_data, so I don't think the data was written. Could you please share how the vocab.txt should look?

from tokenizers import BertWordPieceTokenizer

# Initialize an empty BERT tokenizer
tokenizer = BertWordPieceTokenizer(
  clean_text=False,
  handle_chinese_chars=False,
  strip_accents=False,
  lowercase=True,
)

# prepare text files to train vocab on them
files = ['data.txt']
# train BERT tokenizer
tokenizer.train(
  files,
  vocab_size=100000,
  min_frequency=2,
  show_progress=True,
  special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
  limit_alphabet=1000,
  wordpieces_prefix="##"
)

stefan-it commented 4 years ago

No problem, here's an example vocab:

https://cdn.huggingface.co/dbmdz/bert-base-turkish-cased/vocab.txt

The first lines include the special tokens. I also think that no data was written 🤔
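
With the special_tokens list from your snippet, the first lines of a WordPiece vocab.txt should look roughly like this (one token per line, special tokens first, then the single-character alphabet, then the learned pieces; the concrete entries depend on your corpus):

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
... single characters from the alphabet ...
... whole-word pieces and ##-prefixed subword pieces ...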

IssaIssa1 commented 4 years ago

I saved the tokenizer as JSON, then extracted the vocab into vocab.txt. It worked. Thank you.
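
For anyone who hits the same problem, a minimal sketch of that extraction step (assuming the saved JSON has a "model" -> "vocab" mapping of token to id, which is how recent tokenizers versions serialize a WordPiece model; the file names here are placeholders):

import json

# load the saved tokenizer JSON and pull out the token -> id mapping
with open("tokenizer.json", encoding="utf-8") as f:
    vocab = json.load(f)["model"]["vocab"]

# write one token per line, ordered by id: the plain-text vocab.txt layout
# that build_pretraining_dataset.py expects
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, _ in sorted(vocab.items(), key=lambda item: item[1]):
        f.write(token + "\n")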