google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Apache License 2.0

KeyError: '[SEP]' #50

Closed elyesmanai closed 4 years ago

elyesmanai commented 4 years ago

When running run_pretraining.py, I get this error before pre-training starts:

```
================================================================================
Running training

2020-04-28 04:43:55.132186: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:356] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: '[SEP]'
Traceback (most recent call last):
  File "run_pretraining.py", line 384, in main()
  ... (lines ignored because they're not useful) ...
  File "/home/manai_elye2s/pretrain/electra/pretrain/pretrain_helpers.py", line 121, in _get_candidates_mask
    ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
KeyError: '[SEP]'
```
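For context, a minimal sketch of what the failing code path amounts to (the helper names here are assumptions mirroring the traceback, not the repo's exact implementation): the vocab file is read line by line into a token-to-id mapping, and the line at pretrain_helpers.py:121 then looks up the BERT special tokens directly, so any vocab file that lacks them raises exactly this KeyError.

```python
from collections import OrderedDict

def load_vocab(vocab_file):
    """Load a WordPiece vocab file (one token per line) into token -> id."""
    vocab = OrderedDict()
    with open(vocab_file, encoding="utf-8") as f:
        for index, line in enumerate(f):
            token = line.rstrip("\n")
            if token:
                vocab[token] = index
    return vocab

def get_ignore_ids(vocab):
    # Essentially the failing line from _get_candidates_mask:
    # raises KeyError('[SEP]') if the token is missing from the mapping.
    return [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
```

So the error means the vocab file that was actually loaded at runtime did not contain [SEP], even if the file you intended to use does.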

I got this both with my own vocab and the default one I downloaded from this repo. In both vocab.txt files there are the [SEP] [CLS] and [MASK] tokens, without space

stefan-it commented 4 years ago

Hi @elyesmanai, do you train with TPU?

Then the vocab.txt file needs to be stored in your Google Cloud Storage bucket, e.g. located under gs://<bucket-name>/vocab.txt (if you didn't change the path in configure_pretraining.py).
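A quick sanity check before launching pre-training is to confirm that the file at the configured path really contains the special tokens. This is a hedged sketch (the function name is mine); for a gs:// path you would swap open() for tf.io.gfile.GFile, which has the same interface:

```python
# Tokens the ELECTRA masking code looks up directly in the vocab.
REQUIRED_TOKENS = {"[SEP]", "[CLS]", "[MASK]"}

def missing_special_tokens(vocab_file):
    """Return the set of required special tokens absent from vocab_file."""
    with open(vocab_file, encoding="utf-8") as f:
        tokens = {line.rstrip("\n") for line in f}
    return REQUIRED_TOKENS - tokens
```

If this returns a non-empty set for the path pre-training actually reads, that explains the KeyError.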

elyesmanai commented 4 years ago

Hello, yes I'm using TPUs and put everything in the Cloud bucket, but I still got that error.

stefan-it commented 4 years ago

Could you post the output of grep "\]$" vocab.txt, or could you point to the vocab file that you've used from the repo?

I've already trained some models with my own vocab, and pre-training was always working 🤔

elyesmanai commented 4 years ago

The output was humongous, so I changed it to grep "\SEP]$" and got this: [screenshot of the matching vocab lines]

elyesmanai commented 4 years ago

Turns out I was not updating the path to the vocab, which has to be done in the configure_pretraining.py file. Changed it and it works.
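For anyone hitting the same thing, this is roughly the place to change. The snippet below is a hypothetical sketch of how configure_pretraining.py typically derives the vocab path from the data directory (the class and attribute names are assumptions based on the traceback and comments above, so check your copy of the file):

```python
import os

class PretrainingConfig:
    """Simplified stand-in for the config object in configure_pretraining.py."""
    def __init__(self, model_name, data_dir):
        self.model_name = model_name
        self.data_dir = data_dir
        # If your vocab.txt lives somewhere else (e.g. a different bucket
        # path), this is the value to override:
        self.vocab_file = os.path.join(data_dir, "vocab.txt")

# Example with a hypothetical bucket path:
config = PretrainingConfig("electra_small", "gs://my-bucket/electra_data")
```

The point is simply that the vocab path is computed from data_dir, so uploading vocab.txt to the bucket is not enough unless the configured path actually points at it.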