Hi @elyesmanai, do you train with TPU? Then the `vocab.txt` file needs to be stored in your Google Bucket, e.g. under `gs://<bucket-name>/vocab.txt` (if you didn't change the path in `configure_pretraining.py`).
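For reference, here is roughly where that path comes from. This is a sketch, assuming `PretrainingConfig` in `configure_pretraining.py` derives `vocab_file` from `data_dir` as in the upstream repo; the bucket name is a placeholder:

```python
import os

# Sketch of the relevant lines in configure_pretraining.py (not verbatim):
# vocab_file is built from data_dir, so pointing data_dir at your bucket
# makes the vocab resolve inside the bucket as well.
data_dir = "gs://<bucket-name>"                   # placeholder bucket
vocab_file = os.path.join(data_dir, "vocab.txt")  # -> gs://<bucket-name>/vocab.txt
```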
Hello, yes, I'm using TPUs and put everything in the Cloud bucket, but I still get that error.
Could you post the output of `grep "\]$" vocab.txt`, or point to the vocab file that you found in the repo?
I've already trained some models with my own vocab, and pre-training was always working 🤔
The output was humongous, so I narrowed it to `grep "SEP\]$" vocab.txt` and I got this.
Turns out I was not changing the path to the vocab, which has to be done in the `configure_pretraining.py` file. Changed it and it works.
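If anyone else hits this, a quick existence check after editing the path catches typos early. A minimal sketch, assuming TensorFlow is installed and the path is GCS-readable (the bucket name is hypothetical):

```python
import tensorflow as tf  # tf.io.gfile understands gs:// paths

vocab_path = "gs://<bucket-name>/vocab.txt"  # hypothetical path
if not tf.io.gfile.exists(vocab_path):
    raise FileNotFoundError("vocab not found at %s" % vocab_path)
```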
When running `run_pretraining.py`, I get this error before it pretrains:
```
================================================================================
Running training
================================================================================
2020-04-28 04:43:55.132186: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:356] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: '[SEP]'
Traceback (most recent call last):
  File "run_pretraining.py", line 384, in <module>
    main()
  ...
  (lines ignored because they're not useful)
  ...
  File "/home/manai_elye2s/pretrain/electra/pretrain/pretrain_helpers.py", line 121, in _get_candidates_mask
    ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
KeyError: '[SEP]'
```
I got this both with my own vocab and with the default one I downloaded from this repo. In both vocab.txt files, the [SEP], [CLS], and [MASK] tokens are present, without extra spaces.
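A quick way to reproduce the lookup that fails: load the vocab roughly the way BERT-style tokenization does (one token per line, mapped token -> index) and check the special tokens. This is a sketch, not the repo's exact loader; the path is a placeholder:

```python
import tensorflow as tf

def load_vocab(path):
    """Map each token (one per line) to its line index."""
    vocab = {}
    with tf.io.gfile.GFile(path, "r") as reader:
        for index, line in enumerate(reader):
            vocab[line.rstrip("\n")] = index
    return vocab

vocab = load_vocab("gs://<bucket-name>/vocab.txt")  # placeholder path
for token in ("[SEP]", "[CLS]", "[MASK]"):
    # A missing token here reproduces the KeyError from _get_candidates_mask.
    print(token, "ok" if token in vocab else "MISSING")
```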