thu-coai / KdConv

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Apache License 2.0

Cannot find chinese_wwm_pytorch #17

Closed by KristenZHANG 3 years ago

KristenZHANG commented 3 years ago

Hi, I encountered a problem when running the code in benchmark/bertret and would like to seek your help. It seems that 'chinese_wwm_pytorch' cannot be found, along with all of its related files (vocab.txt, added_tokens.json, etc.):

[screenshot: error showing that chinese_wwm_pytorch and its files are missing]

Another weird problem: when running ./train_film.sh (or the music/travel variants) for models other than BERT, such as LM and Seq2Seq, I always encounter a segmentation fault and haven't figured out the reason: [screenshot of the segmentation fault]

Thanks so much!

chujiezheng commented 3 years ago
  1. You should prepare the pretrained model yourself, e.g. from https://huggingface.co/hfl/chinese-bert-wwm-ext (see the sketch after this list).
  2. I am not sure about the causes. Maybe you can use another Python version, such as 3.6. I guess that different versions of libraries (such as PyTorch and TensorFlow) may also lead to the problem.
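
For reference, a minimal sketch of step 1, assuming the benchmark looks for a local directory named `chinese_wwm_pytorch` (the directory name is inferred from the error message above, not confirmed by the repo):

```python
# Minimal sketch: download hfl/chinese-bert-wwm-ext from the Hugging Face hub
# and save it under the directory name the benchmark appears to expect.
# "chinese_wwm_pytorch" is an assumption based on the error message.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

# save_pretrained writes the model weights, config, and tokenizer files
# (including tokenizer_config.json and special_tokens_map.json on recent
# transformers versions) into the target directory.
tokenizer.save_pretrained("chinese_wwm_pytorch")
model.save_pretrained("chinese_wwm_pytorch")
```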
KristenZHANG commented 3 years ago

Hi chujie, thanks so much for your reply.

Problem 2 has been solved by changing the Python version to 3.6.

For problem 1, I downloaded BERT-wwm-ext from the link, but it only contains three files: bert_config.json, pytorch_model.bin, and vocab.txt. [screenshots of the downloaded files] After downloading it, when I try to run train_film.sh in the bertret folder, it shows the following: [screenshot of the warnings]

I want to ask: 1) is it normal that added_tokens.json, special_tokens_map.json, and tokenizer_config.json are not found? 2) There is no chinese_stop_words.txt (in the resources folder) under data; did I miss any step?

Thanks so much!

chujiezheng commented 3 years ago
  1. It is normal. Newer versions of transformers try to load these files, but earlier pretrained models do not include them, so feel free to ignore the warnings.
  2. You can download a stop-word list, e.g. from https://github.com/goto456/stopwords (see the sketch after this list).
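
A minimal sketch for step 2, assuming the benchmark expects the file at data/resources/chinese_stop_words.txt (the path is inferred from the question above) and using the cn_stopwords.txt list from that repository:

```python
# Minimal sketch: fetch one of the stop-word lists from goto456/stopwords and
# save it where the benchmark seems to look for it. Both the target path and
# the choice of cn_stopwords.txt are assumptions, not confirmed by the repo.
import os
import urllib.request

URL = "https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt"
TARGET = os.path.join("data", "resources", "chinese_stop_words.txt")

os.makedirs(os.path.dirname(TARGET), exist_ok=True)
urllib.request.urlretrieve(URL, TARGET)
print(f"Saved stop words to {TARGET}")
```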