thu-coai / KdConv

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Apache License 2.0

Cannot find chinese_wwm_pytorch #17

Closed by KristenZHANG 3 years ago

KristenZHANG commented 3 years ago

Hi, I encountered a problem when running the code in benchmark/bertret and would like to seek your help. It seems that 'chinese_wwm_pytorch' cannot be found, along with all of its related files (vocab.txt, added_tokens.json, etc.):

[screenshot: error showing that chinese_wwm_pytorch and its files are missing]

Another weird problem: when running ./train_film.sh (or the music/travel variants) for models other than BERT, such as LM and Seq2Seq, I always encounter a segmentation fault and haven't figured out the reason: [screenshot of the segmentation fault]

Thanks so much!

chujiezheng commented 3 years ago
  1. You should prepare the pretrained model yourself, e.g. from https://huggingface.co/hfl/chinese-bert-wwm-ext (see the sketch after this list).
  2. I am not sure about the causes. Maybe you can use another Python version, such as 3.6. I guess that different versions of libraries (such as PyTorch and TensorFlow) may also lead to the problem.
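
For reference, a minimal sketch of step 1, assuming the benchmark looks for a local directory named `chinese_wwm_pytorch` (the directory name is inferred from the error message above, not confirmed by the repo):

```python
# Minimal sketch: download hfl/chinese-bert-wwm-ext from the Hugging Face hub
# and save it under the directory name the benchmark appears to expect.
# "chinese_wwm_pytorch" is an assumption based on the error message.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm-ext")
model = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

# save_pretrained writes the model weights, config, and tokenizer files
# (including tokenizer_config.json and special_tokens_map.json on recent
# transformers versions) into the target directory.
tokenizer.save_pretrained("chinese_wwm_pytorch")
model.save_pretrained("chinese_wwm_pytorch")
```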
KristenZHANG commented 3 years ago

Hi chujie, thanks so much for your reply.

Problem 2 has been solved by changing the Python version to 3.6.

For problem 1, I downloaded BERT-wwm-ext from the link, but it only contains three files: bert_config.json, pytorch_model.bin, and vocab.txt. [screenshots of the downloaded files] After downloading it, when I try to run train_film.sh in the bertret folder, it shows the following: [screenshot of the warnings]

I want to ask: 1) is it normal that added_tokens.json, special_tokens_map.json, and tokenizer_config.json are not found? 2) There is no chinese_stop_words.txt (in the resources folder) under data; did I miss any step?

Thanks so much!

chujiezheng commented 3 years ago
  1. It is normal. Newer versions of transformers try to load these files, but earlier pretrained models do not include them, so feel free to ignore the warnings.
  2. You can download a stop-word list, e.g. from https://github.com/goto456/stopwords (see the sketch after this list).
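
A minimal sketch for step 2, assuming the benchmark expects the file at data/resources/chinese_stop_words.txt (the path is inferred from the question above) and using the cn_stopwords.txt list from that repository:

```python
# Minimal sketch: fetch one of the stop-word lists from goto456/stopwords and
# save it where the benchmark seems to look for it. Both the target path and
# the choice of cn_stopwords.txt are assumptions, not confirmed by the repo.
import os
import urllib.request

URL = "https://raw.githubusercontent.com/goto456/stopwords/master/cn_stopwords.txt"
TARGET = os.path.join("data", "resources", "chinese_stop_words.txt")

os.makedirs(os.path.dirname(TARGET), exist_ok=True)
urllib.request.urlretrieve(URL, TARGET)
print(f"Saved stop words to {TARGET}")
```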