Cakechat "can't find token"

zanaglio commented 6 years ago

Hello guys, I recently started working with cakechat and I'm facing some issues. I ran prepare_index_files.py (with a French dataset of ~10 000 dialogs) without any problems.

Afterwards, when I ran python tools/train.py, it repeatedly emitted this kind of warning:

[25.04.2018 15:18:33.495][INFO][15][cakechat.utils.s3.bucket][21] Got file w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin from S3
[25.04.2018 15:18:33.510][INFO][15][cakechat.utils.w2v.model][51] Loading model from /root/cakechat/data/w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin
[25.04.2018 15:18:33.794][INFO][15][cakechat.utils.w2v.model][53] Model "train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin" has been loaded.
[25.04.2018 15:18:33.794][INFO][15][cakechat.utils.w2v.model][80] Successfully got w2v model
[25.04.2018 15:18:33.794][INFO][15][cakechat.dialog_model.model_utils][205] Preparing embedding matrix based on w2v_model and index_to_token dict

[25.04.2018 15:18:33.806][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [ça] in w2v dict
[25.04.2018 15:18:33.806][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [avec] in w2v dict
[25.04.2018 15:18:33.806][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [même] in w2v dict
[25.04.2018 15:18:33.806][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [ils] in w2v dict
[25.04.2018 15:18:33.806][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [être] in w2v dict
[25.04.2018 15:18:33.807][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [suis] in w2v dict
[25.04.2018 15:18:33.807][WARNING][15][cakechat.dialog_model.model_utils][195] Can't find token [quand] in w2v dict
...

I get about 38 521 warnings like that.
I've checked: all these tokens are in token_index/t_idx_processed_dialogs.json (which is weird, because there are 50 000 words in it and some of them are found).
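One way to quantify the mismatch is to compare the token index against the vocabulary of the loaded w2v model. The sketch below assumes the index file is a JSON object mapping index to token (suggested by the "index_to_token dict" wording in the log, but not confirmed here) and that the w2v vocabulary is available as a plain iterable of words. The function names are hypothetical.

```python
import json


def load_token_index(path):
    # Assumption: the index file is a JSON object mapping index -> token.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def split_by_coverage(index_to_token, w2v_vocab):
    """Split the indexed tokens into (found, missing) relative to a w2v vocabulary."""
    vocab = set(w2v_vocab)
    found, missing = [], []
    for token in index_to_token.values():
        (found if token in vocab else missing).append(token)
    return found, missing
```

If nearly all French tokens land in `missing` while English-looking ones land in `found`, the w2v model was most likely trained on a different corpus than the one that produced the token index.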

Here is what my data/ folder looks like:

condition_index/c_idx_processed_dialogs.json
corpora_processed/train_processed_dialogs.txt
quality/context_free_questions.txt
quality/context_free_test_set.txt
quality/context_free_validation_set.txt
tensorboard/steps
token_index/t_idx_processed_dialogs.json
w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin

Finally, when the step below is reached, the train.py process is killed:

...
[25.04.2018 15:20:26.123][INFO][15][cakechat.dialog_model.model][348] Computing train updates...
[25.04.2018 15:22:27.128][INFO][15][cakechat.dialog_model.model][351] Compiling train function...
Killed

(IS_DEV flag has been set to 0)

Thanks a lot for your help and for all your work!

nikitos9000 commented 6 years ago

Hi @zanaglio

I think you just need to remove the w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin word2vec model, because it's probably left over from a previous training run and has an inconsistent token dict. You need to retrain it from scratch on your actual corpus (train.py will do this before the main training process).

"Killed" can happen when the process runs out of memory. Please check that your dataset fits in memory, because the current implementation loads it entirely into RAM before training.
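A rough pre-flight check for the memory concern could look like the sketch below. The `overhead_factor` is a made-up safety margin, not a number from cakechat, and `available_ram_bytes` relies on `os.sysconf` names that are Linux-specific. Inside a Docker container the effective limit may be lower than what the host reports.

```python
import os


def available_ram_bytes():
    # Linux-only: free physical pages times the page size.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES")


def corpus_fits_in_ram(corpus_bytes, available_bytes, overhead_factor=3.0):
    """Heuristic check: in-memory data is assumed to take
    overhead_factor times the on-disk size (an arbitrary guess)."""
    return corpus_bytes * overhead_factor <= available_bytes
```

For example, `corpus_fits_in_ram(os.path.getsize(corpus_path), available_ram_bytes())` gives a quick yes/no before launching a long run, where `corpus_path` points at the training corpus.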

zanaglio commented 6 years ago

Hi @nsmetanin

Thanks a lot for your answer, I'll work on it !

rodart commented 6 years ago

Hi @zanaglio! Do you still need some help from us?

zanaglio commented 6 years ago

Hello @rodart! Unfortunately, I got this error every time I trained my neural network (maybe because I used a French dataset? I tried with CPU and GPU, with and without touching the parameters, etc.). However, the generated models worked anyway (I didn't notice anything weird when I talked to the chatbot). Thank you for your support!

rodart commented 6 years ago

Hey @zanaglio, I'd recommend the following steps:
1) Launch a fresh Docker container: https://github.com/lukalabs/cakechat/#cpu-only-setup
2) Move your training corpus into data/corpora_processed/ inside the container. Make sure your data has the same structure as https://github.com/lukalabs/cakechat/blob/master/data/corpora_processed/train_processed_dialogs.txt
3) Prepare the index files with python tools/prepare_index_files.py, and make sure the correct index files are saved in data/condition_index and data/token_index
4) Try to train your model with python tools/train.py
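Before running the training step, it can help to confirm that the files from the earlier steps are actually in place. The sketch below uses the file names that appear earlier in this thread. The helper itself is hypothetical and not part of cakechat.

```python
from pathlib import Path

# Paths relative to the data/ directory, as listed earlier in this thread.
EXPECTED_FILES = [
    "corpora_processed/train_processed_dialogs.txt",
    "token_index/t_idx_processed_dialogs.json",
    "condition_index/c_idx_processed_dialogs.json",
]


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_data_files("/root/cakechat/data")` means the corpus and both index files exist. It says nothing about whether their contents match each other.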

rodart commented 6 years ago

Hi @zanaglio, any luck?

zanaglio commented 6 years ago

Hello @rodart,

I recently tried to reproduce this problem with a smaller French dataset (~100 dialogs) and it seems to work (I no longer get the Can't find token error).

So, I don't know if the error was due to:

- The language of the dataset (French has a lot of accented letters: é, è, ç, ô, ù, etc.)
- The size of the dataset (too big?)
- The small changes I made to some files (I think I only changed the Dockerfile with the git repo URL, changed the _NON_PENALIZABLETOKENS variable and changed the _offensivephrases.csv file)

Am I still missing something? (To be honest, I don't think this problem affected my training, which is why I don't really understand it.)

Thanks again for your help :)

rodart commented 6 years ago

Hi @zanaglio,

Most probably this error occurred because a data preparation step was missed (e.g. running python tools/prepare_index_files.py on your training data), or because that step didn't process the corpus correctly.

Language and corpus size shouldn't cause this error. However, we tested our code only on English dialogs, so it's possible that there is some bug related to specific French letters.

Let us know when you have trained a cakechat model on a French dialog corpus. It would be fun to play with it ;)

Feel free to ask if you need any help.

zanaglio commented 6 years ago

Hello @rodart ,

Thanks again for your response and time. Don't worry, I didn't forget to run python tools/prepare_index_files.py :) I just had a thought this morning: could this be because my file val_processed_dialogs.txt contains words that don't exist in train_processed_dialogs.txt? Or does the indexing step only concern train_processed_dialogs.txt?
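The overlap between the two files can be measured directly, independent of how the indexing step works. The sketch below assumes the utterance texts have already been extracted from each file and that whitespace splitting is a good enough approximation. cakechat's own tokenizer may split text differently.

```python
def unseen_tokens(train_texts, val_texts):
    """Return tokens that occur in the validation texts but not in the training texts.

    Assumption: naive whitespace tokenization of already-extracted utterances.
    """
    train_vocab = {tok for text in train_texts for tok in text.split()}
    val_vocab = {tok for text in val_texts for tok in text.split()}
    return val_vocab - train_vocab
```

A large result set would show that the validation corpus uses many words the training corpus never sees.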

rodart commented 6 years ago

The indexing step applies to the training corpus only.

But I have an idea why you might be getting this issue. Is your corpus named train_processed_dialogs.txt? When you launch train.py, it first looks for the w2v model file, whose name depends on the training corpus filename. It tries to get this file in the following order:
1) look for the w2v model file locally in /root/cakechat/data/w2v_models/
2) if there is no such file, try to fetch it from AWS S3
3) if it is not on S3 either, train a w2v model on your corpus and save it to /root/cakechat/data/w2v_models/
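The lookup order described above can be sketched as a simple fallback chain. This is an illustration of the described behaviour, not cakechat's actual code: `fetch_from_s3` and `train_model` are hypothetical callables standing in for the real download and training routines.

```python
import os


def resolve_w2v_model(local_path, fetch_from_s3, train_model):
    """Return (source, path), trying local disk, then S3, then training.

    fetch_from_s3(path) -> bool: hypothetical; downloads to path, True on success.
    train_model(path) -> None: hypothetical; trains a model and saves it to path.
    """
    if os.path.isfile(local_path):
        return "local", local_path
    if fetch_from_s3(local_path):
        return "s3", local_path
    train_model(local_path)
    return "trained", local_path
```

The important consequence is that training on your own corpus only happens when both earlier lookups fail, so a stale local file or a same-named file on S3 silently wins.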

If you haven't changed the training corpus filename, it downloads our pre-trained w2v model from S3. That model was trained on an English Twitter corpus, which is why you see all these 'Can't find token' warnings for a French corpus.

Just try changing the name of the training corpus and training the model again.
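To see why renaming the corpus helps, here is a sketch of how the model filename could be derived. The pattern is inferred from the filename in the logs above (train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin). The function and its parameter names are hypothetical, not taken from the cakechat source.

```python
import os


def w2v_model_filename(corpus_path, window=10, vocab_size=50000,
                       vec_size=128, skip_gram=True):
    """Build a model filename from the corpus name and w2v parameters.

    The naming pattern is inferred from the log output in this thread.
    """
    corpus_name = os.path.splitext(os.path.basename(corpus_path))[0]
    return "{}_window{}_voc{}_vec{}_sg{}.bin".format(
        corpus_name, window, vocab_size, vec_size, skip_gram)
```

With a renamed corpus, for example a hypothetical train_french_dialogs.txt, the derived filename no longer matches the pre-trained English model, so neither the local lookup nor the S3 fetch would find it and a new model would be trained. Renaming the corpus may also require pointing the training configuration at the new file.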

zanaglio commented 6 years ago

Hello @rodart ,

Thanks again for your help, I’ll check/try again and I’ll keep you up-to-date !

hurlenko commented 5 years ago

I'm getting the same warnings. After running python tools/prepare_index_files.py, w2v_models is still empty, so when I run python tools/train.py it looks for the model and, not finding it locally, downloads it from AWS. The thing is, my dialogs are in Cyrillic, so almost all the words from token_index are missing from the w2v model. Is this the expected behavior, or am I missing something?

nicolas-ivanov commented 5 years ago

@hurlenko yes, since you are not using an English-based vocabulary, such behaviour is expected. To force w2v model training on your data, delete the following line to prevent fetching from AWS: https://github.com/lukalabs/cakechat/blob/1efee48352caebfa4bda737c7e35de8edab89aab/cakechat/dialog_model/model_utils.py#L224

hurlenko commented 5 years ago

No changes

Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX 1080 Ti (0000:01:00.0)
[15.02.2019 10:49:38.067][INFO][1722][cakechat.utils.files_utils][87] Loading /root/cakechat/data/tensorboard/steps
[15.02.2019 10:49:38.067][INFO][1722][cakechat.tools/train.py][102] THEANO_FLAGS: floatX=float32,device=cuda0,gpuarray.preallocate=0.0
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][42] Getting train iterator for w2v...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][48] Getting text-filtered train iterator...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][51] Getting tokenized train iterator...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.utils.w2v.model][64] Getting w2v model
[15.02.2019 10:49:38.124][INFO][1722][cakechat.utils.w2v.model][18] Word2Vec model will be trained now. It can take long, so relax and have fun.
[15.02.2019 10:49:38.124][INFO][1722][cakechat.utils.w2v.model][21] Parameters for training: window10_voc50000_vec128_sgTrue
[15.02.2019 10:49:49.366][INFO][1722][cakechat.utils.w2v.model][44] Saving model to /root/cakechat/data/w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin
[15.02.2019 10:49:49.674][INFO][1722][cakechat.utils.w2v.model][47] Model has been saved
[15.02.2019 10:49:49.674][INFO][1722][cakechat.utils.w2v.model][80] Successfully got w2v model

[15.02.2019 10:49:49.674][INFO][1722][cakechat.dialog_model.model_utils][202] Preparing embedding matrix based on w2v_model and index_to_token dict
[15.02.2019 10:49:49.757][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [_unk_] in w2v dict
[15.02.2019 10:49:49.838][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [ехали] in w2v dict
[15.02.2019 10:49:49.839][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [терпеть] in w2v dict
[15.02.2019 10:49:49.839][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [состоянии] in w2v dict
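To compare runs, the warnings can be tallied from a saved log. The sketch below matches the warning text exactly as it appears in the logs in this thread. The function name is hypothetical, and saving the training output to a file is assumed.

```python
import re
from collections import Counter

# Matches lines such as: ... Can't find token [être] in w2v dict
MISSING_TOKEN_RE = re.compile(r"Can't find token \[(.+?)\] in w2v dict")


def missing_tokens(log_lines):
    """Count how often each token is reported as missing from the w2v dict."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_TOKEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing `len(missing_tokens(...))` before and after a change shows whether the change actually reduced the number of tokens without a w2v vector.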

lukalabs / cakechat

Cakechat "can't find token" #19