Closed zanaglio closed 6 years ago
Hi @zanaglio
I think you just need to remove the w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin
word2vec model. It's probably left over from a previous training run and has an inconsistent token dict, so you need to retrain it from scratch on your actual corpus (train.py will do this before the main training process).
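If you want to verify the mismatch before retraining, you can diff the token index against the model's vocabulary. A minimal sketch, assuming the t_idx_*.json file maps indices to tokens; the helper and its toy data below are illustrative, not cakechat code:

```python
import json  # with real files: json.load(open('data/token_index/t_idx_processed_dialogs.json'))

def missing_tokens(index_to_token, w2v_vocab):
    """Tokens present in the token index but absent from the w2v vocabulary."""
    return sorted(set(index_to_token.values()) - set(w2v_vocab))

# Toy demonstration. With real files, load index_to_token from the JSON above
# and take the vocabulary from the loaded gensim model.
index_to_token = {"0": "_unk_", "1": "bonjour", "2": "merci"}
w2v_vocab = {"bonjour"}
print(missing_tokens(index_to_token, w2v_vocab))  # ['_unk_', 'merci']
```

If this set is large, the model and the index were almost certainly built from different corpora.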
Killed
usually indicates an out-of-memory situation. Please check that your dataset fits in memory, because in the current implementation it is loaded entirely into RAM before training.
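A rough way to estimate this up front. The 3x in-memory overhead factor below is an assumption, not a measured property of cakechat:

```python
def fits_in_ram(corpus_size_bytes, available_bytes, overhead=3.0):
    """Heuristic: tokenized text in Python often occupies several times its
    on-disk size; the default 3x factor is a guess, tune it for your data."""
    return corpus_size_bytes * overhead < available_bytes

# Example: a 2 GB corpus against 4 GB of free RAM
print(fits_in_ram(2 * 1024**3, 4 * 1024**3))  # False: expect the OOM killer
```

You can get the corpus size from os.path.getsize() and the available memory from free -m inside the container.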
Hi @nsmetanin
Thanks a lot for your answer, I'll work on it!
Hi @zanaglio! Do you still need some help from us?
Hello @rodart! Unfortunately I got this error every time I trained my neural network (maybe because I used a French dataset? I tried with CPU and GPU, with and without touching the parameters, etc.). However, the generated models worked anyway (I didn't notice anything weird when I spoke with the chatbot). Thank you for your support!
Hey @zanaglio, I'd recommend the following steps:
1) launch a fresh Docker container - https://github.com/lukalabs/cakechat/#cpu-only-setup
2) move your training corpus inside the container to data/corpora_processed/. Make sure your data has the same structure as here - https://github.com/lukalabs/cakechat/blob/master/data/corpora_processed/train_processed_dialogs.txt
3) prepare index files with python tools/prepare_index_files.py, and make sure the correct index files are saved in data/condition_index and data/token_index
4) try to train your model with python tools/train.py
Hi @zanaglio, any luck?
Hello @rodart,
I recently tried to reproduce this problem with a smaller French dataset (~100 dialogs) and it seems to work (I don't get the Can't find token
error anymore).
So I don't know what the error was due to. Am I still missing something? (Truth is, I don't think this problem impacted my training, which is why I don't really understand it.)
Thanks again for your help :)
Hi @zanaglio,
Most probably this error occurred because you missed some data preparation step, e.g. running python tools/prepare_index_files.py
on your training data, or it didn't process the corpus correctly.
Language and corpus size shouldn't cause this error. However, we tested our code only on English dialogs, so it's possible there is some bug related to specific French characters.
Let us know when you train a cakechat model on a French dialog corpus. It would be fun to play with it ;)
Feel free to ask if you need any help.
Hello @rodart ,
Thanks again for your response and time. Don't worry, I didn't forget to run python tools/prepare_index_files.py
:)
It just occurred to me this morning: could this be because my val_processed_dialogs.txt
file contains words that don't exist in train_processed_dialogs.txt
? Or does the indexing step only concern train_processed_dialogs.txt
?
The indexing step relates to the training corpus only.
But I have an idea why you might be getting this issue. Did you name your corpus train_processed_dialogs.txt? When you launch train.py, it first tries to find a w2v index file locally. That index filename depends on the training corpus filename. The model is resolved in the following steps:
1) try to find the w2v index file locally in /root/cakechat/data/w2v_models/
2) if there is no such file, try to fetch it from AWS S3
3) if it's not on S3 either, train a w2v model on your corpus and save it to /root/cakechat/data/w2v_models/
If you haven't changed the training corpus filename, it downloads our pre-trained w2v model from S3. That model was trained on an English Twitter corpus, which is why you see all those 'Can't find token' warnings for a French corpus.
Just change the name of the training corpus and try to train the model again.
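The lookup order described above can be sketched like this. Function names and callbacks are hypothetical (not cakechat's actual code); the point is to show why a reused filename short-circuits at the S3 step:

```python
import os
import tempfile

def get_w2v_model(model_filename, local_dir, fetch_from_s3, train_w2v):
    """Resolve a w2v model: local file, then S3, then train from scratch."""
    path = os.path.join(local_dir, model_filename)
    if os.path.exists(path):                 # 1) already cached locally
        return path, 'local'
    if fetch_from_s3(model_filename, path):  # 2) pre-trained model on S3
        return path, 's3'
    train_w2v(path)                          # 3) train on your own corpus
    return path, 'trained'

# Demo: no local file and an S3 "miss", so the model is trained and cached.
with tempfile.TemporaryDirectory() as d:
    path, source = get_w2v_model(
        'my_corpus_window10_voc50000_vec128_sgTrue.bin', d,
        fetch_from_s3=lambda name, dest: False,
        train_w2v=lambda dest: open(dest, 'w').close())
    print(source)  # trained
```

With the default corpus name, step 2 succeeds (the English Twitter model exists on S3), so step 3 never runs and your French corpus never gets its own embeddings.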
Hello @rodart ,
Thanks again for your help, I'll check/try again and keep you up to date!
I'm having the same warnings. After running python tools/prepare_index_files.py
, w2v_models is still empty, so when I run python tools/train.py
it starts looking for the model and, when it can't find it locally, downloads it from AWS. The thing is, my dialogs are in Cyrillic, so almost all the words from token_index
are missing from the w2v model. Is this the expected behavior, or am I missing something?
@hurlenko yes, since you are not using an English-based vocabulary, this behaviour is expected. To force w2v model training on your data, delete the following line to prevent fetching from AWS: https://github.com/lukalabs/cakechat/blob/1efee48352caebfa4bda737c7e35de8edab89aab/cakechat/dialog_model/model_utils.py#L224
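With that line removed, the lookup effectively collapses to "use the local model if present, otherwise train on your own corpus". A hypothetical sketch of the resulting behaviour, not the actual cakechat code:

```python
import os
import tempfile

def get_w2v_model_no_s3(model_path, train_w2v):
    """With the S3 fetch deleted, the lookup is simply local-or-train."""
    if not os.path.exists(model_path):
        train_w2v(model_path)  # always trains on your own (e.g. Cyrillic) corpus
    return model_path

# Demo with a throwaway file standing in for the trained .bin
d = tempfile.mkdtemp()
p = get_w2v_model_no_s3(os.path.join(d, 'model.bin'),
                        lambda dest: open(dest, 'w').close())
print(os.path.exists(p))  # True: the "trained" model is now cached locally
```

Subsequent runs then reuse the cached file instead of retraining.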
No changes
Using cuDNN version 5110 on context None
Mapped name None to device cuda0: GeForce GTX 1080 Ti (0000:01:00.0)
[15.02.2019 10:49:38.067][INFO][1722][cakechat.utils.files_utils][87] Loading /root/cakechat/data/tensorboard/steps
[15.02.2019 10:49:38.067][INFO][1722][cakechat.tools/train.py][102] THEANO_FLAGS: floatX=float32,device=cuda0,gpuarray.preallocate=0.0
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][42] Getting train iterator for w2v...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][48] Getting text-filtered train iterator...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.tools/train.py][51] Getting tokenized train iterator...
[15.02.2019 10:49:38.123][INFO][1722][cakechat.utils.w2v.model][64] Getting w2v model
[15.02.2019 10:49:38.124][INFO][1722][cakechat.utils.w2v.model][18] Word2Vec model will be trained now. It can take long, so relax and have fun.
[15.02.2019 10:49:38.124][INFO][1722][cakechat.utils.w2v.model][21] Parameters for training: window10_voc50000_vec128_sgTrue
[15.02.2019 10:49:49.366][INFO][1722][cakechat.utils.w2v.model][44] Saving model to /root/cakechat/data/w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin
[15.02.2019 10:49:49.674][INFO][1722][cakechat.utils.w2v.model][47] Model has been saved
[15.02.2019 10:49:49.674][INFO][1722][cakechat.utils.w2v.model][80] Successfully got w2v model
[15.02.2019 10:49:49.674][INFO][1722][cakechat.dialog_model.model_utils][202] Preparing embedding matrix based on w2v_model and index_to_token dict
[15.02.2019 10:49:49.757][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [_unk_] in w2v dict
[15.02.2019 10:49:49.838][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [ехали] in w2v dict
[15.02.2019 10:49:49.839][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [терпеть] in w2v dict
[15.02.2019 10:49:49.839][WARNING][1722][cakechat.dialog_model.model_utils][192] Can't find token [состоянии] in w2v dict
Hello guys, I've started working with cakechat recently and I'm facing some issues. I ran
prepare_index_files.py
(with a French dataset of ~10,000 dialogs) without problems. Afterwards, when I run
python tools/train.py
, it continually throws this kind of error for tokens from token_index/t_idx_processed_dialogs.json
(it's weird, because there are 50,000 words inside and some of them are found). Here is what my
data/
folder looks like:
condition_index/c_idx_processed_dialogs.json
corpora_processed/train_processed_dialogs.txt
quality/context_free_questions.txt
quality/context_free_test_set.txt
quality/context_free_validation_set.txt
tensorboard/steps
token_index/t_idx_processed_dialogs.json
w2v_models/train_processed_dialogs_window10_voc50000_vec128_sgTrue.bin
Finally, when the step below comes, the
train.py
process is killed (the IS_DEV flag has been set to 0).
Thanks a lot for your help and for all your work!