SinclairHudson opened this issue 6 years ago
I am also facing the same issue. It's searching for train.dec, train.enc, test.enc etc.
@SinclairHudson @vivek9237 The .gitignore left some things out. Step by step:
1. In the C:\Users\Sinclair\Desktop\tensorflow_chatbot-master\ folder (or wherever your tensorflow_chatbot folder is), make a new folder named "data".
2. Take all the *.txt files from the Cornell training material and dump them into the new "data" folder.
3. Grab the prepare_data.py file from https://github.com/suriyadeepan/datasets.git (it lives in the datasets/seq2seq/cornell_movie_corpus/scripts directory), add it to the "data" folder as well, and run it. That should automagically create the files you need, in the folder where they're expected.
If you leave the files in separate folders, you'll keep getting errors, because prepare_data.py won't be able to locate the *.txt files (I experienced this myself...at great length...until I wanted to strangle a defenseless puppy).
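As a quick sanity check before running it (a minimal sketch of my own, not part of the repo; the filenames are the standard Cornell corpus ones), you can confirm the corpus files really are sitting next to prepare_data.py:

```python
import os

# Run this from inside the "data" folder. prepare_data.py expects the
# Cornell *.txt files to be right next to it.
for name in ['movie_lines.txt', 'movie_conversations.txt']:
    print(name, 'found' if os.path.exists(name) else 'MISSING')
```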
I'm also afraid to say this is as far as I have gotten. As soon as I corrected this issue, a new one arrived:

```
>> Mode : train
Preparing data in working_dir/
Tokenizing data in data/train.enc
Tokenizing data in data/train.dec
Tokenizing data in data/test.enc
Tokenizing data in data/test.dec
2017-08-07 02:30:44.988705: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 02:30:44.989205: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 02:30:44.989705: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 02:30:44.990705: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 02:30:44.992706: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-07 02:30:44.994206: W c:\tf_jenkins\home\workspace\release-win\m\windows\py\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
Creating 3 layers of 256 units.
Traceback (most recent call last):
  File "execute.py", line 319, in
```
Replace the following line:

```python
self.outputs, self.losses = tf.nn.seq2seq.model_with_buckets(
```

with

```python
self.outputs, self.losses = tf.contrib.legacy_seq2seq.model_with_buckets(
```
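If you want the same file to keep working across TensorFlow releases, a small compatibility shim does the job too (a sketch of my own, not from the repo; the functions kept their signatures when they moved into tf.contrib.legacy_seq2seq):

```python
import tensorflow as tf

# Resolve the seq2seq module once, then call model_with_buckets through it.
try:
    seq2seq_lib = tf.contrib.legacy_seq2seq  # TF >= 1.0
except AttributeError:
    seq2seq_lib = tf.nn.seq2seq              # TF <= 0.12

# In seq2seq_model.py this becomes:
# self.outputs, self.losses = seq2seq_lib.model_with_buckets(...)
```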
Did that. It still leads directly into another brick wall:

```
Traceback (most recent call last):
  File "execute.py", line 319, in
```
Given the number of issues this project hits when you try to run it with an up-to-date version of TensorFlow, I feel it would be better for an amateur (like myself) to treat it as a broken example and rebuild it, using it as a map of sorts: see what needs to be called, then look up the TensorFlow documentation and write the correct code. I don't foresee myself continuing to slam my head against a wall trying to figure out how to proverbially ride a bicycle backwards. The TensorFlow library is going to keep progressing, so rather than regressing, we might as well use this repo as a template for an updated model.
@Crakkerjakked I feel the same way. Have you made much progress on updating this repo's dependencies?
Has anyone gotten this program to run without errors?
I'm trying to get this working as well, and in addition trying to understand what is actually going on here...
In my attempt to understand what that prepare_data.py file is supposed to be doing, I've tried to clean it up and add more comments so that the steps are clearer. (Still not sure exactly what's going on...)
https://github.com/monkut/cornell-movie-corpus-processor/blob/master/process.py
Any comments are welcome!
Finally got everything to seemingly run, but the resulting test output was crap...

```
python execute.py
>> Mode : test
WARNING:tensorflow:From tensorflow_chatbot/seq2seq_model.py:174 in __init__.: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
Reading model parameters from working_dir/seq2seq.ckpt-10200
> Where are you from?
_UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK
> Are you trained?
_UNK _UNK _UNK !
> I like pizza, how about you?
_UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK _UNK
```
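Incidentally, the deprecation warning at the top of that run names its own fix. Assuming seq2seq_model.py:174 is the usual Saver line from the TensorFlow seq2seq tutorial this repo was built on (an assumption; check your copy), it's a one-name change:

```python
# Before (deprecated, per the warning):
# self.saver = tf.train.Saver(tf.all_variables())
# After, following the warning's instructions:
self.saver = tf.train.Saver(tf.global_variables())
```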
Gotchas:
I'm running this with Python 3.6, so I needed to update some of the text-handling imports.
If I can get this working I'd issue a PR... but I'm not quite there yet.
Anyway, my cleanup is the process.py linked above.
I am not able to run your code, @monkut.
My output is:

```
>> Mode : train
Preparing data in working_dir/
Creating vocabulary working_dir/vocab20000.enc from data/train.enc
  processing line 100000
Full Vocabulary Size : 45604
Vocab Truncated to: 20000
Creating vocabulary working_dir/vocab20000.dec from data/train.dec
  processing line 100000
Full Vocabulary Size : 44343
Vocab Truncated to: 20000
Tokenizing data in data/train.enc
Traceback (most recent call last):
  File "execute.py", line 352, in
    train()
  File "execute.py", line 138, in train
    gConfig['dec_vocab_size'])
  File "C:\Users\aamis\Desktop\tensorflow_chatbot\data_utils.py", line 141, in prepare_custom_data
    data_to_token_ids(train_enc, enc_train_ids_path, enc_vocab_path, tokenizer)
  File "C:\Users\aamis\Desktop\tensorflow_chatbot\data_utils.py", line 126, in data_to_token_ids
    normalize_digits)
  File "C:\Users\aamis\Desktop\tensorflow_chatbot\data_utils.py", line 108, in sentence_to_token_ids
    words = tokenizer(sentence)
  File "C:\Users\aamis\Desktop\tensorflow_chatbot\data_utils.py", line 51, in basic_tokenizer
    words = re.split(_WORD_SPLIT, space_separated_fragment)
  File "C:\Users\aamis\AppData\Local\Programs\Python\Python35\lib\re.py", line 203, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: cannot use a string pattern on a bytes-like object
```
Simple. Try this first instead of the above. If it doesn't work, run prepare_data.py with the Python 2 interpreter instead of Python 3. That fixed the problem I was facing.
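If you'd rather stay on Python 3, the TypeError above usually means data_utils.py reads the files in binary mode while its regexes are text patterns. A possible fix (my own sketch based on the traceback, not a tested patch for this exact repo) is to make the patterns byte patterns:

```python
import re

# Byte patterns, because each `sentence` read in binary mode is bytes in Python 3.
_WORD_SPLIT = re.compile(rb"([.,!?\"':;)(])")
_DIGIT_RE = re.compile(rb"\d")

def basic_tokenizer(sentence):
    """Split a bytes sentence into a list of byte tokens."""
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(re.split(_WORD_SPLIT, space_separated_fragment))
    return [w for w in words if w]

print(basic_tokenizer(b"Where are you from?"))  # [b'Where', b'are', b'you', b'from', b'?']
```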
Which code, @lohith-emplay?
This code:

```python
import random

'''
1. Read from 'movie_lines.txt'
2. Create a dictionary with ( key = line_id, value = text )
'''
def get_id2line():
    lines = open('movie_lines.txt').read().split('\n')
    id2line = {}
    for line in lines:
        _line = line.split(' +++$+++ ')
        if len(_line) == 5:
            id2line[_line[0]] = _line[4]
    return id2line

'''
1. Read from 'movie_conversations.txt'
2. Create a list of [list of line_id's]
'''
def get_conversations():
    conv_lines = open('movie_conversations.txt').read().split('\n')
    convs = []
    for line in conv_lines[:-1]:
        _line = line.split(' +++$+++ ')[-1][1:-1].replace("'", "").replace(" ", "")
        convs.append(_line.split(','))
    return convs

'''
1. Get each conversation
2. Get each line from conversation
3. Save each conversation to file
'''
def extract_conversations(convs, id2line, path=''):
    idx = 0
    for conv in convs:
        f_conv = open(path + str(idx) + '.txt', 'w')
        for line_id in conv:
            f_conv.write(id2line[line_id])
            f_conv.write('\n')
        f_conv.close()
        idx += 1

'''
Get lists of all conversations as Questions and Answers
1. [questions]
2. [answers]
'''
def gather_dataset(convs, id2line):
    questions = []
    answers = []
    for conv in convs:
        # drop the last line of odd-length conversations so lines pair up
        if len(conv) % 2 != 0:
            conv = conv[:-1]
        for i in range(len(conv)):
            if i % 2 == 0:
                questions.append(id2line[conv[i]])
            else:
                answers.append(id2line[conv[i]])
    return questions, answers

'''
We need 4 files
1. train.enc : Encoder input for training
2. train.dec : Decoder input for training
3. test.enc  : Encoder input for testing
4. test.dec  : Decoder input for testing
'''
def prepare_seq2seq_files(questions, answers, path='', TESTSET_SIZE=30000):
    # open files
    train_enc = open(path + 'train.enc', 'w')
    train_dec = open(path + 'train.dec', 'w')
    test_enc = open(path + 'test.enc', 'w')
    test_dec = open(path + 'test.dec', 'w')

    # choose 30,000 (TESTSET_SIZE) items to put into the test set
    test_ids = random.sample(range(len(questions)), TESTSET_SIZE)

    for i in range(len(questions)):
        if i in test_ids:
            test_enc.write(questions[i] + '\n')
            test_dec.write(answers[i] + '\n')
        else:
            train_enc.write(questions[i] + '\n')
            train_dec.write(answers[i] + '\n')
        if i % 10000 == 0:
            print '\n>> written %d lines' % (i)

    # close files
    train_enc.close()
    train_dec.close()
    test_enc.close()
    test_dec.close()

####
# main()
####
id2line = get_id2line()
print '>> gathered id2line dictionary.\n'
convs = get_conversations()
print '>> gathered conversations.\n'
questions, answers = gather_dataset(convs, id2line)
print questions[:2]
print '>> gathered questions and answers.\n'
prepare_seq2seq_files(questions, answers)
```
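Note the script above is Python 2 (print statements, default text I/O), which is why the earlier advice was to run it with the Python 2 interpreter. If you want it under Python 3 instead, the changes are small (my own untested sketch; the corpus files are ISO-8859-1 encoded, per the Cornell corpus README):

```python
# Python 3 variants of the two kinds of lines that differ:
lines = open('movie_lines.txt', encoding='iso-8859-1').read().split('\n')
print('>> gathered id2line dictionary.\n')
```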
@Crakkerjakked OK. I am getting this. What now?

```
(tensorflow_env) C:\Users\DELL\Desktop\cbb>python execute.py
>> Mode : train
Preparing data in working_dir/
Tokenizing data in data/train.enc
Traceback (most recent call last):
  File "execute.py", line 319, in
```
Has anyone found a way to get this to run? I'm getting this error:

```
Traceback (most recent call last):
  File "execute.py", line 352, in
```
This is the error message I get, and from what I can tell I'm missing a "data" folder that contains a training set. Is there a specific way I have to create it? I tried creating my own data folder and plopping the vocab20000 files into it, renamed to train.dec and train.enc, but that just gave me a different error.
Any advice would be much appreciated!