udibr / headlines

Automatically generate headlines to short articles
MIT License

Getting issue on running In [69] #3

Open thirumalaipm opened 8 years ago

thirumalaipm commented 8 years ago

Hi,

Thanks for this project. I am trying out these scripts, and the vocabulary-embedding and train scripts ran fine. When I tried predict, I got an error at In [69]. The error is as follows:


TypeError Traceback (most recent call last)

in ()
----> 1 samples = gensamples(X, avoid=avoid, avoid_score=.1, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
     21 avoid = [a.split() if isinstance(a,str) else a for a in avoid]
     22 avoid = [vocab_fold([w if isinstance(w,int) else word2idx[w] for w in a])
---> 23 for a in avoid]
     24
     25 print 'HEADS:'

TypeError: 'numpy.int64' object is not iterable

Please let me know if you need any information on this to debug.

udibr commented 8 years ago

The avoid parameter is not very useful, so I suggest ignoring it and always passing None. The idea behind it was to force the sample generator to produce something different from everything it previously generated for the same desc.

However, I tried to trace how you reached this bug. From the code, it looks like every call to gensamples passes avoid as None, an array, or a list, so perhaps you re-ran the call to gensamples after avoid had somehow changed to an int64.
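
One possible way for a numpy.int64 to reach the inner loop, which is a guess and not confirmed in this thread, is that avoid was a flat sequence of word indices whose elements are NumPy integers, since those are not instances of the built-in int in every environment. A minimal sketch of a defensive normalization step follows. The helper name normalize_avoid is hypothetical and is not part of the notebook; it relies only on the standard-library numbers module, which NumPy integer types register with.

```python
import numbers

def normalize_avoid(avoid):
    """Return avoid as a list of sequences (words or word indices), or None."""
    if avoid is None or len(avoid) == 0:
        return None
    # A single string, or a flat sequence of indices, counts as ONE item to avoid.
    # numbers.Integral matches built-in ints and NumPy integer scalars alike.
    if isinstance(avoid, str) or isinstance(avoid[0], numbers.Integral):
        avoid = [avoid]
    # Split strings into words; copy index sequences into plain lists.
    return [a.split() if isinstance(a, str) else list(a) for a in avoid]

flat = normalize_avoid([3, 7, 9])        # flat list of indices
nested = normalize_avoid([[1, 2], [3]])  # already a list of sequences
text = normalize_avoid("new headline")   # a single string
```

With this step in front of the loop at lines 21 to 23 of the traceback, every element of avoid is iterable by construction, whatever integer type the indices have.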

thirumalaipm commented 8 years ago

Hi Udibr, thanks for the prompt response. I was able to get past that. But when I rerun the script, I see the error below at In [60]. Two reruns of the script executed properly, but many times I see the error below.

HEADS:

ValueError Traceback (most recent call last)

in ()
----> 1 samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
     33 fold_start = vocab_fold(start)
     34 sample, score = beamsearch(predict=keras_rnn_predict, start=fold_start, avoid=avoid, avoid_score=avoid_score,
---> 35 k=k, temperature=temperature, use_unk=use_unk)
     36 assert all(s[maxlend] == eos for s in sample)
     37 samples += [(s,start,scr) for s,scr in zip(sample,score)]

in beamsearch(predict, start, avoid, avoid_score, k, maxsample, use_unk, oov, empty, eos, temperature)
     26 while live_samples:
     27 # for every possible live sample calc prob for every possible label
---> 28 probs = predict(live_samples, empty=empty)
     29 assert vocab_size == probs.shape[1]
     30

in keras_rnn_predict(samples, empty, model, maxlen)
      9 data = sequence.pad_sequences(samples, maxlen=maxlen, value=empty, padding='post', truncating='post')
     10 probs = model.predict(data, verbose=0, batch_size=batch_size)
---> 11 return np.array([output2probs(prob[sample_length-maxlend-1]) for prob, sample_length in zip(probs, sample_lengths)])

in output2probs(output)
      1 # out very own softmax
      2 def output2probs(output):
----> 3 output = np.dot(output, weights[0]) + weights[1]
      4 output -= output.max()
      5 output = np.exp(output)

ValueError: shapes (944,) and (40000,100) not aligned: 944 (dim 0) != 40000 (dim 0)

Once this statement fails, the following statements also fail with the same error. Please let me know what is causing this.

udibr commented 8 years ago

If you look at cell [35] of https://github.com/udibr/headlines/blob/master/predict.ipynb you will see that the weights shape is (944, 40000). However, your error message says (40000,100), which happens to be the shape of the embedding matrix loaded in step [10].

I am guessing that you re-ran some of the notebook cells out of the order in which they appear, and somehow the embedding matrix was copied into weights (although I don't see how).

The safest way to run these notebooks is to "Kernel->Restart" and then execute the cells one after the other starting from the top...
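
Before calling gensamples, the shapes from the error message can be checked directly. This is only a sketch: the helper names dot_is_aligned and guess_weight_origin are hypothetical and do not exist in the notebook, and the sizes 944, 40000 and 100 are the ones quoted in this thread. It works on shape tuples, so in the notebook you would pass something like weights[0].shape.

```python
def dot_is_aligned(vector_shape, matrix_shape):
    """True if np.dot(vector, matrix) is defined for a 1-D vector and a 2-D matrix."""
    return (len(vector_shape) == 1 and len(matrix_shape) == 2
            and vector_shape[0] == matrix_shape[0])

def guess_weight_origin(weight_shape, rnn_output_size, vocab_size, embedding_size):
    """Guess which layer a weight matrix came from, using its shape alone."""
    if weight_shape == (rnn_output_size, vocab_size):
        return "output layer"
    if weight_shape == (vocab_size, embedding_size):
        return "embedding matrix"
    return "unknown"

rnn_output_shape = (944,)   # one time step of the RNN output
expected = (944, 40000)     # weights shape shown in cell [35] of predict.ipynb
loaded = (40000, 100)       # weights shape reported in the ValueError

ok_expected = dot_is_aligned(rnn_output_shape, expected)
ok_loaded = dot_is_aligned(rnn_output_shape, loaded)
origin = guess_weight_origin(loaded, 944, 40000, 100)
```

A check like this, run right after the weights are loaded, would surface the problem at load time instead of deep inside beamsearch.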

thirumalaipm commented 8 years ago

The output I get for cell [35] in predict is [(40000L, 100L)]. It is different from what is shown in your notebook.

The output of cell [34] is as follows:

Loading data1/train.hdf5 to sequential_3
embedding_1
failed to find layer embedding_1 in model
weights 40000x100
stopping to load all other layers

I also changed one parameter in cell [9]:

nb_train_samples = 30000
nb_val_samples = 1000

I changed nb_val_samples to 1000 because of an error message I got. I will reduce nb_train_samples to 10000 and try again. Could this change be causing the issue?

When I run the train notebook, I get the warning below at In [60]:

C:\Anaconda2\lib\site-packages\keras\engine\training.py:1402: UserWarning: Epoch comprised more than samples_per_epoch samples, which might affect learning results. Set samples_per_epoch correctly to avoid this warning.
warnings.warn('Epoch comprised more than '

udibr commented 8 years ago

OK, what happened is that the call to load_weights failed at the first layer (it should fail at the last layer), so the return value was the weights of the first layer, which have shape (40000,100), and not those of the last layer, which have shape (944,40000).

This could be because the train.ipynb notebook, which created the file loaded by load_weights, was run twice without restarting the kernel. Each time you re-create the network nodes (for example, in cell 26 of train), Keras creates new numbering (for example, embedding_2 if embedding_1 is already in use). You can easily fix the load_weights function to convert the names in the file to the names used in the model, and that will fix your problem.
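
The name conversion could look roughly like the sketch below. It is not the notebook's load_weights; the helpers base_name and map_layer_names are hypothetical, and the approach assumes that layer names have the form type_counter (such as embedding_1) and that layers of the same type appear in the same order in the file and in the model. The layer names in the example are made up for illustration.

```python
import re
from collections import defaultdict

def base_name(layer_name):
    """Strip a trailing counter, e.g. 'embedding_2' -> 'embedding'."""
    return re.sub(r'_\d+$', '', layer_name)

def map_layer_names(file_names, model_names):
    """Pair each layer name in the weights file with the next unused
    model layer of the same type, preserving order."""
    by_type = defaultdict(list)
    for name in model_names:
        by_type[base_name(name)].append(name)
    mapping = {}
    for name in file_names:
        candidates = by_type.get(base_name(name))
        if candidates:
            mapping[name] = candidates.pop(0)
    return mapping

# Illustrative names only: the file was saved from a first build of the
# network, and the model in memory comes from a second build.
file_names = ['embedding_1', 'lstm_1', 'lstm_2', 'dropout_1']
model_names = ['embedding_2', 'lstm_3', 'lstm_4', 'dropout_2']
mapping = map_layer_names(file_names, model_names)
```

Inside load_weights, the name read from the file would then be looked up in the mapping before asking the model for the layer. Restarting the kernel and running train.ipynb once from the top avoids the renumbering altogether.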

thirumalaipm commented 8 years ago

Hi, Thanks for the previous reply.

After some retries, I was able to progress further. But the training takes a long time. I stopped after 250 iterations of In [60] in the train.ipynb notebook.

I used that data file and ran prediction (predict.ipynb). I am getting an error at In [60].

KeyError Traceback (most recent call last)

in ()
----> 1 samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
     13 x = X_test[i]
     14 else:
---> 15 x = [word2idx[w.rstrip('^')] for w in X.split()]
     16
     17 if avoid:

KeyError: 'Billy'

Can you please let me know how to solve this issue? Also, what is the purpose of the ^ in the text? Can I try a single line of text without the ^ separation?

udibr commented 8 years ago

The '^' is just used for display, to indicate that a word is not in the vocabulary used for modeling. There is also an "external" vocabulary, word2idx, which includes all the words seen, not just the smaller (internal) vocabulary used for training. However, it looks like Billy is not in it. All you need to do is add it to the external vocabulary. For example, the following code will add new words (and will keep existing words unmodified):

word2idx[new_word] = word2idx.get(new_word, len(word2idx))
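
To see what that line does, here is a small self-contained sketch on a toy vocabulary. The helper add_to_vocabulary and the toy words are made up for illustration. The sketch assumes indices are contiguous from 0, so that len(word2idx) is always an unused index; if the real word2idx has gaps, a different free index would be needed.

```python
def add_to_vocabulary(word2idx, new_word):
    """Add new_word with the next free index; leave existing words untouched."""
    word2idx[new_word] = word2idx.get(new_word, len(word2idx))
    return word2idx[new_word]

# Toy external vocabulary (illustrative only).
word2idx = {'a': 0, 'the': 1, 'cat': 2}

billy_idx = add_to_vocabulary(word2idx, 'Billy')  # new word gets a new index
the_idx = add_to_vocabulary(word2idx, 'the')      # existing word keeps its index

# Same lookup as line 15 of the traceback, but adding unseen words first.
text = "Billy^ the cat"
x = [add_to_vocabulary(word2idx, w.rstrip('^')) for w in text.split()]
```

The right-hand side is evaluated before the assignment, so len(word2idx) is the size of the vocabulary before the new word is inserted.
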
thirumalaipm commented 8 years ago

Thanks udibr. This solution works.. :)