udibr / headlines

Automatically generate headlines to short articles
MIT License

Keyerror when running gensamples in predict in Tensorflow #13

Closed xtr33me closed 7 years ago

xtr33me commented 7 years ago

I was wondering whether you have had any issues running predict in TensorFlow, as I ran into a few. Once I get them all resolved, I will submit a pull request. The first error I received stated that sizes 25 and 50 were not compatible. I was able to resolve this by setting maxlend to 25. After that, I ran into issues with the K.switch call that assigns activation_energies. I noticed that this was similar to the logic that used to be in training's simple_context, and since it was changed in training, I assumed that making the same change here would resolve the problem, which it did. The modification in predict's simple_context was to replace the previous assignment with:

activation_energies = activation_energies + -1e20*K.expand_dims(1.-K.cast(mask[:, :maxlend],'float32'),1)
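To illustrate what this line does, here is a minimal NumPy sketch of the same additive masking. It assumes that `mask` holds 1 for real tokens and 0 for padding, and that the energies are passed through a softmax afterwards. The function name `masked_softmax` and the toy shapes are hypothetical and do not come from the notebook.

```python
import numpy as np

def masked_softmax(energies, mask, maxlend):
    """energies: (batch, maxlenh, maxlend); mask: (batch, maxlen), 1 = real token."""
    # 1.0 where a description position is padding, 0.0 where it is a real token
    pad = 1.0 - mask[:, :maxlend].astype("float32")
    # Add a huge negative value at padded positions; expand_dims(..., 1) gives
    # shape (batch, 1, maxlend) so it broadcasts over the headline axis
    energies = energies + -1e20 * np.expand_dims(pad, 1)
    # Numerically stable softmax over the last axis
    e = np.exp(energies - energies.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy input (assumed values): 4 description positions, the last two are padding
energies = np.zeros((1, 2, 4), dtype="float32")
mask = np.array([[1, 1, 0, 0, 1, 1]])
weights = masked_softmax(energies, mask, maxlend=4)
print(weights[0, 0])
```

Because the padded positions receive an energy of about -1e20, their softmax weight is effectively zero, and the attention is spread over the real tokens only.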

I then received a KeyError: '*' when running gensamples against the Billy Joel entry. After looking through other issues on GitHub, I saw that this was due to the value not being in the dictionary. So I modified the else at the top of gensamples to:

else:
        for w in X.split():
            w = w.rstrip('^')
            if w not in word2idx:
                word2idx[w] = len(word2idx)

        x = [word2idx[w.rstrip('^')] for w in X.split()]

Now when I run gensamples in cell [43], I get the error below, and I don't quite understand why the implementation would give me an index that is out of range. I am able to get around this by modifying the for loop to also check whether w is outside the range of idx2word, but this seems hacky and incorrect. For now, this is the problem whose source I have been trying to track down. If you have any insight, I'd love to hear it. Thanks, and I will let you know if I get something worked out.

for w in sample:
    if w == eos or w >= len(idx2word):
        break

ERROR I AM GETTING:

HEADS: 17.3501874208 analysts kopparbergs 27.0937678814 cello better” owners 31.910461247 firefox fisher your adhesives


KeyError Traceback (most recent call last)

in ()
----> 1 samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
     52                 if w == eos:
     53                     break
---> 54                 words.append(idx2word[w])
     55                 code += chr(w//(256*256)) + chr((w//256)%256) + chr(w%256)
     56             if short:

KeyError: 74477

---

This is the output of cell [12] for me:

> dimension of embedding space for words 100
> vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
> total number of different words 74477 74477
> number of words outside vocabulary which we can substitue using glove similarity 12519
> number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 21958
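The missing key (74477) equals the "total number of different words" reported by cell [12], which is the value len(word2idx) would return for the first word added by the workaround above. One possible explanation, assuming idx2word was built before the new entries were added, is that the new index exists in word2idx but has no counterpart in idx2word. The sketch below reproduces that mismatch on a toy vocabulary. The `<unk>` placeholder, the helper names, and all values are hypothetical and are not taken from the notebook.

```python
# Toy vocabulary (illustrative only, not the notebook's data)
word2idx = {"<empty>": 0, "<eos>": 1, "<unk>": 2, "billy": 3, "joel": 4}
# Reverse mapping built once, before any new words are added
idx2word = {i: w for w, i in word2idx.items()}

def encode_growing(text, vocab):
    """Mirror of the workaround: unseen words get the next free index."""
    ids = []
    for w in text.split():
        w = w.rstrip('^')
        if w not in vocab:
            vocab[w] = len(vocab)
        ids.append(vocab[w])
    return ids

def encode_with_unk(text, vocab, unk_idx):
    """Alternative: map unseen words to a reserved placeholder index."""
    return [vocab.get(w.rstrip('^'), unk_idx) for w in text.split()]

grown = dict(word2idx)  # work on a copy so the two approaches can be compared
ids = encode_growing("billy * joel", grown)
# Indices that would raise KeyError in idx2word[w]
missing = [i for i in ids if i not in idx2word]

safe_ids = encode_with_unk("billy * joel", word2idx, unk_idx=word2idx["<unk>"])
print(ids, missing, safe_ids)
```

With the growing approach, the new index also has no row in a fixed-size embedding matrix, so mapping unseen words to a placeholder that the model already knows may be the safer option.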
xtr33me commented 7 years ago

This seems to have been more of an issue with the amount of data I had trained against. After adjusting sample_size and a few other settings on my side, I am no longer seeing the error.