sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License

Sampling probabilities do not sum to 1 #1

Closed patroclos closed 7 years ago

patroclos commented 8 years ago

When I do `print repr(sum(p))` it always gives me numbers very close to 1.0, like 1.00000052 or 0.9999999248.

    Traceback (most recent call last):
      File "sample.py", line 38, in <module>
        main()
      File "sample.py", line 21, in main
        sample(args)
      File "sample.py", line 35, in sample
        print model.sample(sess, chars, vocab, args.n, args.prime)
      File "/home/patro/Documents/Programming/NN/char-rnn-tensorflow/model.py", line 77, in sample
        sample = int(np.random.choice(len(p),p=p))
      File "mtrand.pyx", line 1094, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:10565)
    ValueError: probabilities do not sum to 1
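[Editor's note] The failure mode is easy to reproduce on its own: accumulating floating-point probabilities rarely yields exactly 1.0, and `np.random.choice` only tolerates a tiny deviation. A minimal sketch (the vector here is made up; a float32 softmax over a large vocabulary can drift far enough to trip the check, even though this tiny float64 example stays within tolerance):

```python
# Ten probabilities of 0.1 "should" sum to 1.0, but binary
# floating point cannot represent 0.1 exactly, so the running
# sum drifts by one unit in the last place.
p = [0.1] * 10
print(sum(p) == 1.0)   # False
print(repr(sum(p)))    # 0.9999999999999999
```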

sherjilozair commented 8 years ago

Interesting. I haven't faced this myself.

How can I reproduce this?

Could you also mention your Numpy version?

patroclos commented 8 years ago

I am on 64-bit Linux with numpy 1.9.2, and this happens when I run sample.py.

patroclos commented 8 years ago

So, I worked around this issue by doing the random choice manually, since the sum does not differ much from an actual 1.0.

I replaced sample = int(np.random.choice(len(p), p=p)) with:

    import random  # needed at the top of model.py

    rn = random.random()
    cnt = 0.0
    for i in range(len(p)):
        cnt += p[i]
        if rn <= cnt:
            sample = i
            break
    else:
        # fallback in case rounding leaves sum(p) slightly below rn,
        # so that sample is always assigned
        sample = len(p) - 1

sherjilozair commented 8 years ago

You could try Numpy 1.10; that's the version I am using. Alternatively, explicitly normalize with p /= p.sum().
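[Editor's note] A normalization along those lines might look like this (a sketch, not the repo's actual code; the probability vector is a made-up stand-in for the model's output). Casting to float64 before dividing keeps the renormalized sum well within the tolerance that `np.random.choice` enforces:

```python
import numpy as np

# Hypothetical float32 network output standing in for the
# model's probability vector (65 = a typical small vocab size)
p = np.random.rand(65).astype(np.float32)

# Renormalize in float64 so the entries sum to 1 within
# np.random.choice's tolerance
p = p.astype(np.float64)
p /= p.sum()

sample = int(np.random.choice(len(p), p=p))
print(sample)
```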

Thanks for posting your fix.

patroclos commented 8 years ago

I already tried explicitly normalizing p, but that did not help. I will try changing my numpy version later.

nfmcclure commented 8 years ago

I too had this problem. Here is my ad-hoc fix in the model.py file:

    # EDIT BELOW:
    # Reason for edit: the sampling line below requires sum(p) == 1,
    # which is not always the case due to floating-point rounding.
    # Instead, implement a sampling algorithm that samples per
    # weighted p.
    def sample(self, sess, chars, vocab, num=200, prime='The '):
        state = self.cell.zero_state(1, tf.float32).eval()
        for char in prime[:-1]:
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [state] = sess.run([self.final_state], feed)

        def weighted_pick(weights):
            t = np.cumsum(weights)
            s = np.sum(weights)
            return int(np.searchsorted(t, np.random.rand(1) * s))

        ret = prime
        char = prime[-1]
        for n in xrange(num):
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [probs, state] = sess.run([self.probs, self.final_state], feed)
            p = probs[0]
            # sample = int(np.random.choice(len(p), p=p))
            sample = weighted_pick(p)
            pred = chars[sample]
            ret += pred
            char = pred
        return ret

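[Editor's note] The weighted_pick helper is worth understanding on its own: a uniform draw in [0, sum(weights)) is bucketed by the cumulative sums, so it works for any non-negative weight vector whether or not it sums to exactly 1. A standalone sketch (the weights here are made up):

```python
import numpy as np

def weighted_pick(weights):
    # cumulative thresholds: t[i] = weights[0] + ... + weights[i]
    t = np.cumsum(weights)
    s = np.sum(weights)
    # a uniform draw in [0, s) lands in bucket i with
    # probability weights[i] / s -- no need for sum == 1
    return int(np.searchsorted(t, np.random.rand(1) * s))

np.random.seed(0)
weights = [0.2, 0.3, 0.5]
counts = [0, 0, 0]
for _ in range(10000):
    counts[weighted_pick(weights)] += 1
print([c / 10000.0 for c in counts])  # roughly [0.2, 0.3, 0.5]
```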
sherjilozair commented 8 years ago

Thanks @nfmcclure. This seems like a bug in numpy. A similar issue was raised for Theano as well. I like your solution. Do you mind sending in a pull request for the same? Thanks.

nfmcclure commented 8 years ago

No problem. Thanks for doing this, by the way! It was on my list of projects to do.

ronxin commented 8 years ago

Thank you, @nfmcclure. I had the same problem and your fix works perfectly.

soobrosa commented 8 years ago

Hi, after training and hard patching with https://github.com/sherjilozair/char-rnn-tensorflow/pull/2 I get

    $ python sample.py
    can't determine number of CPU cores: assuming 4
    I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 4
    can't determine number of CPU cores: assuming 4
    I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 4
    None

What could I be doing wrong? (I'm on OS X.)

sherjilozair commented 8 years ago

@soobrosa, The first 4 lines are expected. The last one which says "None" does not look good. Could you perhaps add print statements to see what is outputting that particular "None"?

Also, I've merged in @nfmcclure's patch (Thanks!). So, you could try again with the latest master. Maybe you introduced a bug when you patched it yourself.

soobrosa commented 8 years ago

@sherjilozair model.sample(sess, chars, vocab, args.n, args.prime) is None.
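[Editor's note] One plausible cause, an assumption on my part and not confirmed in this thread: if the hand-applied patch dropped the method's final return ret, the call evaluates to None, which print happily displays.

```python
# Minimal sketch: a hypothetical sample() whose trailing
# "return ret" was lost while patching by hand
def sample_without_return():
    ret = "some generated text"
    # ... no return statement, so the call yields None

print(sample_without_return())  # None
```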

I trained, then tried to sample.

            print ckpt
            print ckpt.model_checkpoint_path

ends up in

    model_checkpoint_path: "save/model.ckpt-22000"
    all_model_checkpoint_paths: "save/model.ckpt-18000"
    all_model_checkpoint_paths: "save/model.ckpt-19000"
    all_model_checkpoint_paths: "save/model.ckpt-20000"
    all_model_checkpoint_paths: "save/model.ckpt-21000"
    all_model_checkpoint_paths: "save/model.ckpt-22000"

    save/model.ckpt-22000

sherjilozair commented 8 years ago

Could you do a fresh checkout, and then run

  1. python train.py (let this run for 1000 iterations, then kill it after the 1001st iteration)
  2. python sample.py

Do not pass any arguments to either script for this sanity check.

If this works, then we can proceed to understanding why it's failing to work for your setup.

soobrosa commented 8 years ago

Thanks, after ~1000 iterations it spits out the funny text.

ubergarm commented 7 years ago

Looks like y'all got it working, cleaning up some older issues. Thanks!