google / dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
https://github.com/google/dopamine
Apache License 2.0

distribution in softmax_cross_entropy_with_logits #120

Open xlnwel opened 4 years ago

xlnwel commented 4 years ago

The following code uses the target distribution as the labels in softmax_cross_entropy_with_logits. Is this correct?

https://github.com/google/dopamine/blob/6463bfa8660daf17823825ab884b118d3a57ea4e/dopamine/agents/rainbow/rainbow_agent.py#L259

By the way, I find the code for the projection in C51 somewhat complicated. I use the following code to compute the projection in Eq. 7 of the paper:

y = tf.clip_by_value(supports, v_min, v_max)[:, None, :]               # [B, 1, N]
target_support = target_support[None, :, None]                         # [1, N, 1]

y = tf.clip_by_value(1. - tf.abs(y - target_support) / delta_z, 0, 1)  # [B, N, N]
y = tf.reduce_sum(y * weights[:, None, :], axis=2)                     # [B, N]
y = tf.stop_gradient(y)

All names except y have the same meaning as those defined in the function project_distribution. Do you think they are doing the same thing?
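(One way to convince yourself of the equivalence is to check the broadcasted form against a literal, atom-by-atom transcription of Eq. 7. A NumPy sketch of both, with made-up shapes and random inputs; all names here are illustrative:)

```python
import numpy as np

def project_loop(supports, weights, target_support):
    """Literal Eq. 7: accumulate each source atom's mass onto the fixed support."""
    v_min, v_max = target_support[0], target_support[-1]
    delta_z = target_support[1] - target_support[0]
    clipped = np.clip(supports, v_min, v_max)               # [B, N]
    B, N = supports.shape
    out = np.zeros((B, N))
    for b in range(B):
        for j in range(N):                                  # target atom z_j
            for i in range(N):                              # source atom
                out[b, j] += np.clip(
                    1. - abs(clipped[b, i] - target_support[j]) / delta_z,
                    0., 1.) * weights[b, i]
    return out

def project_vectorized(supports, weights, target_support):
    """The same projection via broadcasting, as in the snippet above."""
    v_min, v_max = target_support[0], target_support[-1]
    delta_z = target_support[1] - target_support[0]
    y = np.clip(supports, v_min, v_max)[:, None, :]         # [B, 1, N]
    ts = target_support[None, :, None]                      # [1, N, 1]
    y = np.clip(1. - np.abs(y - ts) / delta_z, 0., 1.)      # [B, N, N]
    return (y * weights[:, None, :]).sum(axis=2)            # [B, N]

target_support = np.linspace(-10., 10., 51)                 # fixed C51 support
rng = np.random.default_rng(0)
supports = rng.uniform(-12., 12., size=(4, 51))             # Bellman-updated atoms
weights = rng.dirichlet(np.ones(51), size=4)                # per-atom probabilities
assert np.allclose(project_loop(supports, weights, target_support),
                   project_vectorized(supports, weights, target_support))
```

The triangular "hat" kernel clip(1 - |z - z_j| / delta_z, 0, 1) splits each source atom's mass between its two nearest target atoms, so each row of the result remains a valid distribution.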

psc-g commented 4 years ago

hi, regarding your first question, is there something that makes you believe it's not correct?

regarding your second question, one way you can verify this is by running the unit test.

xlnwel commented 4 years ago

Hi, sorry for the first misunderstanding, my bad. I replaced the code starting from here with the following code.

  with tf.control_dependencies(validate_deps):
    # Ex: `v_min, v_max = 4, 8`.
    v_min, v_max = target_support[0], target_support[-1]
    y = tf.clip_by_value(supports, v_min, v_max)[:, None, :]               # [B, 1, N]
    target_support = target_support[None, :, None]                         # [1, N, 1]

    y = tf.clip_by_value(1. - tf.abs(y - target_support) / delta_z, 0, 1)  # [B, N, N]
    y = tf.reduce_sum(y * weights[:, None, :], axis=2)                     # [B, N]
    projection = tf.stop_gradient(y)
    return projection

The test seems to suggest it's okay. Does that mean the second one is correct?

[screenshot: unit test passing]

Last, thanks for your help!

psc-g commented 4 years ago

yup, seems like the second one is a clean refactoring! :)


xlnwel commented 4 years ago

May I open a pull request with this solution to simplify the code?

psc-g commented 4 years ago

if you don't mind, let's hold off on this for a bit. we're still sorting out some issues with external pull requests. but thank you!


xlnwel commented 4 years ago

Sure, no problem. But if you find anything wrong with this implementation, please let me know.

By the way, thanks for your open-source code. That's awesome.