shenbiachao / DPLAN


Hello, I'd like to ask some questions about training #1

Open Ostrich5yw opened 2 years ago

Ostrich5yw commented 2 years ago

I found this part of the code:

```python
if tot_env_steps < self.start_timestep:
    continue
```

Why do we need to skip loss updates in the first `start_timestep` rounds? Another question: in the 30 training iterations, is the result of each iteration independent? And do you use the best result or the average as the basis? I would appreciate your reply :)

shenbiachao commented 2 years ago

Good afternoon!

For your first question, it is common practice in RL to start updating parameters only after some warmup timesteps. The replay buffer in DQN is empty at initialization; if the batch size is set to 32, the buffer needs at least 32 transitions before a single batch can even be sampled, and a longer warmup ensures the agent has a reasonably large buffer to learn from. The value of start_timestep is given in the original paper "Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data", page 11, section A.3.2.
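For illustration, here is a minimal, self-contained sketch of that warmup pattern. The buffer and the fake environment interaction are toy stand-ins rather than this repository's code; only the `start_timestep` check mirrors the snippet quoted in the question.

```python
# Toy sketch of the DQN warmup pattern: collect transitions every step,
# but skip gradient updates until the buffer holds enough of them.
import random
from collections import deque

class ToyReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

start_timestep = 100   # warmup steps before any gradient update
batch_size = 32
buffer = ToyReplayBuffer()

tot_env_steps = 0
for _ in range(500):
    # stand-in for one environment step: (obs, action, reward, next_obs, done)
    buffer.add((random.random(), 0, random.random(), random.random(), False))
    tot_env_steps += 1

    # Skip parameter updates until the buffer holds enough transitions.
    if tot_env_steps < start_timestep:
        continue

    batch = buffer.sample(batch_size)
    # ... compute the TD loss on `batch` and update the Q-network here ...
```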

For your second question, the results of the 30 training iterations are not independent, since the agent's network is continuously updated throughout the whole training stage. I use the result of the last iteration as the basis.
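A rough sketch of that outer loop, with a toy agent and placeholder functions standing in for this repository's actual training and evaluation code:

```python
# 30 iterations share one continuously-updated agent, and the last
# iteration's evaluation is the reported result.
class ToyAgent:
    def __init__(self):
        self.quality = 0.0

def train_one_iteration(agent):
    agent.quality += 1.0          # stand-in for one round of DQN updates

def evaluate(agent):
    return agent.quality          # stand-in for the paper's evaluation metric

agent = ToyAgent()
score = None
for _ in range(30):
    train_one_iteration(agent)    # the same agent keeps learning each iteration
    score = evaluate(agent)       # so the 30 results are not independent runs
print("reported result:", score)  # the last iteration's result is the basis
```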

This code is just my simple reproduction of the paper "Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data"; you can refer to the original paper for more details.

Best regards!

Ostrich5yw commented 2 years ago

Your reply is very helpful to me, thank you!

Elii-hyy commented 2 years ago

Hello, after reading your code, I have some questions about several variables in the TDReplayBuffer class in the dqn_trainer.py file, such as n, n_step_obs_buffer, discounted_reward_buffer, n_step_done_buffer, and n_count_buffer. What are these variables for? Could I also ask you about the logic of the add_tuple() function that adds transitions to the replay buffer? The for loop and the if/else parts of this function are not easy to understand. I would appreciate it if you could answer my questions.
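The variable names above are the standard ingredients of an n-step return buffer. Purely as a generic sketch of how such a buffer typically accumulates transitions, and not the actual add_tuple() implementation from this repository:

```python
# Generic n-step accumulator: each incoming reward is folded into the
# discounted return of every still-open transition, and a transition is
# flushed to storage once it spans n steps or the episode ends.
from collections import deque

class NStepAccumulator:
    def __init__(self, n=3, gamma=0.99):
        self.n = n                 # number of steps folded into each return
        self.gamma = gamma
        self.pending = deque()     # transitions still collecting rewards
        self.storage = []          # finished n-step transitions

    def add_tuple(self, obs, action, reward, next_obs, done):
        # open a new transition: [obs, action, n-step return, step count]
        self.pending.append([obs, action, 0.0, 0])
        for entry in self.pending:
            entry[2] += (self.gamma ** entry[3]) * reward
            entry[3] += 1
        # close transitions that have seen n rewards, or all of them on done
        while self.pending and (self.pending[0][3] >= self.n or done):
            o, a, g, _ = self.pending.popleft()
            self.storage.append((o, a, g, next_obs, done))
```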