Hi @wwb1987wave,
For your first question, you can safely ignore this situation and press Ctrl + C
to terminate the process. In general, I recommend setting a large value for num_episodes
and stopping training manually once the reward curve stabilizes.
As for the second question, please refer to this issue for further guidance.
Thank you very much!
@venturi123 I'm sorry to bother you again. I have four files named 'total_reward.csv', one in each of env01, env02, env03, and env04. Which parameter should be used to determine convergence? I also find that the case in env05 runs much more slowly than the other cases. I am confused about how these envs work together. My understanding is that the envs run synchronously and receive the actions of the agent simultaneously. Can you give me some hints?
@venturi123 Why does the calculation proceed differently in the different cases (env01, env02, ...) when the initial settings are the same? Is it because they receive different controls from the agent? What causes this difference? I have been trying to understand DRLinFluids by reading the code these days, but I am still confused about the above issues. I think your professional advice could save me a lot of time. Thank you.
Hi @wwb1987wave,
You can average the total_reward of these envs to evaluate whether the agent has converged. Your understanding is correct: these environments run synchronously and receive actions from the agent simultaneously. Regarding the issue with env05, we also found it to be very slow during training (related to a setting in Tensorforce), which is why we only analyzed the data from the first four environments. You can find the relevant settings in the cylinder case study section of our paper.
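As a rough illustration (not the project's own post-processing script), the sketch below averages the four total_reward.csv files and applies a simple stabilization check; the file layout, column index, window size, and tolerance are assumptions you may need to adapt to your own output files.

import numpy as np
import pandas as pd

# Average the episode rewards of env01-env04 and check whether the averaged
# curve has stabilized. Assumption: each total_reward.csv stores one reward
# value per episode in its first column (no header).
env_dirs = ["env01", "env02", "env03", "env04"]
rewards = [
    pd.read_csv(f"{d}/total_reward.csv", header=None).iloc[:, 0].to_numpy()
    for d in env_dirs
]

# Truncate to the shortest run so the curves can be stacked and averaged.
n = min(len(r) for r in rewards)
mean_reward = np.mean([r[:n] for r in rewards], axis=0)

# Simple stabilization check: the spread of the averaged curve over the last
# `window` episodes is small compared with its overall magnitude.
window = 20
tol = 0.05 * np.abs(mean_reward).max()
if n >= window and np.std(mean_reward[-window:]) < tol:
    print("Averaged reward curve looks stable; training can be stopped.")
else:
    print("Averaged reward curve is still changing; keep training.")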
Even with identical initial conditions, different envs can produce different trajectories due to variations in random seeds. This randomness generally arises from the agent's exploration strategy. The training process of DRL often involves extensive exploration, where the agent continuously tries new actions. These actions depend on the current policy network output and the exploration strategy (e.g., ε-greedy or random noise).
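To make the point about exploration concrete, here is a tiny illustrative sketch (not DRLinFluids or Tensorforce code): the same deterministic policy output, combined with exploration noise drawn from differently seeded random generators, yields different actions and therefore different trajectories for otherwise identical environments.

import numpy as np

# Illustrative only: identical policy output + differently seeded exploration
# noise -> different actions, hence different environment trajectories.
def noisy_action(policy_output, rng, noise_std=0.1):
    return policy_output + rng.normal(scale=noise_std, size=policy_output.shape)

policy_output = np.array([0.5])   # pretend this is the current policy's action
for seed in (1, 2, 3, 4):         # e.g. env01 ... env04
    rng = np.random.default_rng(seed)
    actions = [noisy_action(policy_output, rng) for _ in range(3)]
    print(f"seed {seed}:", np.round(np.concatenate(actions), 3))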
We hope this response helps answer your questions :D
@1900360 Thank you for your prompt response; it is very helpful. I found that 'env05' is removed from the environments list in the program here:
use_best_model = True

if use_best_model:
    evaluation_environment = environments.pop()
else:
    evaluation_environment = None
so only 'env01~env04' are used in the Runner. That is, 'env05' is only initialized but never used during the training process. I don't understand why.
Hi @wwb1987wave,
env05 is indeed only initialized and is not used during the training process. As the snippet you quoted shows, when use_best_model is True the last environment is popped from the list and reserved as the evaluation environment, so it does not take part in training. In the current cylinder training, we use four envs and still achieve convergence of the control policy. Please continue your runs and let us know if you encounter any other issues.
I am new to DRL. I changed num_episodes in DRLinFluids_cylinder/launch_multiprocessing_traning_cylinder.py to 30 and ran "python DRLinFluids_cylinder/launch_multiprocessing_traning_cylinder.py". However, the program gets stuck at the end; can you give me some suggestions? I have another question: how can I use the trained model?