jerabaul29 / Cylinder2DFlowControlDRLParallel

Parallelizing DRL for Active Flow control

Encountered an error when running your code #9

Closed weich97 closed 2 years ago

weich97 commented 2 years ago

Hi,

I am encountering an error when running the code. Have you ever encountered the following error before?

msg = pickle.loads(msg)
EOFError: Ran out of input
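For reference, pickle.loads raises this error when it receives an empty byte string, e.g. when the peer closes the socket before sending any data. A minimal reproduction, independent of the repository code:

```python
import pickle

# pickle.loads raises "EOFError: Ran out of input" on empty bytes, which is
# what the learner would see if a simulation server sent nothing back.
try:
    pickle.loads(b"")
except EOFError as err:
    print(err)  # -> Ran out of input
```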

Thanks,

Weicheng

jerabaul29 commented 2 years ago

No, I have not had this issue myself. How did you start the code? Are you using the container and the .sh parallel training scripts?

weich97 commented 2 years ago

Yes, I am using the container and the .sh parallel training script, with the commands listed here: https://github.com/weich97/Cylinder2DFlowControlDRLParallel/blob/master/Docker/Code_Location_use_docker_Fenics_Tensorforce_parallel.md . I once successfully ran the case using 4 servers (bash script_launch_parallel.sh training 8000 4). I found that the speed was not satisfactory, so I switched to 24 servers. Before running with 24, I deleted all the output files generated by the 4-server run, in order to avoid unexpected errors. I keep getting the pickle loading error mentioned above, so I am asking here. Any insight? Thanks!

jerabaul29 commented 2 years ago

Mmmh, strange. My 2 cents, a bit difficult without seeing things myself: you are running the code correctly through the Docker container; in particular, this is confirmed by the fact that you were able to run it at least once :) . The error message looks like one of the simulation servers is not providing the data it should. One possibility is that when "cleaning up", you either forgot to clean some folders or files, or "cleaned up" too much and removed folders or files that are needed. Another possibility is that you are trying to start too many simulations and this somehow causes a problem (though I would not expect so; but if you had a machine with only 8 CPUs, for example, it may take quite some time to start 24 simulations, and these may not be ready before the learning script tries to connect).
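To illustrate that last point, one could make sure every server is actually accepting connections before starting the learner. This is a hypothetical helper, not part of the repository; host, ports and timeouts are placeholders:

```python
import socket
import time

# Hypothetical helper (not from the repository): wait until a simulation
# server accepts TCP connections, so the learning script does not try to
# talk to a server whose CFD solver is still starting up.
def wait_for_server(host, port, timeout=600.0, poll_interval=5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            time.sleep(poll_interval)
    return False

# e.g. wait for 24 servers starting at port 8000 before launching training
all_ready = all(wait_for_server("localhost", 8000 + i) for i in range(24))
```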

I have used the code myself with 20 CFD simulations running through the container, and this worked just fine; could you try with 20, and with 24, and see if things are different?

What I would recommend is to "start from the start", i.e., spin up a new container from the Docker image and start from scratch inside it :) . This way, you are guaranteed to start from a "good" state of the code :) . This is the whole point of Docker images: you can spin up containers from them in a fully reproducible way :) .

Another thing you can try: spin up a run with 4 servers, as you did before. If it no longer works, something got broken during the "cleanup".

One last point: you need all the ports from "port start" to "port start + number of servers - 1" to be available; so if you start at port 8000 with 24 servers, but port 8016 or any other port between 8000 and 8023 is taken, you will get an error.
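A quick way to check this before launching (a standard-library sketch; the helper name is made up and not part of the repository):

```python
import socket

# Hypothetical check (not part of the repository): report which ports in the
# range [start, start + n_servers) cannot be bound, i.e. are already taken.
def find_busy_ports(start, n_servers, host="localhost"):
    busy = []
    for port in range(start, start + n_servers):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind((host, port))
            except OSError:
                busy.append(port)
    return busy

print(find_busy_ports(8000, 24))  # an empty list means ports 8000-8023 are free
```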

weich97 commented 2 years ago

Thanks very much for your reply! I have been playing around with the code for a week. Yes, I carefully compared the code before and after running, and only removed the files generated by the run. Also, my machine has 48 cores (with hyper-threading), so I would guess 24 is fine.

Yes, I will try 20 and see if that works.

I also did what you recommended, i.e., "start from the start". The first time it worked, but not the following times (I've tried several times). I guess there must have been some difference between the first run and the following ones, so I will test it further and let you know how it goes.

Yes, I also tried starting with only 4 servers, but now that fails too. I will test it more.

I also tried changing the starting port number from 8000 to a different number.

Now I have a new machine and everything is new, including the git repo and the Docker container (newly installed). I am using 8 servers and still get this "ran out of input" issue. I will dig into it and see what causes it.

jerabaul29 commented 2 years ago

Strange. I have not experienced it myself.

weich97 commented 2 years ago

I forgot to mention: I changed line 81 in launch_parallel_training.py from num_episodes=400 to num_episodes=10 or 2, as I just wanted to run the code for fewer epochs. I regarded num_episodes as the number of epochs (if I am wrong, please let me know, thanks!). Using num_episodes=400 took too much time (I once used one core to run for 28 hours and it only finished 32 epochs). Now I am setting this variable to 20, and it seems to run for more steps (still running). I would guess setting it back to 400 should work (I will test it).

weich97 commented 2 years ago

I later found out that num_episodes is not equal to the number of epochs. So, which variable controls the number of epochs in the code? Another question: is there a dependency between num_episodes and the number of servers used? Maybe if num_episodes is too small and the number of servers is too large, there is not enough data to feed some servers, which causes this error? Or should num_episodes be divisible by the number of servers? I am guessing there may be a relationship somewhere.

jerabaul29 commented 2 years ago

Ahh, this is likely the problem, I think. If your num_episodes is smaller than the number of servers, weird things may happen. Can you try increasing num_episodes to at least 3 or 4 times the total number of servers? :) .
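As a rough sketch of that rule of thumb (only num_episodes comes from launch_parallel_training.py; the other names below are made up for illustration):

```python
# Rule-of-thumb sketch: give each parallel simulation server several episodes
# of work, so none of them sits idle waiting for the learner.
n_servers = 24                  # number of parallel simulation servers
episodes_per_server = 4         # at least 3-4 per server, per the advice above
num_episodes = max(400, episodes_per_server * n_servers)

assert num_episodes >= 3 * n_servers, (
    "num_episodes should be several times the number of servers"
)
```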

This sounds very slow... Are your cores low-frequency / low-performance cores? A couple of years ago, I could run a few hundred episodes at Re = 100 within typically 24-48 hours on a modern desktop computer with a Core i7... Or did you change the mesh?

jerabaul29 commented 2 years ago

To make this clearer in your mind, I suggest that you read a bit more about the "foundations" of DRL. Can you try reading the following a few times, making sure that you understand all the details:

https://www.researchgate.net/profile/Jean-Rabault-2/publication/343934046_DEEP_REINFORCEMENT_LEARNING_APPLIED_TO_ACTIVE_FLOW_CONTROL/links/5fa4fad6a6fdcc062418972c/DEEP-REINFORCEMENT-LEARNING-APPLIED-TO-ACTIVE-FLOW-CONTROL.pdf .

https://github.com/jerabaul29/slides_DRL_FluidMechanics/blob/master/TemplateBeamer.pdf

but in 2 words:

It feels like your issue may be related to some misunderstanding of these concepts, and a resulting "bad" use of the code. Unfortunately, I will not really have time to give a "private crash course" in DRL.

weich97 commented 2 years ago

Okay. I can increase num_episodes (it is 20 now, but I will try 4 times the total number of servers). So there is a dependency between num_episodes and the number of servers. This also answers my second question: num_episodes may need to be divisible by the number of servers, otherwise some servers may sit idle for the last episode. So how do I control the number of epochs?

I am using an Intel Xeon 2245 3.9 GHz 8-core CPU. I have not changed the mesh yet, as I first want to see what output files I can get. I will later adapt this to another problem.

weich97 commented 2 years ago

Okay. That would be very helpful. I will read the paper. Thanks!

jerabaul29 commented 2 years ago

Ok, weird, that should be a fast processor.

A small note: hyper-threading is not really "true multicore"; i.e., if you have 8 cores with hyper-threading, it may actually be faster to run 8, not 16, simulations (and it may even be faster to run 7 simulations, leaving 1 core free for "extra stuff"). I was typically running the parallel version of the code on bigger CPUs than yours, server CPUs with 48 cores, usually putting 2 trainings at 20 cores each per "socket".
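If you want to check what your machine actually exposes (standard library only; the physical-core figure below is just an estimate):

```python
import os

# os.cpu_count() reports *logical* CPUs, which includes hyper-threads; on a
# typical Intel CPU with hyper-threading, physical cores are half of this.
logical_cpus = os.cpu_count() or 1
print(f"logical CPUs: {logical_cpus}")
print(f"estimated physical cores: {logical_cpus // 2}")
# If psutil is installed, psutil.cpu_count(logical=False) gives the exact
# physical core count instead of this estimate.
```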

It does not need to be exactly divisible, but having some servers that are "never used" may lead to some issues.

weich97 commented 2 years ago

Okay, I see. Thanks so much!