tkn-tub / veins-gym

Reinforcement Learning-based VANET simulations
https://www2.tkn.tu-berlin.de/software/veins-gym/
GNU General Public License v2.0

Env reset stops after some episodes #11

Closed Anas-1998 closed 2 years ago

Anas-1998 commented 2 years ago

Hello, I have a problem: during training, after roughly a thousand episodes (or some random number of them), `env.reset()` raises an error, even though it works for a lot of episodes before that. How can I find the reason? The error is shown below. Please let me know how I can fix this. Thanks.

```
ERROR:root:Veins instance with PID 3784104 timed out after 15.00 seconds
```


```
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
<ipython-input> in <module>
     69
     70 while(i != n_episodes):
---> 71     observation = env.reset()
     72     done = False
     73     score = 0

~/.local/lib/python3.8/site-packages/gym/wrappers/order_enforcing.py in reset(self, **kwargs)
     16     def reset(self, **kwargs):
     17         self._has_reset = True
---> 18         return self.env.reset(**kwargs)

~/.local/lib/python3.8/site-packages/veins_gym/__init__.py in reset(self)
    273         self._veins_shutdown_handler = veins_shutdown_handler
    274
--> 275         initial_request = self._parse_request(self._recv_request())[0]
    276         logging.info("Received first request from Veins, ready to run.")
    277         return initial_request

~/.local/lib/python3.8/site-packages/veins_gym/__init__.py in _parse_request(self, data)
    351         self.socket.send(init_msg.SerializeToString())
    352         # request next request (actual request with content)
--> 353         real_data = self._recv_request()
    354         real_request = veinsgym_pb2.Request()
    355         real_request.ParseFromString(real_data)

~/.local/lib/python3.8/site-packages/veins_gym/__init__.py in _recv_request(self)
    331             self._timeout,
    332         )
--> 333         raise TimeoutError(
    334             f"Veins instance did not send a request within {self._timeout}"
    335             " seconds"

TimeoutError: Veins instance did not send a request within 15.0 seconds
```
lionyouko commented 2 years ago

Hi, the simplest way to work around it is to set a longer timeout, say 30 s, via the timeout config when registering the env.
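For reference, this is roughly what that could look like. It is only a sketch: the env id and scenario path are placeholders, and I am assuming `VeinsEnv` accepts a `timeout` keyword at registration (the traceback above reads `self._timeout`, which suggests it is configurable):

```python
import gym

# Hypothetical registration sketch; adjust id/kwargs to your own scenario.
gym.register(
    id="veins-v1",  # placeholder env id
    entry_point="veins_gym:VeinsEnv",
    kwargs={
        "scenario_dir": "path/to/your/scenario",  # placeholder path
        "timeout": 30.0,  # seconds to wait for a request from Veins (default appears to be 15)
    },
)

env = gym.make("veins-v1")
```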

A proper solution would require analyzing why the new Veins process started by env.reset() sometimes fails to connect to the agent via ZMQ.

Try printing something every time your scenario starts, so you can see whether it starts correctly. You could also print something from each component you built for your scenario as it initializes, to see whether one in particular is taking too long.

Anas-1998 commented 2 years ago

Hello, thanks for your reply. I set it to 30 s and got the same error. The notebook console tells me: `Error: vector::_M_default_append Quitting (on error)`. My main problem is that it only happens after a lot of episodes; when I run a single env.reset(), it works perfectly. So I can't print as your solution suggests, because the printing does not show when I run it in a loop through the gym environment. Do you have any other suggestions?

lionyouko commented 2 years ago

> Hello, thanks for your reply. I set it to 30 s and got the same error. The notebook console tells me: `Error: vector::_M_default_append Quitting (on error)`. My main problem is that it only happens after a lot of episodes; when I run a single env.reset(), it works perfectly. So I can't print as your solution suggests, because the printing does not show when I run it in a loop through the gym environment. Do you have any other suggestions?

I am sorry, I have never faced this error before. It may yet occur in my current implementation for a project I am working on, but so far it hasn't, so I can't help.

You may want to post a screenshot of the console to help (the error in your first message above does not contain `Error: vector::_M_default_append`).

Anas-1998 commented 2 years ago

[screenshot of the Ubuntu console] So as you can see in the screenshot, every time the simulation starts it shows the warning about tau being lower than the time step, and when it shuts down it shows "quitting on recv, shutting down". But at some point it gives this vector error and quits, and I can't figure out what this vector error means.

serapergun commented 2 years ago

@Anas-1998 Did you modify serpentine-env to create a new project? If so, could you give me more detail, please?

I have a finished project, and now I want to integrate RL into it via veins-gym. But in the process I got many, many errors; one of them looked like your error.

Re-importing the project into the workspace and running a `snakemake -jall` step was helpful for me.

Regards,

dbuse commented 2 years ago

Hi @Anas-1998

It looks like you have some other errors occurring before the vector::_M_default_append error, and they are related to ZeroMQ (zmq::error_t). But as serapergun asked: how did you modify your scenario and implementation?

Anas-1998 commented 2 years ago

I took the serpentine env, removed the extra veins-vlc usage, and edited the application layer. It was working fine, but out of nowhere this problem popped up. Small edit: the zmq::error_t shows up in the earlier episodes because I am not doing clean shutdowns; I am using env.reset() to shut it down. Basically, I am starting a new episode before sending a shutdown command from the Veins process.

dbuse commented 2 years ago

> Basically I am starting a new episode before sending a shutdown command from the veins process.

This has a high potential to cause problems. If multiple Veins processes are running and/or no clean shutdown takes place, the ZMQ sockets may get confused: messages (or parts of them) may be received by different processes. So I would advise against this and rather suggest you ensure a clean shutdown first.
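The clean-shutdown pattern amounts to driving every episode until `done` is `True` before calling `reset()` again, so the Veins process reaches its normal shutdown path. A minimal sketch, where `ToyEnv` and the trivial policy are illustrative stand-ins (not veins-gym API):

```python
class ToyEnv:
    """Illustrative stand-in for a Veins-Gym env: the episode ends after 3 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial observation

    def step(self, action):
        self.t += 1
        # done=True means the simulation finished and (in the real env)
        # Veins would now send its shutdown message cleanly.
        done = self.t >= 3
        return self.t, 0.0, done, {}


def run_full_episode(env, policy):
    """Run one episode to completion; only then is it safe to reset() again."""
    obs = env.reset()
    done = False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
    return obs


final_obs = run_full_episode(ToyEnv(), policy=lambda obs: 0)
```

The point is simply that `reset()` is never called while an episode is still in flight, which is what was confusing the ZMQ sockets above.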

Beyond this, I suggest the following to find out more about the vector error: increase the timeout, switch to starting the Veins simulation manually, and run it with a debugger attached. This should give you a stack trace and show which vector throws the error and in what context.

Anas-1998 commented 2 years ago

Thanks for your reply. I understand what you are saying about starting it manually. However, the problem is that in the first 1000 episodes there are no errors, so nothing shows up when I attach a debugger. Do you have any idea how I can resume the training and skip the episode that triggers this error?

dbuse commented 2 years ago

If it only ever happens that rarely, it might be a race condition. And yes, those are hard to debug, which is why I stressed clean shutdown procedures so much.

You could probably extend veins-gym to retry failed episodes, probably in the reset() method. Though if the error happens in the middle of an episode, partial data has already reached your agent, so you will have to accept or handle that somehow.

Anyway, as this seems to be an issue with your modifications and not the published code of Veins-Gym, I'll close the issue for now.