instadeepai / jumanji

🕹️ A diverse suite of scalable reinforcement learning environments in JAX
https://instadeepai.github.io/jumanji
Apache License 2.0

Implement Search & Rescue Multi-Agent Environment #259

Open · zombie-einstein opened 3 weeks ago

zombie-einstein commented 3 weeks ago

Add a multi-agent search and rescue environment where a set of agents has to locate moving targets in a 2D space.

Changes

Todo

Questions

CLAassistant commented 3 weeks ago

CLA assistant check
All committers have signed the CLA.

zombie-einstein commented 3 weeks ago

Here you go @sash-a, this is correct now. Will take a look at the contributor license and CI failure now.

zombie-einstein commented 3 weeks ago

I think the CI issue is that I've set Esquilax to require Python >=3.10. It seems you have a PR open to upgrade the Python version, is it worth holding on for that?

sash-a commented 3 weeks ago

Python version PR is merged now so hopefully it will pass :smile:

Should have time during the week to review this, really appreciate the contribution!

sash-a commented 3 weeks ago

As for your questions in the description:

I only forwarded the Environment import to jumanji.environments, do types also need forwarding somewhere?

Nope just the environment is fine

I didn't add an animate method to the environment, but saw that some others do? Easy enough to add.

Please do add animation, it's a great help.

Do you want defaults for all the environment parameters? Not sure there are really "natural" choices, but could add sensible defaults to avoid some typing.

We do want defaults, I think we can discuss what makes sense.

Are the API docs auto-generated somehow, or do I need to add a link manually?

It's generated with mkdocs; we need an entry in docs/api/environments and mkdocs.yml. See this recently closed PR for an example of which files we change.

One big thing I've realized after my review is that this is missing training code. We like to validate that the env works. I'm not 100% sure if this is possible because the env has two teams, so which reward do you optimize? Maybe train against a simple heuristic, e.g. you are the predator and the prey moves randomly? For examples see the training folder, you should only need to create a network. An example of this should also be in the above PR.

zombie-einstein commented 3 weeks ago

Hi @sash-a, just merged changes that I think address all the comments, and add the animate method and API docs link.

Not quite sure on the new swarms package, but also not sure where else we would put it. Not sure on it especially if we only have 1 env and no new ones planned.

Could you have something like a multi-agent package? I don't think you have anything similar at the moment? FYI I was intending to add a couple more swarm/flock type envs if this one went ok.

One thing I don't quite understand is the benefit of amap over vmap specifically in the case of this env?

Yeah, in a couple of cases using it is overkill, a hang-over from when I was writing this example with an Esquilax demo in mind! It makes sense to use vmap instead if the other arguments are not being used.
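
(For reference, a minimal sketch of the plain vmap alternative; the per-agent update and parameter names here are made up for illustration, not the env's actual code.)

```python
import jax
import jax.numpy as jnp

# Hypothetical per-agent step: move each agent along its heading at its speed,
# wrapping positions on the unit square used by Esquilax.
def step_agent(pos: jnp.ndarray, heading: jnp.ndarray, speed: jnp.ndarray) -> jnp.ndarray:
    velocity = speed * jnp.stack([jnp.cos(heading), jnp.sin(heading)])
    return (pos + velocity) % 1.0

# When the update needs no neighbour/observation arguments, a plain vmap over
# the agent axis is enough, without reaching for esquilax's amap transform.
step_all = jax.vmap(step_agent, in_axes=(0, 0, 0))

positions = jnp.zeros((8, 2))
headings = jnp.linspace(0.0, 2.0 * jnp.pi, 8)
speeds = jnp.full((8,), 0.01)
new_positions = step_all(positions, headings, speeds)  # (8, 2)
```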

zombie-einstein commented 3 weeks ago

I'll look at adding something to training next. I think random prey with trained predators makes sense, will look to implement.

sash-a commented 3 weeks ago

Could you have something like a multi-agent package? I don't think you have anything similar at the moment? FYI I was intending to add a couple more swarm/flock type envs if this one went ok.

If you can add more that would be great! Then I'm happy to keep the swarm package as is. What we'd be most interested in is some kind of env with only 1 team and strictly co-operative, like predators vs heuristic prey or vice versa, not sure if you planned to make any envs like this?

But I had a quick look at the changes and it mostly looks great! Will leave an in-depth review later today/tomorrow :smile:

Also I updated the CI yesterday, we're now using ruff, so you will need to update your pre-commit

sash-a commented 3 weeks ago

One other thing, the only reason I've been hesitant to add this to Jumanji is because it's not that related to industry problems, which is a common focus across all the envs. I was thinking maybe we could re-frame the env from predator-prey to something else (without changing any code, just changing the idea). I was thinking maybe a continuous cleaner where your target position is changing, or something to do with drones (maybe delivery). Do you have any other ideas, and would you be happy with this?

zombie-einstein commented 3 weeks ago

Could you have something like a multi-agent package? I don't think you have anything similar at the moment? FYI I was intending to add a couple more swarm/flock type envs if this one went ok.

If you can add more that would be great! Then I'm happy to keep the swarm package as is. What we'd be most interested in is some kind of env with only 1 team and strictly co-operative, like predators vs heuristic prey or vice versa, not sure if you planned to make any envs like this?

Yeah, I was very interested in developing envs for co-operative multi-agent RL, so was keen to design or implement more environments along these lines. There's a simpler version of this environment which is just the flock, i.e. where the agents move in a co-ordinated way without colliding. I've also seen an environment where the agents have to effectively cover an area that I was going to look at.

Also I updated the CI yesterday, we're now using ruff, so you will need to update your pre-commit

How do I do this? I did try reinstalling pre-commit, but it raised an error that the config was invalid?

zombie-einstein commented 3 weeks ago

One other thing, the only reason I've been hesitant to add this to Jumanji is because it's not that related to industry problems, which is a common focus across all the envs. I was thinking maybe we could re-frame the env from predator-prey to something else (without changing any code, just changing the idea). I was thinking maybe a continuous cleaner where your target position is changing, or something to do with drones (maybe delivery). Do you have any other ideas, and would you be happy with this?

Yeah definitely open to suggestions. I was thinking more in the abstract for this (will the agents develop some collective behaviour to avoid predators) but happy to modify towards something more concrete.

sash-a commented 2 weeks ago

Great to hear on the co-operative MARL front, those both sound like nice envs to have.

How do I do this? I did try reinstalling pre-commit, but it raised an error that the config was invalid?

Couple things to try:

pip install -U pre-commit
pre-commit uninstall
pre-commit install

If this doesn't work, check which pre-commit; it should point to your virtual environment. If it's pointing to your system Python or some other system folder, just uninstall that version and rerun the above.

Yeah definitely open to suggestions. I was thinking more in the abstract for this (will the agents develop some collective behaviour to avoid predators) but happy to modify towards something more concrete.

Agreed it would be nice to keep it abstract for the sake of research, but I think it's nice that this env suite is all industry focused. I quite like something to do with drones - it seems quite industry focused, although we must definitely avoid anything to do with war. I'll give it a think.

zombie-einstein commented 2 weeks ago

Hi @sash-a, fixed the formatting and consolidated the predator-prey type.

sash-a commented 2 weeks ago

Thanks, I'll try to have a look tomorrow; sorry, the previous 2 days were a bit busier than expected.

For the theme I think maritime search and rescue works well. It's relatively real-world and fits the current dynamics.

zombie-einstein commented 2 weeks ago

Thanks, I'll try to have a look tomorrow; sorry, the previous 2 days were a bit busier than expected.

For the theme I think maritime search and rescue works well. It's relatively real-world and fits the current dynamics.

Thanks, no worries. Actually, funnily enough, a co-ordinated search was something I'd been looking into. We could have one set of agents with some drift and random movement that need to be found inside the simulated region.

sash-a commented 2 weeks ago

Sorry still didn't have time to review today and Mondays are usually super busy for me, but I'll get to this next week!

As for the theme, do you think we should then change the dynamics a bit so that the prey is heuristically controlled and moves sort of randomly?

zombie-einstein commented 2 weeks ago

Sorry still didn't have time to review today and Mondays are usually super busy for me, but I'll get to this next week!

As for the theme, do you think we should then change the dynamics a bit so that the prey is heuristically controlled and moves sort of randomly?

No worries, sure I'll do a revision this weekend!

zombie-einstein commented 2 weeks ago

Hi @sash-a, this turned into a larger rewrite (sorry for the extra review work, let me know if you want me to close this PR and just start with a fresh one), but I think it's a more realistic scenario.

A couple of choices we may want to consider:

sash-a commented 2 weeks ago

Thanks for this @zombie-einstein I'll start having a look now :smile:
I think leave the PR as is, no need to create a new one.

that has an interface to allow other behaviors

awesome!

Agents are only rewarded the first time a target is located

Agreed, I think we should actually hide targets once they are located, so as not to confuse other agents.

Agents are individually rewarded

I think individual is fine and users can sum it externally if they want, e.g. we do this in Mava for Connector.

At the moment agents only visualise other neighbours. A twist on this I considered was that once targets are revealed, they are then visualised (i.e. can be seen) by each agent as part of their local view.

Not quite following what you mean here. I would say an agent should observe all agents and targets (that have not yet been rescued) within their local view.

Do we want to scale rewards with how quickly targets are found? Feels like it would make sense.

Maybe add this as an optional reward type; I think I prefer 1 if a target is saved and 0 otherwise - it makes the env quite hard, but we should test what works best.
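
(As a rough sketch of the two options being discussed; the function names and the linear decay schedule are illustrative assumptions, not the PR's reward implementation.)

```python
import jax.numpy as jnp

# Sparse option: 1.0 the first time a target is found, 0.0 otherwise.
def sparse_reward(newly_found: jnp.ndarray) -> jnp.ndarray:
    return newly_found.astype(jnp.float32)

# Optional scaled variant: reward decays linearly with elapsed steps,
# so targets found earlier are worth more.
def time_scaled_reward(newly_found: jnp.ndarray, step: int, max_steps: int) -> jnp.ndarray:
    scale = 1.0 - step / max_steps
    return newly_found.astype(jnp.float32) * scale
```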

I've assigned a fixed number of steps to locate the targets, but it also seems it would make sense to terminate the episode when all are located?

Definitely!

As part of the observation I've included the remaining steps and targets as normalised floats, but not sure if you have some convention for values like this (i.e. just use integer values and let users rescale them).

We don't have a convention for this. I wouldn't add remaining steps to the obs directly, I don't see why the algorithm would need that, although again it needs to be tested. Agreed with remaining targets, it makes sense to observe that. I think normalised floats make sense.

zombie-einstein commented 2 weeks ago

Thanks @sash-a, just a couple follow ups to your questions:

At the moment agents only visualise other neighbours. A twist on this I considered was that once targets are revealed, they are then visualised (i.e. can be seen) by each agent as part of their local view.

Not quite following what you mean here. I would say an agent should observe all agents and targets (that have not yet been rescued) within their local view.

So I was picturing (and as currently implemented) a situation where the searchers have to come quite close to the targets to "find" them (as if they are obscured/hard to find), but the agents have a larger vision range to visualise the location of other searcher agents (to allow them to improve search patterns, for example).

My feeling was that this creates more of a search task; if the targets are part of their larger vision range, it feels like it could become more of a routing-type task.

I then thought it may be good to include found targets in the vision to allow agents to visualise density of located targets.

As part of the observation I've included the remaining steps and targets as normalised floats, but not sure if you have some convention for values like this (i.e. just use integer values and let users rescale them).

We don't have a convention for this. I wouldn't add remaining steps to the obs directly, I don't see why the algorithm would need that, although again it needs to be tested. Agreed with remaining targets, it makes sense to observe that. I think normalised floats make sense.

I thought that if treating it as a time-sensitive task, some indication of the remaining time to find targets could be a useful feature of the observation.
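
(A tiny sketch of what these normalised values could look like; the helper and field names are hypothetical, not the env's actual observation spec.)

```python
import jax.numpy as jnp

def global_features(step: int, max_steps: int, found: jnp.ndarray) -> jnp.ndarray:
    # Both values are scaled into [0, 1] rather than exposed as raw integers.
    remaining_steps = (max_steps - step) / max_steps
    remaining_targets = 1.0 - found.mean()  # fraction of targets still unfound
    return jnp.array([remaining_steps, remaining_targets], dtype=jnp.float32)

# e.g. 60 of 200 steps used, 3 of 10 targets found -> [0.7, 0.7]
feats = global_features(60, 200, jnp.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=jnp.float32))
```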

Please add a generator, dynamics and viewer test (see examples of the viewer test for other envs). Can you also add tests for the common/updates? Can you start looking into the networks and testing for jumanji?

Yup will do!

sash-a commented 1 week ago

Hi @zombie-einstein

So I was picturing (and as currently implemented) a situation where the searchers have to come quite close to the targets to "find" them (as if they are obscured/hard to find), but the agents have a larger vision range to visualise the location of other searcher agents (to allow them to improve search patterns, for example).

I think this is great!

I then thought it may be good to include found targets in the vision to allow agents to visualise density of located targets.

I see the thinking here, but I'm not even sure it's that beneficial because targets can move right?

One thing I'm a bit concerned about: as you increase the number of agents, will the problem not get easier? I don't see an option to increase the world size, so as the number of agents increases the density of searchers increases, making it easier to find targets. Is there a way we could increase the world size, or another way to avoid this issue?

Sorry been quite busy this last week, but I should have a lot more time next week to dedicate to this review :smile:

zombie-einstein commented 1 week ago

Hey @sash-a, no worries, still got stuff to get on with.

I then thought it may be good to include found targets in the vision to allow agents to visualise density of located targets.

I see the thinking here, but I'm not even sure it's that beneficial because targets can move right?

Yeah this is correct. I guess it depends on the target dynamics. For something simple like noisy movement with some drift it could help with identifying the drift and areas of low density that have not been searched yet? Kind of like the agents are communicating what they've found/have some memory.
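
(For concreteness, a minimal sketch of the sort of drift-plus-noise target dynamics described here; parameter names and values are assumptions rather than the PR's dynamics interface.)

```python
import jax
import jax.numpy as jnp

def drift_targets(
    key: jax.Array,
    positions: jnp.ndarray,  # (num_targets, 2)
    drift: jnp.ndarray,      # (2,) constant drift per step
    noise_scale: float = 0.005,
    env_size: float = 1.0,
) -> jnp.ndarray:
    # Random walk with a constant drift, wrapped to the environment bounds.
    noise = noise_scale * jax.random.normal(key, positions.shape)
    return (positions + drift + noise) % env_size
```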

One thing I'm a bit concerned about: as you increase the number of agents, will the problem not get easier? I don't see an option to increase the world size, so as the number of agents increases the density of searchers increases, making it easier to find targets. Is there a way we could increase the world size, or another way to avoid this issue?

Yeah it would. The region is fixed in Esquilax to the unit square, mainly just to reduce the number of parameters used in describing the interaction between agents (and to make my life a bit easier 😂), but it could be something to add to the library. The way to avoid this here would be scaling other parameters, e.g. scaling the vision range and speed range of agents, the only issue possibly being numerical accuracy.
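
(One possible way to express that scaling, keeping the effective searcher density roughly constant as the agent count grows; the baseline numbers are made-up defaults, not the env's parameters.)

```python
import math

def scaled_params(num_agents: int, base_agents: int = 32,
                  base_vision: float = 0.1, base_speed: float = 0.01):
    # On a fixed unit square, shrink vision and speed as the number of agents
    # grows so each searcher covers proportionally less area per step.
    factor = math.sqrt(base_agents / num_agents)
    return base_vision * factor, base_speed * factor
```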

For the network, is there a built-in way to do multi-agent training? If not I guess the most straightforward way to get it working would be to just have a single agent, and just wrap some means of flattening the rewards?

sash-a commented 1 week ago

Yeah this is correct. I guess it depends on the target dynamics.

True, maybe we can have it be an option? Although it might get messy to define observation shapes.

The region is fixed in Esquilax to the unit square

I see, so how many agents do you think it could scale to before it gets too crowded or too numerically unstable?

For the network, is there a built-in way to do multi-agent training? If not I guess the most straightforward way to get it working would be to just have a single agent, and just wrap some means of flattening the rewards?

That's exactly what we do. We wrap things in the multi-to-single wrapper and then treat it as a single-agent problem. See how we do the learning for LBF and Connector and this if statement in the trainer setup.
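
(Roughly what that looks like in practice; MultiToSingleWrapper is the aggregating wrapper in jumanji.wrappers as far as I recall, and the registered env id below is hypothetical until this PR lands.)

```python
import jax
import jumanji
from jumanji.wrappers import MultiToSingleWrapper

# Hypothetical id; whatever the env is registered under once this PR is merged.
env = jumanji.make("SearchAndRescue-v0")

# Aggregate per-agent rewards/discounts so the env can be trained as single-agent.
env = MultiToSingleWrapper(env)

key = jax.random.PRNGKey(0)
state, timestep = env.reset(key)
```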

zombie-einstein commented 1 week ago

Yeah this is correct. I guess it depends on the target dynamics.

True, maybe we can have it be an option? Although it might get messy to define observation shapes.

Yeah, it could be an optional thing, or we could just include it by default and the user can omit it? The only issue might be the impact of additional computation that may go unused?

The region is fixed in Esquilax to the unit square

I see, so how many agents do you think it could scale to before it gets too crowded or too numerically unstable?

I just added this functionality to Esquilax, and pulled the changes into this PR so the user can control the size of the space.

Do you mind if I resolve some of these comments? I wasn't sure if you wanted to use them for tracking, but it'd be handy if I could clear up outdated or implemented ones to see what remains outstanding.

sash-a commented 6 days ago

Yeah, it could be an optional thing, or we could just include it by default and the user can omit it? The only issue might be the impact of additional computation that may go unused?

Ye not sure, let's leave it for now and we can come back to it later if we feel it makes the problem too easy or is unnecessary

I just added this functionality to Esquilax, and pulled the changes into this PR so the user can control the size of the space.

That's awesome! :fire:

Do you mind if I resolve some of these comments? I wasn't sure if you wanted to use them for tracking, but it'd be handy if I could clear up outdated or implemented ones to see what remains outstanding.

Ye go for it, if it's done then please resolve :smile: