Testing this interactively using:
docker run -it --network evalai_rangl -e "RANGL_ENVIRONMENT_URL=http://nztc:5000" submission:v0.1.0 bash
root@93916536fecb:/service# python agent.py
/usr/local/lib/python3.8/site-packages/stable_baselines3/common/save_util.py:166: UserWarning: Could not deserialize object lr_schedule. Consider using `custom_objects` argument to replace this object.
warnings.warn(
Traceback (most recent call last):
File "agent.py", line 32, in <module>
action, _ = model.predict(obs, deterministic=True)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 552, in predict
return self.policy.predict(observation, state, mask, deterministic)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 333, in predict
observation, vectorized_env = self.obs_to_tensor(observation)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 250, in obs_to_tensor
vectorized_env = is_vectorized_observation(observation, self.observation_space)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 360, in is_vectorized_observation
return is_vec_obs_func(observation, observation_space)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 242, in is_vectorized_box_observation
raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.
Was the model trained using python 3.7, then loaded into the container in python 3.8?
The warning,
UserWarning: Could not deserialize object lr_schedule.
goes away with a python 3.7 container, i.e.
# Dockerfile
FROM python:3.7-slim-buster
This line
obs = client.env_reset(instance_id)
calls
def env_reset(self, instance_id):
route = "/v1/envs/{}/reset/".format(instance_id)
resp = self._post_request(route, None)
# NOTE: env.reset() currently has no return values
# therefore, bypass the response
# observation = resp["observation"]
return None
which returns None.
The problem is that this None observation is then passed to the model in
action, _ = model.predict(obs, deterministic=True)
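For illustration, a minimal guard in agent.py would make this failure mode explicit (a sketch only, assuming a 1-dimensional Box observation space as the error message suggests; the zero fallback is illustrative, not the actual fix):

import numpy as np

obs = client.env_reset(instance_id)  # client and model as already defined in agent.py
if obs is None:
    # np.array(None) has shape (), which is what triggers the ValueError above;
    # substitute a defined initial observation instead
    obs = np.zeros(1, dtype=np.float32)
else:
    obs = np.asarray(obs, dtype=np.float32).reshape(1)
action, _ = model.predict(obs, deterministic=True)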
@tomaszkosmala can you test whether this works now?
This is working now locally for me:
$ ./test_container.py
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
89960716 -19586.064453125
89960716 -6958.7119140625
89960716 -5751.5830078125
89960716 15062.984375
89960716 68996.4833984375
89960716 -86104.2353515625
89960716 15411.642578125
89960716 -35618.56298828125
89960716 -55815.54296875
89960716 270644.3193359375
89960716 329819.423828125
89960716 -145948.919921875
89960716 -1222577.4523925781
89960716 80895.70971679688
89960716 -1252747.8397827148
89960716 -165756.15325927734
89960716 223061.31018066406
89960716 164271.6160888672
89960716 160648.85369873047
89960716 78311.91040039062
89960716
done True
89960716
DEBUG:__main__:Instance id: 89960716
score: -1589740.8124389648
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
72e0804d -17663.49951171875
72e0804d -39395.52001953125
72e0804d -14425.7255859375
72e0804d 4999.158203125
72e0804d 40396.4423828125
72e0804d 47344.49609375
72e0804d -23383.4521484375
72e0804d -69302.5126953125
72e0804d 44115.716796875
72e0804d 48256.90576171875
72e0804d -65672.86181640625
72e0804d -67500.90185546875
72e0804d -382326.2316894531
72e0804d 21526.798583984375
72e0804d -450947.19677734375
72e0804d -12376.946411132812
72e0804d 140515.20739746094
72e0804d 95121.47338867188
72e0804d 50256.24041748047
72e0804d 75582.12493896484
72e0804d
done True
72e0804d
DEBUG:__main__:Instance id: 72e0804d
score: -574880.2845458984
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
79653f5c -24304.501953125
79653f5c -7722.6708984375
79653f5c -40160.8115234375
79653f5c -62750.9736328125
79653f5c 15224.8818359375
79653f5c -20076.974609375
79653f5c 28812.7255859375
79653f5c 74819.01953125
79653f5c -28468.47802734375
79653f5c 84301.7275390625
79653f5c -163977.232421875
79653f5c 20448.7373046875
79653f5c -659855.4155273438
79653f5c 64545.4345703125
79653f5c -845908.8607788086
79653f5c -106891.1127319336
79653f5c 26186.35723876953
79653f5c -47604.048400878906
79653f5c 45113.413146972656
79653f5c 93230.39147949219
79653f5c
done True
79653f5c
DEBUG:__main__:Instance id: 79653f5c
score: -1555038.3922729492
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
8b894cb2 -20275.9384765625
8b894cb2 -11439.5419921875
8b894cb2 -8719.0009765625
8b894cb2 30840.8740234375
8b894cb2 -98824.7734375
8b894cb2 46068.30078125
8b894cb2 -211260.7265625
8b894cb2 -3565.8408203125
8b894cb2 60539.748046875
8b894cb2 -27943.009765625
8b894cb2 11512.95947265625
8b894cb2 79028.2294921875
8b894cb2 -461691.20068359375
8b894cb2 -40380.07861328125
8b894cb2 -668979.7713012695
8b894cb2 55951.021484375
8b894cb2 224907.7345275879
8b894cb2 102591.51126098633
8b894cb2 75655.54992675781
8b894cb2 204345.17309570312
8b894cb2
done True
8b894cb2
DEBUG:__main__:Instance id: 8b894cb2
score: -661638.7805175781
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
da8bddcf -21379.6787109375
da8bddcf -7959.3447265625
da8bddcf 17643.02734375
da8bddcf -27628.478515625
da8bddcf -108573.2392578125
da8bddcf -157613.6435546875
da8bddcf -76909.03857421875
da8bddcf -58217.59423828125
da8bddcf 112664.845703125
da8bddcf 149576.12353515625
da8bddcf -81137.97265625
da8bddcf 6548.05517578125
da8bddcf -536688.3666992188
da8bddcf 152630.26000976562
da8bddcf -739540.5778808594
da8bddcf -118758.14379882812
da8bddcf 97778.27947998047
da8bddcf 174423.04473876953
da8bddcf 153402.8550415039
da8bddcf 154750.94342041016
da8bddcf
done True
da8bddcf
DEBUG:__main__:Instance id: da8bddcf
score: -914988.6441650391
Evaluation completed using 5 seeds.
Final average score: -1059257.3827880858
There is no error when executing test_container.py, which is great. There are two issues remaining:
- [ ] the results differ slightly from the local evaluation, e.g. locally -1,044,402.4792280579 vs in the container -1,001,111.8701486206,
- [ ] the submission of the meaningful agent to evalai fails.
(the seeds in the folders local_agent_training_and_evaluation and meaningful_agent_submission are different, but I evaluated against the same set of seeds)
I also have test_container.py executing successfully (with the caveat below). Now trying submission.
In case this is relevant: test_container.py works with 5 seeds, but when using the larger set https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/meaningful_agent_submission/seeds.csv it runs for a good while then exits with this error:
...
done True
ec57a1c5
DEBUG:__main__:Instance id: ec57a1c5
score: -1356883.9453735352
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
3b17c193 -17767.06640625
3b17c193 -30234.115234375
3b17c193 -9174.501953125
3b17c193 36726.33251953125
3b17c193 -7420.02734375
3b17c193 -94552.37939453125
3b17c193 -78216.27978515625
3b17c193 13097.20703125
3b17c193 1307.369140625
3b17c193 158882.56762695312
3b17c193 126855.59545898438
3b17c193 -11982.5048828125
3b17c193 -300943.041015625
3b17c193 99772.11499023438
3b17c193 -674147.873840332
3b17c193 -184090.33795166016
3b17c193 53696.27404785156
3b17c193 48386.46057128906
3b17c193 101092.71270751953
3b17c193 20183.45538330078
3b17c193
done True
3b17c193
DEBUG:__main__:Instance id: 3b17c193
score: -748528.0383300781
Traceback (most recent call last):
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.41/containers/create?name=agent
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/johnmoriarty/rangl/netzerotc/meaningful_agent_submission/./test_container.py", line 23, in <module>
submission = client.containers.run(
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/models/containers.py", line 819, in run
container = self.create(image=image, command=command,
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/models/containers.py", line 878, in create
resp = self.client.api.create_container(**create_kwargs)
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/container.py", line 428, in create_container
return self.create_container_from_config(config, name)
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/container.py", line 439, in create_container_from_config
return self._result(res, True)
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 274, in _result
self._raise_for_status(response)
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error for http+docker://localhost/v1.41/containers/create?name=agent: Conflict ("Conflict. The container name "/agent" is already in use by container "e38e366af014ba35e2b83cf3b8ee116059f6fdd887770a39de8feb4ec3850005". You have to remove (or rename) that container to be able to reuse that name.")
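The 409 itself is a name clash: test_container.py starts the container with the fixed name agent, so a leftover container from an interrupted run blocks the next one. A possible cleanup step before the run call, sketched with the Docker SDK (the name "agent" is taken from the error above; where exactly to place this in test_container.py is an assumption):

import docker
from docker.errors import NotFound

client = docker.from_env()
try:
    # remove any leftover container still holding the fixed name
    client.containers.get("agent").remove(force=True)
except NotFound:
    pass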
@masonlr sorry for slight newbie question but in https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/random_agent_submission/README.md and https://github.com/rangl-labs/netzerotc/blob/meaningful_agent/meaningful_agent_submission/README.md, should we maybe write docker-compose up -d --build instead of docker-compose up --build, to avoid the need for a new terminal?
The meaningful agent pushed successfully but remote evaluation failed. The stderr file was:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/containers/create
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 812, in run
detach=detach, **kwargs)
File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 870, in create
resp = self.client.api.create_container(**create_kwargs)
File "/usr/local/lib/python3.7/site-packages/docker/api/container.py", line 430, in create_container
return self.create_container_from_config(config, name)
File "/usr/local/lib/python3.7/site-packages/docker/api/container.py", line 441, in create_container_from_config
return self._result(res, True)
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
self._raise_for_status(response)
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/containers/create: Not Found ("No such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0")
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/images/614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0/json
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/code/scripts/workers/submission_worker.py", line 491, in run_submission
submission_metadata=submission_serializer.data,
File "/tmp/tmp0yspabgj/compute/challenge_data/challenge_1/main.py", line 84, in evaluate
f"RANGL_SEED={seed}",
File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 814, in run
self.client.images.pull(image, platform=platform)
File "/usr/local/lib/python3.7/site-packages/docker/models/images.py", line 456, in pull
repository, tag, '@' if tag.startswith('sha256:') else ':'
File "/usr/local/lib/python3.7/site-packages/docker/models/images.py", line 316, in get
return self.prepare_model(self.client.api.inspect_image(name))
File "/usr/local/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/docker/api/image.py", line 254, in inspect_image
self._get(self._url("/images/{0}/json", image)), True
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
self._raise_for_status(response)
File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/images/614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0/json: Not Found ("no such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0: No such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0")
@masonlr sorry for slight newbie question but in https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/random_agent_submission/README.md and https://github.com/rangl-labs/netzerotc/blob/meaningful_agent/meaningful_agent_submission/README.md, should we maybe write docker-compose up -d --build instead of docker-compose up --build, to avoid the need for a new terminal?
Yes this is fine. The main reason for the second terminal is to watch the logs and be able to quickly kill with ctrl+c. If you don't need that, then -d is fine.
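(If the logs are needed later, docker-compose logs -f still streams them from a detached stack, and docker-compose down stops it.)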
The main problem was that the VM ran out of disk space. Every time there is a submission it has to clone the image and run – it only has a 30GB drive at the moment.
I've added a 200GB drive and am in the process of moving the docker storage over to that.
I've submitted the meaningful agent image and the round trip is working.
I'm creating a 512GB disk and mounting it to /mnt/challenge. I will then modify /etc/docker/daemon.json with the following:
{
"data-root": "/mnt/challenge",
"storage-driver": "overlay2"
}
Then will restart docker:
sudo systemctl restart docker
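After the restart, docker info should report the new location under "Docker Root Dir".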
Round trip should be working now through evalai.
I also have test_container.py executing successfully (with the caveat below). Now trying submission.
In case this is relevant: test_container.py works with 5 seeds, but when using the larger set https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/meaningful_agent_submission/seeds.csv it runs for a good while then exits with this error:
For this one, I've just added in a debug line:
for i, seed in enumerate(seeds):
print(f"INFO: evaluating seed {i} of {len(seeds)}")
...
It's working okay on my laptop:
...
done True
9a2d8468
DEBUG:__main__:Instance id: 9a2d8468
score: -894961.9956665039
INFO: evaluating seed 73 of 100
...
done True
412b11d9
DEBUG:__main__:Instance id: 412b11d9
score: -367544.2293395996
Evaluation completed using 100 seeds.
Final average score: -1001111.8701486206
The roundtrip works for me as well. Currently I'm getting different results from local evaluation, local evaluation using test_container.py and remote on evalai (using 5 seeds).
The roundtrip works for me as well. Currently I'm getting different results from local evaluation, local evaluation using test_container.py and remote on evalai (using 5 seeds).
The evaluation script in https://github.com/rangl-labs/netzerotc/tree/main/evaluation/evaluation_script generates the list of random seeds from a single random seed (namely, 3423232). Perhaps different Python environments handle this differently, potentially generating different lists?
random.seed(3423232) # set a seed so that we generate the same seed_list each time
N_seeds = 5
seed_list = [random.randint(0, 1e7) for _ in range(N_seeds)]
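One way to rule this out (a sketch, not the current script) is to use a dedicated random.Random instance and integer bounds: passing the float 1e7 to randint is accepted silently by older Pythons but deprecated/rejected by newer ones, so 10**7 is the safer spelling.

import random

rng = random.Random(3423232)  # dedicated generator, independent of global state
N_seeds = 5
seed_list = [rng.randint(0, 10**7) for _ in range(N_seeds)]
print(seed_list)  # could be printed once and the list pinned in the repo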
@tomaszkosmala reports that
- If an agent is trained in the open-loop env.py, it can be successfully submitted to both the open-loop and the closed-loop phases on evalai, and evaluates successfully in both phases with exactly the same score
- However if an agent is trained in the open-loop env.py, local evaluation with the closed-loop env.py returns an error
- If an agent is trained in the closed-loop env.py, evaluation in the closed-loop phase of evalai returns an error.
This suggests that the evalai closed-loop and open-loop phases might be identical. @masonlr can we check that the evalai closed-loop phase is using the closed-loop env.py please?
Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Traceback (most recent call last):
File "docker/credentials/store.py", line 80, in _execute
File "subprocess.py", line 411, in check_output
File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bin/docker-compose", line 3, in <module>
File "compose/cli/main.py", line 67, in main
File "compose/cli/main.py", line 126, in perform_command
File "compose/cli/main.py", line 1070, in up
File "compose/cli/main.py", line 1066, in up
File "compose/project.py", line 615, in up
File "compose/service.py", line 346, in ensure_image_exists
File "compose/service.py", line 1125, in build
File "docker/api/build.py", line 261, in build
File "docker/api/build.py", line 308, in _set_auth_headers
File "docker/auth.py", line 302, in get_all_credentials
File "docker/credentials/store.py", line 71, in list
File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration".
[358537] Failed to execute script docker-compose
@tomaszkosmala reports that
- However if an agent is trained in the open-loop env.py, local evaluation with the closed-loop env.py returns an error
This is perhaps because https://github.com/rangl-labs/netzerotc/blob/27e43fb1f8e61b3b053acd2773080a72f875b72d/local_agent_training_and_evaluation/evaluate.py#L10 is creating an open-loop env only. We might either change the above line to env = gym.make("rangl:nztc-closed-loop-v0"), or rename the script 'evaluate_open_loop.py', duplicate it, make the change, and save the copy as 'evaluate_closed_loop.py'.
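A third option would be a single script that selects the variant with a flag (a sketch; the env ids follow the naming used above, and the rest of evaluate.py is assumed unchanged):

import argparse
import gym

parser = argparse.ArgumentParser()
parser.add_argument("--loop", choices=["open", "closed"], default="open")
args = parser.parse_args()

# pick the registered env id by variant instead of duplicating the script
env = gym.make(f"rangl:nztc-{args.loop}-loop-v0")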
Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM: ... [358537] Failed to execute script docker-compose
Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website would have the instructions needed, you could try https://docs.docker.com/engine/install/
Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website would have the instructions needed, you could try https://docs.docker.com/engine/install/
I tried following the instructions at https://docs.docker.com/engine/install/ubuntu/, but got stuck at the following:
(base) jia-chen@trainVM:~$ sudo apt-get remove docker docker-engine docker.io containerd runc
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg !=
E: The list of sources could not be read.
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg !=
E: The list of sources could not be read.
(base) jia-chen@trainVM:~$ sudo apt-get update
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg !=
E: The list of sources could not be read.
(base) jia-chen@trainVM:~$
Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:
on the VM:(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build Building nztc Traceback (most recent call last): File "docker/credentials/store.py", line 80, in _execute File "subprocess.py", line 411, in check_output File "subprocess.py", line 512, in run subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "bin/docker-compose", line 3, in <module> File "compose/cli/main.py", line 67, in main File "compose/cli/main.py", line 126, in perform_command File "compose/cli/main.py", line 1070, in up File "compose/cli/main.py", line 1066, in up File "compose/project.py", line 615, in up File "compose/service.py", line 346, in ensure_image_exists File "compose/service.py", line 1125, in build File "docker/api/build.py", line 261, in build File "docker/api/build.py", line 308, in _set_auth_headers File "docker/auth.py", line 302, in get_all_credentials File "docker/credentials/store.py", line 71, in list File "docker/credentials/store.py", line 93, in _execute docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration". [358537] Failed to execute script docker-compose
This is due to how AWS credentials are set up on the VM: docker should be working correctly on that VM.
@jia-chenhua can you please run the following:
$ env | grep AWS_DEFAULT_REGION
AWS_DEFAULT_REGION=us-east-1
If AWS_DEFAULT_REGION is not set, can you please set it using:
export AWS_DEFAULT_REGION=us-east-1
and ideally add this same line to the end of ~/.bashrc.
@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.
dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
"credsStore": "ecr-login"
}
This should fix the docker-compose issue above.
Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM: ... [358537] Failed to execute script docker-compose
Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website would have the instructions needed, you could try https://docs.docker.com/engine/install/
@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.
@tomaszkosmala reports that
* If an agent is trained in the open-loop `env.py`, it can be successfully submitted to both the open-loop and the closed-loop phases on evalai, and evaluates successfully in both phases with exactly the same score
* However if an agent is trained in the open-loop `env.py`, local evaluation with the closed-loop `env.py` returns an error
* If an agent is trained in the closed-loop `env.py`, evaluation in the closed-loop phase of evalai returns an error.
This suggests that the evalai closed-loop and open-loop phases might be identical. @masonlr can we check that the evalai closed-loop phase is using the closed-loop env.py please?
There is now a script that trains two models:
meaningful_agent_training/test_create_models.py
This will create saved_models/MODEL_closed_loop_0.zip and saved_models/MODEL_open_loop_0.zip
There are also now dedicated folders for closed-loop and open-loop, i.e.
meaningful_agent_submission/closed_loop
meaningful_agent_submission/open_loop
Results from testing this locally currently make sense: the closed-loop agent gives reproducible results with the closed-loop environment. (Similarly, the open-loop agent gives reproducible results with the open-loop environment).
If you submit the closed-loop agent to the open-loop environment there is an error. (Similarly, if you submit the open-loop agent to the closed-loop environment there is an error.)
Pre-trained agent: Closed loop Environment: Closed loop
Test 1:
Evaluation completed using 8 seeds.
Final average score: 83324.07849884033
Test 2:
Evaluation completed using 8 seeds.
Final average score: 83324.07849884033
Pre-trained agent: Open loop Environment: Closed loop
INFO: evaluating seed 0 of 8
Traceback (most recent call last):
File "/Users/lrmason/github.com/rangl-labs/netzerotc/meaningful_agent_submission/closed_loop/./test_container.py", line 27, in <module>
submission = client.containers.run(
File "/Users/lrmason/miniconda3/envs/dev/lib/python3.10/site-packages/docker/models/containers.py", line 848, in run
raise ContainerError(
docker.errors.ContainerError: Command 'None' in image 'submission-closed-loop:v0.1.0' returned non-zero exit status 1
Interactive debugging gives:
root@87fcc9c273a2:/service# python agent.py
Traceback (most recent call last):
File "agent.py", line 35, in <module>
action, _ = model.predict(obs, deterministic=True)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 473, in predict
return self.policy.predict(observation, state, mask, deterministic)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 281, in predict
vectorized_env = is_vectorized_observation(observation, self.observation_space)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 226, in is_vectorized_observation
raise ValueError(
ValueError: Error: Unexpected observation shape (16,) for Box environment, please use (1,) or (n_env, 1) for the observation shape.
Pre-trained agent: Open loop Environment: Open loop
Test 1:
Evaluation completed using 8 seeds.
Final average score: -57440.163009643555
Test 2:
Evaluation completed using 8 seeds.
Final average score: -57440.163009643555
Pre-trained agent: Closed loop Environment: Open loop
INFO: evaluating seed 0 of 8
Traceback (most recent call last):
File "/Users/lrmason/github.com/rangl-labs/netzerotc/meaningful_agent_submission/open_loop/./test_container.py", line 27, in <module>
submission = client.containers.run(
File "/Users/lrmason/miniconda3/envs/dev/lib/python3.10/site-packages/docker/models/containers.py", line 848, in run
raise ContainerError(
docker.errors.ContainerError: Command 'None' in image 'submission-open-loop:v0.1.0' returned non-zero exit status 1
Interactive debugging gives:
root@de9557b4c7a4:/service# python agent.py
Traceback (most recent call last):
File "agent.py", line 35, in <module>
action, _ = model.predict(obs, deterministic=True)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 473, in predict
return self.policy.predict(observation, state, mask, deterministic)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 281, in predict
vectorized_env = is_vectorized_observation(observation, self.observation_space)
File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 226, in is_vectorized_observation
raise ValueError(
ValueError: Error: Unexpected observation shape (1,) for Box environment, please use (16,) or (n_env, 16) for the observation shape.
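A cheap way to make these failures self-explanatory would be a shape check in agent.py before calling predict (a sketch; model.observation_space is the space stable-baselines3 restores when the model is loaded, and the shapes in the comment follow the two errors above):

import numpy as np

obs = np.asarray(obs, dtype=np.float32).reshape(-1)
expected = model.observation_space.shape  # (1,) for the open-loop model, (16,) for the closed-loop model
if obs.shape != expected:
    raise SystemExit(
        f"model expects observation shape {expected}, got {obs.shape}: "
        "open-loop and closed-loop agents are not interchangeable"
    )
action, _ = model.predict(obs, deterministic=True)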
To test the above with EvalAI, there is a temporary challenge that corresponds to the meaningful_agent
branch with 8 seeds for the evaluation: see https://github.com/rangl-labs/netzerotc/blob/a4082c2467fde333e3298712a7a571ff9175f6f0/evaluation/evaluation_script/main.py#L59-L66
Running the combinations gives:
Pre-trained agent: Closed loop Environment: Closed loop
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4
The results file in the evalai frontend is:
{"Average Cost": -83324.07849884033}
Pre-trained agent: Open loop Environment: Open loop
evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4
The results file in the evalai frontend is:
{"Average Cost": 57440.163009643555}
Pre-trained agent: Open loop Environment: Closed loop
# after modifying the agent.py file to load the open loop model
# MODEL_PATH = "saved_models/MODEL_open_loop_0"
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4
Fails
test Submission ID: 22
Submission Status : failed
Execution Time (sec) : 6.2344
Submitted At : 01/11/22 03:39:11 AM
Pre-trained agent: Closed loop Environment: Open loop
# after modifying the agent.py file to load the closed loop model
# MODEL_PATH = "saved_models/MODEL_closed_loop_0"
evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4
Fails
test Submission ID: 24
Submission Status : failed
Execution Time (sec) : 2.129368
Submitted At : 01/11/22 03:43:11 AM
To test the above with EvalAI, there is a temporary challenge that corresponds to the meaningful_agent branch with 8 seeds for the evaluation: see https://github.com/rangl-labs/netzerotc/blob/a4082c2467fde333e3298712a7a571ff9175f6f0/evaluation/evaluation_script/main.py#L59-L66
Many thanks for testing this. I have now hidden the old challenge on EvalAI and left visible the 'temporary' challenge, which we can now use as the main challenge (as it has been fully tested).
Merging because fully tested. I will formally introduce the participants to EvalAI and the repo now and we can take care of enlarging the set of seeds later this week.
@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.
Hi @masonlr, I'm now testing it on trainVM (51.140.94.162) with username jia-chen, and here are the outputs and error message:
(base) jia-chen@trainVM:~$ env | grep AWS_DEFAULT_REGION
(base) jia-chen@trainVM:~$ export AWS_DEFAULT_REGION=us-east-1
(base) jia-chen@trainVM:~$ env | grep AWS_DEFAULT_REGION
AWS_DEFAULT_REGION=us-east-1
(base) jia-chen@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
"credsStore": "ecr-login"
}
(base) jia-chen@trainVM:~$ cd github-repo/netzerotc/
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Traceback (most recent call last):
File "docker/credentials/store.py", line 80, in _execute
File "subprocess.py", line 411, in check_output
File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "bin/docker-compose", line 3, in <module>
File "compose/cli/main.py", line 67, in main
File "compose/cli/main.py", line 126, in perform_command
File "compose/cli/main.py", line 1070, in up
File "compose/cli/main.py", line 1066, in up
File "compose/project.py", line 615, in up
File "compose/service.py", line 346, in ensure_image_exists
File "compose/service.py", line 1125, in build
File "docker/api/build.py", line 261, in build
File "docker/api/build.py", line 308, in _set_auth_headers
File "docker/auth.py", line 302, in get_all_credentials
File "docker/credentials/store.py", line 71, in list
File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration".
[370477] Failed to execute script docker-compose
(base) jia-chen@trainVM:~/github-repo/netzerotc$
with a screenshot attached.
@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.
@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?
@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.
@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?
I'm on Vodafone's broadband at home, and the upload speed is below 10Mbps or 1MBytes/s. Not sure if this is faster or slower than copper. But anyway, since I don't have Docker working on Windows on my laptop (yet), submitting via VM is the only possible way for me.
@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.
@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?
@moriartyjm now has access to trainVM.
@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.
dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json { "credsStore": "ecr-login" }
This should fix the docker-compose issue above.
@jia-chenhua, you will need to remove the file at ~/.docker/config.json.
I can change user to your account and test this, i.e.
$ sudo su -- jia-chen
(base) jia-chen@trainVM:/home/dev/netzerotc$ cat /home/jia-chen/.docker/config.json
{
"credsStore": "ecr-login"
}
(base) jia-chen@trainVM:/home/dev/netzerotc$ sudo rm -f /home/jia-chen/.docker/config.json
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Step 1/8 : FROM python:3.9-slim-buster
---> 2f2210ecbb1c
Step 2/8 : WORKDIR /service
---> Using cache
---> bf0b847c34f0
Step 3/8 : COPY rangl/requirements.txt .
...
@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.
dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json { "credsStore": "ecr-login" }
This should fix the docker-compose issue above.
@jia-chenhua, you will need to remove the file at
~/.docker/config.json
.I can change user to your account and test this, i.e.
$ sudo su -- jia-chen (base) jia-chen@trainVM:/home/dev/netzerotc$ cat /home/jia-chen/.docker/config.json { "credsStore": "ecr-login" } (base) jia-chen@trainVM:/home/dev/netzerotc$ sudo rm -f /home/jia-chen/.docker/config.json (base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build Building nztc Step 1/8 : FROM python:3.9-slim-buster ---> 2f2210ecbb1c Step 2/8 : WORKDIR /service ---> Using cache ---> bf0b847c34f0 Step 3/8 : COPY rangl/requirements.txt . ...
Thanks a lot! I can confirm it's working now, and I'll continue to test the submission.
@moriartyjm Now I'm pleased to confirm that the submission to http://submissions.rangl.org via trainVM is working properly:
I copied the brute-force trained model 717 for open-loop to meaningful_agent_submission/open_loop/saved_models, modified meaningful_agent_submission/open_loop/agent.py to load it, and then ran python test_container.py (after commenting out https://github.com/rangl-labs/netzerotc/blob/245a6072d748763d7013a7e680696429ea061b71/meaningful_agent_submission/open_loop/test_container.py#L37 following @masonlr's help; maybe we need to fix it and push it), and got:
done True
c8be5137
DEBUG:__main__:Instance id: c8be5137
score: 720166.0662841797
Evaluation completed using 8 seeds.
Final average score: 649451.1227722168
on the trainVM. Then I ran evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4 after logging in to http://submissions.rangl.org, where I can see my submission and the "Stdout file" shows:
done True
6987374b
instance_id: 6987374b
{'score': {'value1': 720166.0662841797}}
mean_score 649451.1227722168
phase_codename open-loop
output {'result': [{'test_split': {'Average Cost': -649451.1227722168}}], 'submission_result': {'Average Cost': -649451.1227722168}}
Completed evaluation for open-loop phase
which matches the local results on the trainVM.
And for the closed loop, I copied the brute-force trained model 694 for closed loop to meaningful_agent_submission/closed_loop/saved_models, modified meaningful_agent_submission/closed_loop/agent.py to load it, and then ran python test_container.py (the command="sleep infinity", doesn't exist in https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_submission/closed_loop/test_container.py, so there is no need to comment it out), and got:
099db9c1
done True
099db9c1
DEBUG:__main__:Instance id: 099db9c1
score: 704282.6439208984
Evaluation completed using 8 seeds.
Final average score: 501442.3190917969
on the trainVM. Then I ran evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4 after logging in to http://submissions.rangl.org, where I can see my submission and the "Stdout file" shows:
mean_score 501442.3190917969
phase_codename closed-loop
output {'result': [{'test_split': {'Average Cost': -501442.3190917969}}], 'submission_result': {'Average Cost': -501442.3190917969}}
which, again, matches the local results on the trainVM.
So now it seems to be working properly, although it's weird that the size of the open-loop submission image is > 3GB while the size of the closed-loop submission image is only ~8.2MB (Update: according to @masonlr, the smaller file contains incremental diffs: there's a lot of overlap between the two images and docker caches the common layers). My two submissions above are currently shown in the leaderboard. @moriartyjm, please let me know if I should hide them from the leaderboard (Update: @masonlr has now reset the EvalAI system so all data I submitted has been cleared; since submission via trainVM is working, I can re-do it whenever needed).
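For reference, docker history <image> lists each image layer with its size, which makes the shared cached layers visible.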
Additional thoughts: change evalai push submission:v0.1.0 --phase <phase_name> to evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4 in the open-loop folder and to evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4 in the closed-loop folder. This would be helpful for people without experience in Docker/EvalAI.
The sleep infinity problem is fixed as part of #46.
The README files will also be updated as part of #46.
@jia-chenhua As a final check, could you please compare the following evaluation results:
- Evaluation locally without a container (eg simply using the evaluate.py script)
- Evaluation locally in a container with the test_container.py
Many thanks!
@jia-chenhua As a final check, could you please compare the following evaluation results:
- Evaluation locally without a container (eg simply using the evaluate.py script)
- Evaluation locally in a container with the test_container.py
Many thanks!
I didn't get the same result on trainVM when running https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_training/evaluate.py (after deleting all seeds in https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_training/seeds.csv except the first 8) compared to running test_container.py:
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_training$ python evaluate.py
Mean reward of model closed_loop_694: 605108.5162811279
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_training$ python evaluate.py
Mean reward of model open_loop_717: 2022223.5828170776
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_submission/closed_loop$ python test_container.py
Final average score: 501442.3190917969
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_submission/open_loop$ python test_container.py
Final average score: 649451.1227722168
Also, changing https://github.com/rangl-labs/netzerotc/blob/7b392a24bc0056bcdede35367ce5b8c07634ccd0/meaningful_agent_submission/closed_loop/agent.py#L9 and https://github.com/rangl-labs/netzerotc/blob/7b392a24bc0056bcdede35367ce5b8c07634ccd0/meaningful_agent_submission/open_loop/agent.py#L9 to other model numbers didn't change the result of python test_container.py: I tried MODEL_PATH = "saved_models/MODEL_open_loop_717" and MODEL_PATH = "saved_models/MODEL_closed_loop_694", but the results are exactly the same as with MODEL_PATH = "saved_models/MODEL_open_loop_0" and MODEL_PATH = "saved_models/MODEL_closed_loop_0". So there seems to be something wrong in test_container.py and/or agent.py, such that the model number set manually in agent.py is not the one evaluated.
Update: it's just that I need to run make build every time I change agent.py. And when there is a conflict of docker image names, I need to delete the images, after which make build takes more than 5 minutes, I think. That's why I tended to run only docker-compose up -d --build and not make build.
Now it's working properly, as shown in the attached screenshot.
Hi @jia-chenhua , you should only need to run the environment command once and leave it running in the background, i.e.
cd netzerotc
docker-compose up -d --build
For the agents, yes, we need to rebuild them when we make changes – otherwise the docker "tag", here submission-closed-loop:v0.1.0 for example, would still be pointing to the previously built image.
So the process is:
# make code changes
make build
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1
# make code changes
make build
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1
# keep repeating
In practice, we would keep "bumping" up the version number in the Makefile so that we have a history of agents:
submission-closed-loop:v0.1.0
submission-closed-loop:v0.1.1
submission-closed-loop:v0.1.2
and then we would be able to submit these independently using
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1
evalai push submission-closed-loop:v0.1.1 --phase nztc-closed-loop-1
evalai push submission-closed-loop:v0.1.2 --phase nztc-closed-loop-1
This pull request tests a pre-trained agent, i.e. an agent with more complexity than the random-actions agent of https://github.com/rangl-labs/netzerotc/tree/0f2307fa1c216367be55b73e0c16bfa06575920b/random_agent_submission
We're addressing: