rangl-labs / netzerotc

MIT License

Test meaningful agent submission #106

Closed masonlr closed 2 years ago

masonlr commented 2 years ago

This pull request tests a pre-trained agent, i.e. an agent more complex than the random-actions agent of https://github.com/rangl-labs/netzerotc/tree/0f2307fa1c216367be55b73e0c16bfa06575920b/random_agent_submission

We're addressing:

masonlr commented 2 years ago

Testing this interactively using:

docker run -it --network evalai_rangl -e "RANGL_ENVIRONMENT_URL=http://nztc:5000" submission:v0.1.0 bash
masonlr commented 2 years ago
root@93916536fecb:/service# python agent.py
/usr/local/lib/python3.8/site-packages/stable_baselines3/common/save_util.py:166: UserWarning: Could not deserialize object lr_schedule. Consider using `custom_objects` argument to replace this object.
  warnings.warn(
Traceback (most recent call last):
  File "agent.py", line 32, in <module>
    action, _ = model.predict(obs, deterministic=True)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 552, in predict
    return self.policy.predict(observation, state, mask, deterministic)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 333, in predict
    observation, vectorized_env = self.obs_to_tensor(observation)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 250, in obs_to_tensor
    vectorized_env = is_vectorized_observation(observation, self.observation_space)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 360, in is_vectorized_observation
    return is_vec_obs_func(observation, observation_space)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 242, in is_vectorized_box_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.
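The error says a scalar observation (shape `()`) reached `model.predict()` when the Box space expects `(1,)`. A minimal sketch of the shape fix, assuming a hypothetical scalar observation:

```python
import numpy as np

# Lift a scalar observation (shape ()) to the 1-D shape (1,) that the
# Box space expects, before calling model.predict(). The value here is
# a hypothetical stand-in for the real observation.
obs = np.float32(0.5)              # shape ()
obs = np.asarray(obs).reshape(1)   # shape (1,)
```

(In this thread the root cause turned out to be a None observation, discussed below, but the same reshape pattern applies whenever the agent receives a bare scalar.)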
masonlr commented 2 years ago

Was the model trained using python 3.7, then loaded into the container in python 3.8?

The warning,

UserWarning: Could not deserialize object lr_schedule.

goes away with a python 3.7 container, i.e.

# Dockerfile
FROM python:3.7-slim-buster
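If staying on a Python 3.8 container, the `custom_objects` argument that the warning itself mentions can replace the objects that fail to unpickle. A sketch, where the replacement values, the `PPO` class, and `MODEL_PATH` are assumptions (these objects are not used at inference time anyway):

```python
# stable-baselines3's load() accepts custom_objects to substitute for
# objects that cannot be deserialized across Python versions.
# The concrete values below are placeholder assumptions.
custom_objects = {
    "lr_schedule": lambda _: 3e-4,  # replaces the un-picklable schedule
    "clip_range": lambda _: 0.2,
}
# model = PPO.load(MODEL_PATH, custom_objects=custom_objects)
```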
masonlr commented 2 years ago

This line

obs = client.env_reset(instance_id)

calls

    def env_reset(self, instance_id):
        route = "/v1/envs/{}/reset/".format(instance_id)
        resp = self._post_request(route, None)

        # NOTE: env.reset() currently has no return values
        # therefore, bypass the response
        # observation = resp["observation"]
        return None

which returns None.
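One way to make the client robust here is to extract the observation from the reset response when the server provides one, and return None otherwise. A sketch with a hypothetical standalone helper (the real fix would live inside `env_reset`):

```python
def obs_from_reset_response(resp):
    """Hypothetical helper: pull the initial observation out of a
    /v1/envs/<id>/reset/ response, tolerating servers whose reset
    endpoint returns nothing."""
    if isinstance(resp, dict) and "observation" in resp:
        return resp["observation"]
    return None
```

The agent could then check for a None observation after reset and fall back to a first `env_step` (or a default observation) instead of passing None into `model.predict()`.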

masonlr commented 2 years ago

The problem is that a None action is being sent to the environment through

action, _ = model.predict(obs, deterministic=True)
masonlr commented 2 years ago

@tomaszkosmala can you test whether this works now?

This is working now locally for me:

$ ./test_container.py
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
89960716 -19586.064453125
89960716 -6958.7119140625
89960716 -5751.5830078125
89960716 15062.984375
89960716 68996.4833984375
89960716 -86104.2353515625
89960716 15411.642578125
89960716 -35618.56298828125
89960716 -55815.54296875
89960716 270644.3193359375
89960716 329819.423828125
89960716 -145948.919921875
89960716 -1222577.4523925781
89960716 80895.70971679688
89960716 -1252747.8397827148
89960716 -165756.15325927734
89960716 223061.31018066406
89960716 164271.6160888672
89960716 160648.85369873047
89960716 78311.91040039062
89960716
done True
89960716
DEBUG:__main__:Instance id: 89960716
score: -1589740.8124389648
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
72e0804d -17663.49951171875
72e0804d -39395.52001953125
72e0804d -14425.7255859375
72e0804d 4999.158203125
72e0804d 40396.4423828125
72e0804d 47344.49609375
72e0804d -23383.4521484375
72e0804d -69302.5126953125
72e0804d 44115.716796875
72e0804d 48256.90576171875
72e0804d -65672.86181640625
72e0804d -67500.90185546875
72e0804d -382326.2316894531
72e0804d 21526.798583984375
72e0804d -450947.19677734375
72e0804d -12376.946411132812
72e0804d 140515.20739746094
72e0804d 95121.47338867188
72e0804d 50256.24041748047
72e0804d 75582.12493896484
72e0804d
done True
72e0804d
DEBUG:__main__:Instance id: 72e0804d
score: -574880.2845458984
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
79653f5c -24304.501953125
79653f5c -7722.6708984375
79653f5c -40160.8115234375
79653f5c -62750.9736328125
79653f5c 15224.8818359375
79653f5c -20076.974609375
79653f5c 28812.7255859375
79653f5c 74819.01953125
79653f5c -28468.47802734375
79653f5c 84301.7275390625
79653f5c -163977.232421875
79653f5c 20448.7373046875
79653f5c -659855.4155273438
79653f5c 64545.4345703125
79653f5c -845908.8607788086
79653f5c -106891.1127319336
79653f5c 26186.35723876953
79653f5c -47604.048400878906
79653f5c 45113.413146972656
79653f5c 93230.39147949219
79653f5c
done True
79653f5c
DEBUG:__main__:Instance id: 79653f5c
score: -1555038.3922729492
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
8b894cb2 -20275.9384765625
8b894cb2 -11439.5419921875
8b894cb2 -8719.0009765625
8b894cb2 30840.8740234375
8b894cb2 -98824.7734375
8b894cb2 46068.30078125
8b894cb2 -211260.7265625
8b894cb2 -3565.8408203125
8b894cb2 60539.748046875
8b894cb2 -27943.009765625
8b894cb2 11512.95947265625
8b894cb2 79028.2294921875
8b894cb2 -461691.20068359375
8b894cb2 -40380.07861328125
8b894cb2 -668979.7713012695
8b894cb2 55951.021484375
8b894cb2 224907.7345275879
8b894cb2 102591.51126098633
8b894cb2 75655.54992675781
8b894cb2 204345.17309570312
8b894cb2
done True
8b894cb2
DEBUG:__main__:Instance id: 8b894cb2
score: -661638.7805175781
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
da8bddcf -21379.6787109375
da8bddcf -7959.3447265625
da8bddcf 17643.02734375
da8bddcf -27628.478515625
da8bddcf -108573.2392578125
da8bddcf -157613.6435546875
da8bddcf -76909.03857421875
da8bddcf -58217.59423828125
da8bddcf 112664.845703125
da8bddcf 149576.12353515625
da8bddcf -81137.97265625
da8bddcf 6548.05517578125
da8bddcf -536688.3666992188
da8bddcf 152630.26000976562
da8bddcf -739540.5778808594
da8bddcf -118758.14379882812
da8bddcf 97778.27947998047
da8bddcf 174423.04473876953
da8bddcf 153402.8550415039
da8bddcf 154750.94342041016
da8bddcf
done True
da8bddcf
DEBUG:__main__:Instance id: da8bddcf
score: -914988.6441650391
Evaluation completed using 5 seeds.
Final average score:  -1059257.3827880858
tomaszkosmala commented 2 years ago

There is no error when executing test_container.py, which is great. Two issues remain:

  • the results differ slightly from the local evaluation, e.g. locally -1,044,402.4792280579 vs -1,001,111.8701486206 in the container
  • the submission of the meaningful agent to evalai fails

(the seeds in the folders local_agent_training_and_evaluation and meaningful_agent_submission are different, but I evaluated against the same set of seeds)

moriartyjm commented 2 years ago

I also have test_container.py executing successfully (with the caveat below). Now trying submission.

In case this is relevant: test_container.py works with 5 seeds, but when using the larger set https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/meaningful_agent_submission/seeds.csv it runs for a good while then exits with this error:

...
done True
ec57a1c5
DEBUG:__main__:Instance id: ec57a1c5
score: -1356883.9453735352
DEBUG:__main__:Created submission
DEBUG:__main__:Completed submission
output
3b17c193 -17767.06640625
3b17c193 -30234.115234375
3b17c193 -9174.501953125
3b17c193 36726.33251953125
3b17c193 -7420.02734375
3b17c193 -94552.37939453125
3b17c193 -78216.27978515625
3b17c193 13097.20703125
3b17c193 1307.369140625
3b17c193 158882.56762695312
3b17c193 126855.59545898438
3b17c193 -11982.5048828125
3b17c193 -300943.041015625
3b17c193 99772.11499023438
3b17c193 -674147.873840332
3b17c193 -184090.33795166016
3b17c193 53696.27404785156
3b17c193 48386.46057128906
3b17c193 101092.71270751953
3b17c193 20183.45538330078
3b17c193
done True
3b17c193
DEBUG:__main__:Instance id: 3b17c193
score: -748528.0383300781
Traceback (most recent call last):
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.41/containers/create?name=agent

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/johnmoriarty/rangl/netzerotc/meaningful_agent_submission/./test_container.py", line 23, in <module>
    submission = client.containers.run(
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/models/containers.py", line 819, in run
    container = self.create(image=image, command=command,
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/models/containers.py", line 878, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/container.py", line 428, in create_container
    return self.create_container_from_config(config, name)
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/container.py", line 439, in create_container_from_config
    return self._result(res, True)
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/Users/johnmoriarty/opt/miniconda3/lib/python3.9/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error for http+docker://localhost/v1.41/containers/create?name=agent: Conflict ("Conflict. The container name "/agent" is already in use by container "e38e366af014ba35e2b83cf3b8ee116059f6fdd887770a39de8feb4ec3850005". You have to remove (or rename) that container to be able to reuse that name.")
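The 409 means a previous run left a container named "agent" behind (e.g. after a crash). Two common workarounds are removing the stale container first (`docker rm -f agent`) or giving each run a unique name. A sketch of the latter; the helper name is hypothetical:

```python
import uuid

def unique_container_name(base="agent"):
    # Appending a short random suffix means repeated test_container.py
    # runs never collide on the container name, even when an earlier
    # run exited without cleaning up.
    return f"{base}-{uuid.uuid4().hex[:8]}"
```

test_container.py could pass `name=unique_container_name()` to `client.containers.run(...)`, or alternatively set `remove=True` so finished containers clean themselves up.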
moriartyjm commented 2 years ago

@masonlr sorry for the slightly newbie question, but in https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/random_agent_submission/README.md and https://github.com/rangl-labs/netzerotc/blob/meaningful_agent/meaningful_agent_submission/README.md , should we maybe write docker-compose up -d --build instead of docker-compose up --build, to avoid the need for a new terminal?

moriartyjm commented 2 years ago

The meaningful agent pushed successfully but remote evaluation failed. The stderr file was:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/containers/create

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 812, in run
    detach=detach, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 870, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/container.py", line 430, in create_container
    return self.create_container_from_config(config, name)
  File "/usr/local/lib/python3.7/site-packages/docker/api/container.py", line 441, in create_container_from_config
    return self._result(res, True)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/containers/create: Not Found ("No such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.41/images/614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0/json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/code/scripts/workers/submission_worker.py", line 491, in run_submission
    submission_metadata=submission_serializer.data,
  File "/tmp/tmp0yspabgj/compute/challenge_data/challenge_1/main.py", line 84, in evaluate
    f"RANGL_SEED={seed}",
  File "/usr/local/lib/python3.7/site-packages/docker/models/containers.py", line 814, in run
    self.client.images.pull(image, platform=platform)
  File "/usr/local/lib/python3.7/site-packages/docker/models/images.py", line 456, in pull
    repository, tag, '@' if tag.startswith('sha256:') else ':'
  File "/usr/local/lib/python3.7/site-packages/docker/models/images.py", line 316, in get
    return self.prepare_model(self.client.api.inspect_image(name))
  File "/usr/local/lib/python3.7/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/docker/api/image.py", line 254, in inspect_image
    self._get(self._url("/images/{0}/json", image)), True
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.7/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.ImageNotFound: 404 Client Error for http+docker://localhost/v1.41/images/614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0/json: Not Found ("no such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0: No such image: 614224286813.dkr.ecr.us-east-1.amazonaws.com/nztc-challenge-1-participant-team-2:9f585c2e-bd6e-49b3-9a5d-16958a2849b0")
masonlr commented 2 years ago

@masonlr sorry for the slightly newbie question, but in https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/random_agent_submission/README.md and https://github.com/rangl-labs/netzerotc/blob/meaningful_agent/meaningful_agent_submission/README.md , should we maybe write docker-compose up -d --build instead of docker-compose up --build, to avoid the need for a new terminal?

Yes this is fine. The main reason for the second terminal is to watch the logs and be able to quickly kill with ctrl+c. If you don't need that, then -d is fine.

masonlr commented 2 years ago

The main problem was that the VM ran out of disk space. Every submission has to pull a copy of the image and run it, and the VM only has a 30GB drive at the moment.

I've added a 200GB drive and am in the process of moving the docker storage over to that.

I've submitted the meaningful agent image and the round trip is working.

masonlr commented 2 years ago

I'm creating a 512GB disk and mounting it to /mnt/challenge. I will then modify /etc/docker/daemon.json with the following:

{
    "data-root": "/mnt/challenge",
    "storage-driver": "overlay2"
}

Then will restart docker:

sudo systemctl restart docker
masonlr commented 2 years ago

Round trip should be working now through evalai.

masonlr commented 2 years ago

I also have test_container.py executing successfully (with the caveat below). Now trying submission.

In case this is relevant: test_container.py works with 5 seeds, but when using the larger set https://github.com/rangl-labs/netzerotc/blob/613ce2880aee18862cf236efb4b10e6055274353/meaningful_agent_submission/seeds.csv it runs for a good while then exits with this error:

For this one, I've just added in a debug line:

for i, seed in enumerate(seeds):
    print(f"INFO: evaluating seed {i} of {len(seeds)}")
    ...

It's working okay on my laptop:

...
done True
9a2d8468
DEBUG:__main__:Instance id: 9a2d8468
score: -894961.9956665039
INFO: evaluating seed 73 of 100
...
done True
412b11d9
DEBUG:__main__:Instance id: 412b11d9
score: -367544.2293395996
Evaluation completed using 100 seeds.
Final average score:  -1001111.8701486206
tomaszkosmala commented 2 years ago

The roundtrip works for me as well. Currently I'm getting different results from the local evaluation, the local evaluation using test_container.py, and the remote evaluation on evalai (using 5 seeds).

moriartyjm commented 2 years ago

The roundtrip works for me as well. Currently I'm getting different results from the local evaluation, the local evaluation using test_container.py, and the remote evaluation on evalai (using 5 seeds).

The evaluation script in https://github.com/rangl-labs/netzerotc/tree/main/evaluation/evaluation_script generates the list of random seeds from a single random seed (namely, 3423232). Perhaps different Python environments handle this differently, potentially generating different lists?

    random.seed(3423232)  # set a seed so that we generate the same seed_list each time
    N_seeds = 5
    seed_list = [random.randint(0, 1e7) for _ in range(N_seeds)]
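A quick way to check this is to regenerate the list with an isolated generator in each environment and compare. One possible source of divergence worth noting: 1e7 is a float, and newer Python versions deprecate or reject float arguments to random.randint. A sketch, assuming the logic mirrors the evaluation script:

```python
import random

def make_seed_list(master_seed=3423232, n_seeds=5):
    # random.Random isolates this generator from any other random.* calls.
    # Note the integer bound: random.randint(0, 1e7) with a float upper
    # bound is rejected by newer Python versions.
    rng = random.Random(master_seed)
    return [rng.randint(0, 10**7) for _ in range(n_seeds)]
```

Printing `make_seed_list()` in each environment would confirm whether the lists match; for an integer seed, CPython's Mersenne Twister output should be stable across versions and platforms.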
moriartyjm commented 2 years ago

@tomaszkosmala reports that

This suggests that the evalai closed-loop and open-loop phases might be identical. @masonlr can we check that the evalai closed-loop phase is using the closed-loop env.py please?

jia-chenhua commented 2 years ago

Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:

(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Traceback (most recent call last):
  File "docker/credentials/store.py", line 80, in _execute
  File "subprocess.py", line 411, in check_output
  File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 67, in main
  File "compose/cli/main.py", line 126, in perform_command
  File "compose/cli/main.py", line 1070, in up
  File "compose/cli/main.py", line 1066, in up
  File "compose/project.py", line 615, in up
  File "compose/service.py", line 346, in ensure_image_exists
  File "compose/service.py", line 1125, in build
  File "docker/api/build.py", line 261, in build
  File "docker/api/build.py", line 308, in _set_auth_headers
  File "docker/auth.py", line 302, in get_all_credentials
  File "docker/credentials/store.py", line 71, in list
  File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration".
[358537] Failed to execute script docker-compose
jia-chenhua commented 2 years ago

@tomaszkosmala reports that

  • However if an agent is trained in the open-loop env.py, local evaluation with the closed-loop env.py returns an error

This is perhaps because https://github.com/rangl-labs/netzerotc/blob/27e43fb1f8e61b3b053acd2773080a72f875b72d/local_agent_training_and_evaluation/evaluate.py#L10 is creating an open-loop env only. We could either change that line to env = gym.make("rangl:nztc-closed-loop-v0"), or rename the file to 'evaluate_open_loop.py' and add a duplicate with this change as 'evaluate_closed_loop.py'.

moriartyjm commented 2 years ago

Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:

...
[358537] Failed to execute script docker-compose

Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website has the instructions you need; you could try https://docs.docker.com/engine/install/

jia-chenhua commented 2 years ago

Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website has the instructions you need; you could try https://docs.docker.com/engine/install/

I tried following the instructions at https://docs.docker.com/engine/install/ubuntu/, but got stuck at the following:

(base) jia-chen@trainVM:~$ sudo apt-get remove docker docker-engine docker.io containerd runc
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg != 
E: The list of sources could not be read.
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg != 
E: The list of sources could not be read.
(base) jia-chen@trainVM:~$ sudo apt-get update
E: Conflicting values set for option Signed-By regarding source https://download.docker.com/linux/ubuntu/ focal: /usr/share/keyrings/docker-archive-keyring.gpg != 
E: The list of sources could not be read.
(base) jia-chen@trainVM:~$ 
masonlr commented 2 years ago

Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:

(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Traceback (most recent call last):
  File "docker/credentials/store.py", line 80, in _execute
  File "subprocess.py", line 411, in check_output
  File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 67, in main
  File "compose/cli/main.py", line 126, in perform_command
  File "compose/cli/main.py", line 1070, in up
  File "compose/cli/main.py", line 1066, in up
  File "compose/project.py", line 615, in up
  File "compose/service.py", line 346, in ensure_image_exists
  File "compose/service.py", line 1125, in build
  File "docker/api/build.py", line 261, in build
  File "docker/api/build.py", line 308, in _set_auth_headers
  File "docker/auth.py", line 302, in get_all_credentials
  File "docker/credentials/store.py", line 71, in list
  File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration".
[358537] Failed to execute script docker-compose

This is due to how AWS credentials are set up on the VM; Docker itself should be working correctly there.

@jia-chenhua can you please run the following:

$ env | grep AWS_DEFAULT_REGION
AWS_DEFAULT_REGION=us-east-1

If AWS_DEFAULT_REGION is not set, can you please set it using:

export AWS_DEFAULT_REGION=us-east-1

and ideally add this same line to the end of ~/.bashrc.

masonlr commented 2 years ago

@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.

dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}

This should fix the docker-compose issue above.

masonlr commented 2 years ago

Although I got the make build (as in https://github.com/rangl-labs/netzerotc/tree/meaningful_agent/meaningful_agent_submission) running correctly, I got the following error with docker-compose up -d --build on the VM:

...
[358537] Failed to execute script docker-compose

Based on a Google search, perhaps Docker is not yet installed on the VM? The Docker website would have the instructions needed, you could try https://docs.docker.com/engine/install/

@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.

masonlr commented 2 years ago

@tomaszkosmala reports that

* If an agent is trained in the open-loop `env.py`, it can be successfully submitted to both the open-loop and the closed-loop phases on evalai, and evaluates successfully in both phases with exactly the same score

* However if an agent is trained in the open-loop `env.py`, local evaluation with the closed-loop `env.py` returns an error

* If an agent is trained in the closed-loop `env.py`, evaluation in the closed-loop phase of evalai returns an error.

This suggests that the evalai closed-loop and open-loop phases might be identical. @masonlr can we check that the evalai closed-loop phase is using the closed-loop env.py please?

There is now a script that trains two models:

There are also now dedicated folders for closed-loop and open-loop, i.e.

Results from testing this locally currently make sense: the closed-loop agent gives reproducible results with the closed-loop environment. (Similarly, the open-loop agent gives reproducible results with the open-loop environment).

If you submit the closed-loop agent to the open-loop environment there is an error. (Similarly, if you submit the open-loop agent to the closed-loop environment there is an error.)


Pre-trained agent: Closed loop
Environment: Closed loop

Test 1:

Evaluation completed using 8 seeds.
Final average score:  83324.07849884033

Test 2:

Evaluation completed using 8 seeds.
Final average score:  83324.07849884033

Pre-trained agent: Open loop
Environment: Closed loop

INFO: evaluating seed 0 of 8
Traceback (most recent call last):
  File "/Users/lrmason/github.com/rangl-labs/netzerotc/meaningful_agent_submission/closed_loop/./test_container.py", line 27, in <module>
    submission = client.containers.run(
  File "/Users/lrmason/miniconda3/envs/dev/lib/python3.10/site-packages/docker/models/containers.py", line 848, in run
    raise ContainerError(
docker.errors.ContainerError: Command 'None' in image 'submission-closed-loop:v0.1.0' returned non-zero exit status 1

Interactive debugging gives:

root@87fcc9c273a2:/service# python agent.py
Traceback (most recent call last):
  File "agent.py", line 35, in <module>
    action, _ = model.predict(obs, deterministic=True)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 473, in predict
    return self.policy.predict(observation, state, mask, deterministic)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 281, in predict
    vectorized_env = is_vectorized_observation(observation, self.observation_space)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 226, in is_vectorized_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape (16,) for Box environment, please use (1,) or (n_env, 1) for the observation shape.

Pre-trained agent: Open loop
Environment: Open loop

Test 1:

Evaluation completed using 8 seeds.
Final average score:  -57440.163009643555

Test 2:

Evaluation completed using 8 seeds.
Final average score:  -57440.163009643555

Pre-trained agent: Closed loop
Environment: Open loop

INFO: evaluating seed 0 of 8
Traceback (most recent call last):
  File "/Users/lrmason/github.com/rangl-labs/netzerotc/meaningful_agent_submission/open_loop/./test_container.py", line 27, in <module>
    submission = client.containers.run(
  File "/Users/lrmason/miniconda3/envs/dev/lib/python3.10/site-packages/docker/models/containers.py", line 848, in run
    raise ContainerError(
docker.errors.ContainerError: Command 'None' in image 'submission-open-loop:v0.1.0' returned non-zero exit status 1

Interactive debugging gives:

root@de9557b4c7a4:/service# python agent.py
Traceback (most recent call last):
  File "agent.py", line 35, in <module>
    action, _ = model.predict(obs, deterministic=True)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 473, in predict
    return self.policy.predict(observation, state, mask, deterministic)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 281, in predict
    vectorized_env = is_vectorized_observation(observation, self.observation_space)
  File "/usr/local/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 226, in is_vectorized_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape (1,) for Box environment, please use (16,) or (n_env, 16) for the observation shape.
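The two tracebacks above are mirror images ((16,) observation against a (1,) space, and (1,) against a (16,) space), which points to an agent/environment pairing error rather than an SB3 bug. A guard before `model.predict()` could fail fast with a clearer message; the helper name and wording are hypothetical:

```python
import numpy as np

def check_obs_shape(obs, expected_shape):
    """Hypothetical guard: verify the observation matches the shape the
    loaded model was trained on, before calling model.predict()."""
    arr = np.asarray(obs, dtype=np.float32)
    if arr.shape != tuple(expected_shape):
        raise ValueError(
            f"agent/environment mismatch: observation shape {arr.shape}, "
            f"model expects {tuple(expected_shape)}; was the open-loop "
            f"model loaded into the closed-loop container (or vice versa)?"
        )
    return arr
```

In agent.py this could be called as `obs = check_obs_shape(obs, model.observation_space.shape)` right before the predict call.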
masonlr commented 2 years ago

To test the above with EvalAI, there is a temporary challenge that corresponds to the meaningful_agent branch with 8 seeds for the evaluation: see https://github.com/rangl-labs/netzerotc/blob/a4082c2467fde333e3298712a7a571ff9175f6f0/evaluation/evaluation_script/main.py#L59-L66

Running the combinations gives:


Pre-trained agent: Closed loop
Environment: Closed loop

evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4

The results file in the evalai frontend is:

{"Average Cost": -83324.07849884033}

Pre-trained agent: Open loop
Environment: Open loop

evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4

The results file in the evalai frontend is:

{"Average Cost": 57440.163009643555}

Pre-trained agent: Open loop
Environment: Closed loop

# after modifying the agent.py file to load the open loop model
# MODEL_PATH = "saved_models/MODEL_open_loop_0"
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4

Fails

test Submission ID: 22
Submission Status : failed
Execution Time (sec) : 6.2344
Submitted At : 01/11/22 03:39:11 AM

Pre-trained agent: Closed loop
Environment: Open loop

# after modifying the agent.py file to load the closed loop model
# MODEL_PATH = "saved_models/MODEL_closed_loop_0"
evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4

Fails

test Submission ID: 24
Submission Status : failed
Execution Time (sec) : 2.129368
Submitted At : 01/11/22 03:43:11 AM
moriartyjm commented 2 years ago

To test the above with EvalAI, there is a temporary challenge that corresponds to the meaningful_agent branch with 8 seeds for the evaluation: see

Many thanks for testing this. I have now hidden the old challenge on EvalAI and left visible the 'temporary' challenge, which we can now use as the main challenge (as it has been fully tested).

moriartyjm commented 2 years ago

Merging because fully tested. I will formally introduce the participants to EvalAI and the repo now and we can take care of enlarging the set of seeds later this week.

jia-chenhua commented 2 years ago

@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.

Hi @masonlr , I'm now testing it on trainVM (51.140.94.162) with username jia-chen, and here are the outputs and error message:

(base) jia-chen@trainVM:~$ env | grep AWS_DEFAULT_REGION
(base) jia-chen@trainVM:~$ export AWS_DEFAULT_REGION=us-east-1
(base) jia-chen@trainVM:~$ env | grep AWS_DEFAULT_REGION
AWS_DEFAULT_REGION=us-east-1
(base) jia-chen@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}
(base) jia-chen@trainVM:~$ cd github-repo/netzerotc/
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Traceback (most recent call last):
  File "docker/credentials/store.py", line 80, in _execute
  File "subprocess.py", line 411, in check_output
  File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/usr/bin/docker-credential-ecr-login', 'list']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 67, in main
  File "compose/cli/main.py", line 126, in perform_command
  File "compose/cli/main.py", line 1070, in up
  File "compose/cli/main.py", line 1066, in up
  File "compose/project.py", line 615, in up
  File "compose/service.py", line 346, in ensure_image_exists
  File "compose/service.py", line 1125, in build
  File "docker/api/build.py", line 261, in build
  File "docker/api/build.py", line 308, in _set_auth_headers
  File "docker/auth.py", line 302, in get_all_credentials
  File "docker/credentials/store.py", line 71, in list
  File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-ecr-login exited with "ecr: could not list credentials: ecr: Failed to get authorization token: MissingRegion: could not find region configuration".
[370477] Failed to execute script docker-compose
(base) jia-chen@trainVM:~/github-repo/netzerotc$ 

with a screenshot of the same error attached.

moriartyjm commented 2 years ago

@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.

@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?

jia-chenhua commented 2 years ago

@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.

@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?

I'm on Vodafone's broadband at home, and the upload speed is below 10 Mbps (about 1 MB/s). Not sure if this is faster or slower than copper. But anyway, since I don't have Docker working on Windows on my laptop (yet), submitting via VM is the only possible way for me.

masonlr commented 2 years ago

@moriartyjm which VM is this on? (trainVM?) Also, which user are you logged in as? (dev?) I've just tested docker-compose up from trainVM and it is working if I'm logged in as dev.

@masonlr I was just reporting Jia-Chen's experience. But actually it would be great if I could also submit via VM - at home we're still on copper (London eh!) so upload is slow. I can still login as dev to challenge1VM -- is this currently in use for anything important, or can I play with it?

@moriartyjm now has access to trainVM.

masonlr commented 2 years ago

@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.

dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}

This should fix the docker-compose issue above.

@jia-chenhua, you will need to remove the file at ~/.docker/config.json.
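If other registries still need entries in that file, an alternative to deleting it outright is removing just the `credsStore` key, so Docker stops invoking `docker-credential-ecr-login` while any other settings survive. A standard-library sketch (the default path is the usual Docker client config location; adjust as needed):

```python
import json
import pathlib

def drop_creds_store(path="~/.docker/config.json"):
    """Remove the credsStore entry from a Docker client config so
    builds stop shelling out to the ECR credential helper, while
    preserving every other key in the file."""
    p = pathlib.Path(path).expanduser()
    cfg = json.loads(p.read_text())
    cfg.pop("credsStore", None)
    p.write_text(json.dumps(cfg, indent=2))
```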

I can change user to your account and test this, i.e.

$ sudo su -- jia-chen
(base) jia-chen@trainVM:/home/dev/netzerotc$ cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}
(base) jia-chen@trainVM:/home/dev/netzerotc$ sudo rm -f /home/jia-chen/.docker/config.json
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Step 1/8 : FROM python:3.9-slim-buster
 ---> 2f2210ecbb1c
Step 2/8 : WORKDIR /service
 ---> Using cache
 ---> bf0b847c34f0
Step 3/8 : COPY rangl/requirements.txt .
...
jia-chenhua commented 2 years ago

@jia-chenhua, I've removed this file, which will turn off the connection to AWS ECR.

dev@trainVM:~$ sudo cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}

This should fix the docker-compose issue above.

@jia-chenhua, you will need to remove the file at ~/.docker/config.json.

I can change user to your account and test this, i.e.

$ sudo su -- jia-chen
(base) jia-chen@trainVM:/home/dev/netzerotc$ cat /home/jia-chen/.docker/config.json
{
  "credsStore": "ecr-login"
}
(base) jia-chen@trainVM:/home/dev/netzerotc$ sudo rm -f /home/jia-chen/.docker/config.json
(base) jia-chen@trainVM:~/github-repo/netzerotc$ docker-compose up -d --build
Building nztc
Step 1/8 : FROM python:3.9-slim-buster
 ---> 2f2210ecbb1c
Step 2/8 : WORKDIR /service
 ---> Using cache
 ---> bf0b847c34f0
Step 3/8 : COPY rangl/requirements.txt .
...

Thanks a lot! I can confirm it's working now, and I'll continue to test the submission.

jia-chenhua commented 2 years ago

@moriartyjm Now I'm pleased to confirm that the submission to http://submissions.rangl.org via trainVM is working properly:

I copied the brute-force-trained model 717 for the open loop to meaningful_agent_submission/open_loop/saved_models, modified meaningful_agent_submission/open_loop/agent.py to load it, and ran python test_container.py (after commenting out https://github.com/rangl-labs/netzerotc/blob/245a6072d748763d7013a7e680696429ea061b71/meaningful_agent_submission/open_loop/test_container.py#L37 with @masonlr's help; maybe we need to fix that and push it). I then got:

done True
c8be5137
DEBUG:__main__:Instance id: c8be5137
score: 720166.0662841797
Evaluation completed using 8 seeds.
Final average score:  649451.1227722168

on the trainVM. Then I ran evalai push submission-open-loop:v0.1.0 --phase nztc-open-loop-4 after logging in to http://submissions.rangl.org, where I can see my submission, and its "Stdout file" shows:

done True
6987374b
instance_id: 6987374b
{'score': {'value1': 720166.0662841797}}
mean_score 649451.1227722168
phase_codename open-loop
output {'result': [{'test_split': {'Average Cost': -649451.1227722168}}], 'submission_result': {'Average Cost': -649451.1227722168}}
Completed evaluation for open-loop phase

which matches the local results on the trainVM.
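Note the sign convention: test_container.py prints a positive mean score, while the EvalAI results file reports "Average Cost" as its negation. As far as these logs show, the aggregation is simply (illustrative sketch, inferred from the outputs above rather than from the evaluation source):

```python
def average_cost(seed_scores):
    """Negated mean of per-seed scores: the "Average Cost" value
    EvalAI reports for a submission evaluated over several seeds
    (sign flip inferred from the logs in this thread)."""
    return -sum(seed_scores) / len(seed_scores)
```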

For the closed loop, I copied the brute-force-trained model 694 to meaningful_agent_submission/closed_loop/saved_models, modified meaningful_agent_submission/closed_loop/agent.py to load it, and ran python test_container.py (the command="sleep infinity" line doesn't exist in https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_submission/closed_loop/test_container.py, so there was no need to comment it out). I then got:

099db9c1
done True
099db9c1
DEBUG:__main__:Instance id: 099db9c1
score: 704282.6439208984
Evaluation completed using 8 seeds.
Final average score:  501442.3190917969

on the trainVM. Then I ran evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-4 after logging in to http://submissions.rangl.org, where I can see my submission, and its "Stdout file" shows:

mean_score 501442.3190917969
phase_codename closed-loop
output {'result': [{'test_split': {'Average Cost': -501442.3190917969}}], 'submission_result': {'Average Cost': -501442.3190917969}}

which again matches the local results on the trainVM.

So now it seems to be working properly, although it's odd that the open-loop submission image is > 3 GB while the closed-loop submission image is only ~8.2 MB (update: according to @masonlr, the smaller image contains only incremental diffs, since there's a lot of overlap between the two images and Docker caches the common layers). My two submissions above are currently shown on the leaderboard. @moriartyjm, please let me know if I should hide them (update: @masonlr has now reset the EvalAI system, so all the data I submitted has been cleared; since submitting via trainVM works, I can redo it whenever needed).

jia-chenhua commented 2 years ago

Additional thoughts:

masonlr commented 2 years ago

The sleep infinity problem is fixed as part of #46.

masonlr commented 2 years ago

The README files will also be updated as part of #46.

moriartyjm commented 2 years ago

@jia-chenhua As a final check, could you please compare the following evaluation results:

jia-chenhua commented 2 years ago

@jia-chenhua As a final check, could you please compare the following evaluation results:

  • Evaluation locally without a container (e.g. simply using the evaluate.py script)
  • Evaluation locally in a container with test_container.py

Many thanks!

I didn't get the same results on trainVM when running https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_training/evaluate.py (after deleting all seeds in https://github.com/rangl-labs/netzerotc/blob/main/meaningful_agent_training/seeds.csv except the first 8) as when running test_container.py:

(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_training$ python evaluate.py
Mean reward of model closed_loop_694: 605108.5162811279
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_training$ python evaluate.py
Mean reward of model open_loop_717: 2022223.5828170776
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_submission/closed_loop$ python test_container.py
Final average score:  501442.3190917969
(base) jia-chen@trainVM:~/github-repo/netzerotc/meaningful_agent_submission/open_loop$ python test_container.py
Final average score:  649451.1227722168

Also, changing https://github.com/rangl-labs/netzerotc/blob/7b392a24bc0056bcdede35367ce5b8c07634ccd0/meaningful_agent_submission/closed_loop/agent.py#L9 and https://github.com/rangl-labs/netzerotc/blob/7b392a24bc0056bcdede35367ce5b8c07634ccd0/meaningful_agent_submission/open_loop/agent.py#L9 to other model numbers didn't change the result of python test_container.py: I tried MODEL_PATH = "saved_models/MODEL_open_loop_717" and MODEL_PATH = "saved_models/MODEL_closed_loop_694", but the results are exactly the same as with MODEL_PATH = "saved_models/MODEL_open_loop_0" and MODEL_PATH = "saved_models/MODEL_closed_loop_0". So something seems wrong in test_container.py and/or agent.py: it doesn't evaluate the model number set manually in agent.py.

jia-chenhua commented 2 years ago

Update: it's just that I need to run make build every time I change agent.py. And when there is a conflict between Docker image names, I need to delete the images first, after which make build takes more than 5 minutes, I think. That's why I tend to run only docker-compose up -d --build and not make build.

Now it's working properly (screenshots attached).

masonlr commented 2 years ago

Hi @jia-chenhua , you should only need to run the environment command once and leave it running in the background, i.e.

cd netzerotc
docker-compose up -d --build

For the agents, yes, we need to rebuild them when we make changes – otherwise the docker "tag", here submission-closed-loop:v0.1.0 for example, would still be pointing to the previously built image.

So the process is:

# make code changes
make build
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1

# make code changes
make build
evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1

# keep repeating

In practice, we would keep "bumping" up the version number in the Makefile so that we have a history of agents:

submission-closed-loop:v0.1.0
submission-closed-loop:v0.1.1
submission-closed-loop:v0.1.2

and then we would be able to submit these independently using

evalai push submission-closed-loop:v0.1.0 --phase nztc-closed-loop-1
evalai push submission-closed-loop:v0.1.1 --phase nztc-closed-loop-1
evalai push submission-closed-loop:v0.1.2 --phase nztc-closed-loop-1
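Bumping the tag for each submission can be scripted; a small helper (hypothetical, not part of the repo's Makefile) that increments the patch component of a `name:vX.Y.Z` image tag:

```python
def bump_patch(tag):
    """Increment the patch number of a Docker image tag like
    'submission-closed-loop:v0.1.0' -> 'submission-closed-loop:v0.1.1',
    so each rebuilt agent gets its own tag in the history."""
    name, version = tag.rsplit(":v", 1)
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{name}:v{major}.{minor}.{patch + 1}"
```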