google-deepmind / xmanager

A platform for managing machine learning experiments
Apache License 2.0

Help Building Docker Image #4

Closed joaogui1 closed 2 years ago

joaogui1 commented 3 years ago

Hey @andrewluchen can you help us build an image that runs dopamine? None of us has experience with it :(

andrewluchen commented 3 years ago

Hi Joao,

I can look into this. I'm not familiar with dopamine myself. Is this the library you are referring to? https://github.com/google/dopamine

If so, I can try submitting the cartpole example.

In terms of urgency, are you exploring a variety of frameworks, or is this the framework you will primarily be working with?

joaogui1 commented 3 years ago

Yes, this is the framework we'll be working with. Can you try MountainCar and LunarLander? I think one of them needs some special installs.

andrewluchen commented 3 years ago

Are you intending to modify dopamine or just use it as a library?

Try this as a starting point:

# Assumes `gin_file` and `FLAGS.gin_file` are defined earlier in the launcher.
spec = xm.PythonContainer(
    base_image='gcr.io/deeplearning-platform-release/base-cpu',
    docker_instructions=[
        'RUN apt update && apt install -y python3-opencv',
        'RUN pip install dopamine-rl',
        'RUN mkdir workdir',
        f'RUN wget -O workdir/{gin_file} https://raw.githubusercontent.com/google/dopamine/master/{FLAGS.gin_file}',
        'WORKDIR workdir',
    ],
    entrypoint=xm.ModuleName('dopamine.discrete_domains.train'),
)

andrewluchen commented 3 years ago

I put together an example that uploads tf events to Vertex Tensorboard:

https://github.com/deepmind/xmanager/blob/main/examples/dopamine/launcher.py
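
To make the `docker_instructions` list from the earlier snippet more concrete, here is a rough sketch of how a base image plus such a list could be assembled into Dockerfile text. `render_dockerfile` is a hypothetical helper written only for illustration; it is not part of xmanager, and the real image builder does more than this.

```python
from typing import List


def render_dockerfile(base_image: str, instructions: List[str]) -> str:
    """Join a base image and raw Dockerfile instructions into one string.

    Hypothetical helper for illustration only; not xmanager's implementation.
    """
    lines = [f'FROM {base_image}', '']
    lines.extend(instructions)
    return '\n'.join(lines) + '\n'


print(render_dockerfile(
    'gcr.io/deeplearning-platform-release/base-cpu',
    [
        'RUN apt update && apt install -y python3-opencv',
        'RUN pip install dopamine-rl',
        'RUN mkdir workdir',
        'WORKDIR workdir',
    ],
))
```

Each string in the list is copied through unchanged, so anything that is valid in a Dockerfile (`RUN`, `COPY`, `WORKDIR`, and so on) can go there.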

joaogui1 commented 3 years ago

Thanks for all the help, @andrewluchen. Do you think we could chat Thursday or Friday? I still have some questions about using xmanager, and I think showing them to you would be faster.

joaogui1 commented 3 years ago

Hey @andrewluchen, I got the following error when building the Docker image.

Dockerfile:

FROM gcr.io/deeplearning-platform-release/base-cu110

RUN apt update && apt install -y python3-opencv
RUN pip install dopamine-rl
COPY . workdir
WORKDIR workdir

COPY entrypoint.sh ./entrypoint.sh
RUN chmod +x ./entrypoint.sh
ENTRYPOINT ["./entrypoint.sh", "--env=cartpole", "--agent=dqn"]

...

 => [internal] load build definition from Dockerfile  0.3s
 => => transferring dockerfile: 381B  0.0s
 => [internal] load .dockerignore  0.3s
 => => transferring context: 2B  0.0s
 => [internal] load metadata for gcr.io/deeplearning-platform-release/base-cu110:latest  0.0s
 => [1/8] FROM gcr.io/deeplearning-platform-release/base-cu110  0.9s
 => [internal] load build context  0.2s
 => => transferring context: 510.23kB  0.0s
 => [2/8] RUN apt update && apt install -y python3-opencv  66.7s
 => [3/8] RUN pip install dopamine-rl  64.6s
 => [4/8] COPY . workdir  0.2s
 => [5/8] WORKDIR workdir  0.2s
 => [6/8] COPY entrypoint.sh ./entrypoint.sh  0.2s
 => [7/8] RUN chmod +x ./entrypoint.sh  0.4s
 => ERROR [8/8] RUN chmod +x ./wrapped_entrypoint.sh

But I don't know where we are mentioning an entrypoint.sh. Any ideas how to fix it?

andrewluchen commented 3 years ago

Could you pass --wrap_late_bindings=False to your command?

If you are only launching one job per experiment, late-binding wrapping won't be useful to you. The flag is primarily used to support distributed multi-host training, as it lets jobs share their addresses with each other, like this: https://github.com/deepmind/xmanager/blob/main/examples/cifar10_torch/launcher.py#L60

joaogui1 commented 3 years ago

Has xmanager been updated? I got this error: FATAL Flags parsing error: Unknown command line flag 'wrap_late_bindings'. Also, I sent you more sensitive details through email. Thanks for all the help!

andrewluchen commented 3 years ago

How are you launching? I cd'd into examples/ and ran this, which worked:

xmanager launch launcher.py -- --wrap_late_bindings=False

I also used the python cmd, which worked:

python3 launcher.py --wrap_late_bindings=False

joaogui1 commented 3 years ago

Looks like I was missing the first -- (before --wrap_late_bindings), it's working now, thanks!
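
For context, a bare `--` is a common command-line convention for separating a tool's own arguments from the ones forwarded to the script it runs. Below is a minimal sketch of that splitting, assuming a hypothetical helper `split_passthrough`; it is not xmanager's actual argument handling.

```python
from typing import List, Tuple


def split_passthrough(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare '--'.

    Hypothetical helper for illustration only; not part of xmanager.
    Everything before '--' belongs to the tool, everything after it
    is forwarded to the script.
    """
    if '--' in argv:
        i = argv.index('--')
        return argv[:i], argv[i + 1:]
    return list(argv), []


tool_args, script_args = split_passthrough(
    ['launch', 'launcher.py', '--', '--wrap_late_bindings=False'])
print(tool_args)    # ['launch', 'launcher.py']
print(script_args)  # ['--wrap_late_bindings=False']
```

In this sketch, leaving out the separator means the flag stays with the tool's own arguments instead of being forwarded to the script.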

joaogui1 commented 3 years ago

Now my job errored as follows: [image]

joaogui1 commented 3 years ago

OK, it seems this is just because my main file wasn't copied to Google Cloud.

joaogui1 commented 3 years ago

How do I tell it to copy my files to GCP @andrewluchen? I thought I just needed to pass path="."

joaogui1 commented 3 years ago

Any idea why xmanager isn't copying the whole directory, @andrewluchen?

andrewluchen commented 3 years ago

path="." should copy the entire directory that launcher.py is in. For example, if you have something like /home/user/project/launcher.py and you run xmanager launch launcher.py, you should expect the entire contents of /home/user/project/ to be copied into your image.

Is that not the behavior that you observe? What is your directory structure like, and what does your launcher script look like?

Could you also email me the URL of the image so I can check what was copied?
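
One way to compare against what ends up in the image is to list every file under the launcher's directory locally. The sketch below uses only the Python standard library; `list_build_context` is a hypothetical name, and it makes no assumption about which files xmanager itself includes or skips.

```python
import os
from typing import List


def list_build_context(root: str) -> List[str]:
    """Return sorted paths, relative to root, of all files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root))
    return sorted(paths)


if __name__ == '__main__':
    # Run this from the directory that contains launcher.py.
    for rel in list_build_context('.'):
        print(rel)
```

Any file that appears in this listing but is missing from the image would narrow down where the copy goes wrong.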

joaogui1 commented 3 years ago

.
├── agents
│   ├── dqn_agent_new.py
│   ├── external_configurations.py
│   ├── implicit_quantile_agent_new.py
│   ├── networks_new.py
│   ├── quantile_agent_new.py
│   └── rainbow_agent_new.py
├── Configs
│   ├── dqn_acrobot.gin
│    ... 
├── example.py
├── full_replay.py
├── __init__.py
├── launcher.py
├── main_offline_experiments.py
├── minatar_env.py
├── networks_new.py
├── off-lineish.py
├── offrunner.py
├── __pycache__
│   ├── launcher.cpython-38.pyc
│   └── launcher.cpython-39.pyc
├── replay_runner.py
├── seedmain_code_experiments.py
└── xtests.py

(xtests was previously xmanager_exp. I tried changing the name to avoid underscores, but it didn't help.) I will email you the URL.

andrewluchen commented 2 years ago

Closing old issues.