ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[<Ray component: Core|Cluster>] Documentation instructions for mounting AWS EFS Fails for Ray Cluster #28057

Open davejscott opened 2 years ago

davejscott commented 2 years ago

What happened + What you expected to happen

Following this documentation page: https://docs.ray.io/en/latest/cluster/aws-tips.html#

I'm trying to mount an EFS volume on a Ray cluster. When I follow the documentation instructions above, I'm not able to mount the EFS directory. The expected behaviour is being able to mount the EFS drive and use it. I used the following AWS documentation to set up my EFS, including the security groups: https://docs.aws.amazon.com/efs/latest/ug/wt1-create-ec2-resources.html. From the output below, the problem appears to be in mounting the drive itself.

If you use the ubuntu user, the drive can be mounted. When using ray attach maskrcnn-efs-mount.yaml and trying to mount the drive manually, it gives the same result: mount.nfs4: Operation not permitted.
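As a sanity check that the security groups aren't the problem, NFS reachability can be tested from the head node before mounting. A minimal sketch (assuming nc is available on the node, and using the file system DNS name from the config below):

ray attach maskrcnn-efs-mount.yaml
# on the node: the EFS mount target must accept TCP 2049 (NFS)
nc -zv fs-09735ae485f11eba2.efs.us-west-2.amazonaws.com 2049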

Example YAML:

# A unique identifier for the head node and workers of this cluster.
cluster_name: ray-poc-maskrcnn

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
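# E.g., with upscaling_speed 1.0 and 2 nodes currently running, the autoscaler
# can request up to 1.0 * 2 = 2 additional nodes in one scaling round.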

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: False
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes will be launched in the first listed availability zone and will
    # be tried in the subsequent availability zones if launching fails.
    availability_zone: us-west-2a #,us-west-2b
    # Whether to allow node reuse. If set to False, nodes will be terminated
    # instead of stopped.
    cache_stopped_nodes: True # If not present, the default is True.

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below.
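# For example (hypothetical path and key-pair name):
#     ssh_private_key: /path/to/your/key.pem
# and, in each node_config below:
#     KeyName: your-key-pair-name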

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            SecurityGroupIds: [sg-01cb77c1bc9de6e42]
            SubnetIds: [subnet-0ae99d7d]
            InstanceType: m5.large
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # You can provision additional disk space with a conf as follows
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 200
            # Additional options in the boto docs.
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
        # You can also set custom resources.
        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        resources: {"GPU": 1}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
        # For more documentation on available fields, see:
        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
        node_config:
            SecurityGroupIds: [sg-01cb77c1bc9de6e42]
            SubnetIds: [subnet-0ae99d7d]
            InstanceType: g5.2xlarge
            ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30
            # Run workers on spot by default. Comment this out to use on-demand.
            # NOTE: If relying on spot instances, it is best to specify multiple different instance
            # types to avoid interruption when one instance type is experiencing heightened demand.
            # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/
            #InstanceMarketOptions:
            #    MarketType: spot
                # Additional options can be found in the boto docs, e.g.
                #   SpotOptions:
                #       MaxPrice: MAX_HOURLY_PRICE
            # Additional options in the boto docs.

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
    #
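    # For example:
    # "/path1/on/remote/machine": "/path1/on/local/machine",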
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands: #[]
    - sudo apt-get install -y lsof
    - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
        sudo pkill -9 apt-get;
        sudo pkill -9 dpkg;
        sudo dpkg --configure -a;
        sudo apt-get -y install binutils;
        cd $HOME;
        git clone https://github.com/aws/efs-utils;
        cd $HOME/efs-utils;
        ./build-deb.sh;
        sudo apt-get -y install ./build/amazon-efs-utils*deb;
        cd $HOME;
        mkdir efs-mount;
        sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-09735ae485f11eba2.efs.us-west-2.amazonaws.com:/ efs-mount;
        sudo chmod 777 efs-mount;  
    - pip install --upgrade pip
    - pip uninstall horovod -y
    - pip uninstall tensorflow -y
    - pip install tensorflow==2.4.0
    - HOROVOD_WITH_TENSORFLOW=1 pip install horovod[tensorflow] --no-cache-dir -v
    - pip install -r /home/maskrcnn/requirements.txt

    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

Outcome after running

ray up maskrcnn-efs-mount.yaml

    (0/8) sudo apt-get install -y lsof
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  lsof
0 upgraded, 1 newly installed, 0 to remove and 8 not upgraded.
Need to get 248 kB of archives.
After this operation, 451 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 lsof amd64 4.89+dfsg-0.1 [248 kB]
Fetched 248 kB in 1s (355 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package lsof.
(Reading database ... 40638 files and directories currently installed.)
Preparing to unpack .../lsof_4.89+dfsg-0.1_amd64.deb ...
Unpacking lsof (4.89+dfsg-0.1) ...
Setting up lsof (4.89+dfsg-0.1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
Shared connection to 34.216.166.249 closed.
    (1/8) sudo kill -9 `sudo lsof /var/l...

Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
Reading package lists... Done
Building dependency tree       
Reading state information... Done
binutils is already the newest version (2.30-21ubuntu1~18.04.7).
binutils set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 8 not upgraded.
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
Cloning into 'efs-utils'...
remote: Enumerating objects: 1233, done.
remote: Counting objects: 100% (402/402), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 1233 (delta 321), reused 352 (delta 295), pack-reused 831
Receiving objects: 100% (1233/1233), 477.47 KiB | 5.82 MiB/s, done.
Resolving deltas: 100% (786/786), done.
+ pwd
+ BASE_DIR=/home/ray/efs-utils
+ BUILD_ROOT=/home/ray/efs-utils/build/debbuild
+ VERSION=1.33.3
+ RELEASE=1
+ DEB_SYSTEM_RELEASE_PATH=/etc/os-release
+ echo Cleaning deb build workspace
Cleaning deb build workspace
+ rm -rf /home/ray/efs-utils/build/debbuild
+ mkdir -p /home/ray/efs-utils/build/debbuild
+ echo Creating application directories
Creating application directories
+ mkdir -p /home/ray/efs-utils/build/debbuild/etc/amazon/efs
+ mkdir -p /home/ray/efs-utils/build/debbuild/etc/init/
+ mkdir -p /home/ray/efs-utils/build/debbuild/etc/systemd/system
+ mkdir -p /home/ray/efs-utils/build/debbuild/sbin
+ mkdir -p /home/ray/efs-utils/build/debbuild/usr/bin
+ mkdir -p /home/ray/efs-utils/build/debbuild/var/log/amazon/efs
+ mkdir -p /home/ray/efs-utils/build/debbuild/usr/share/man/man8
+ echo Copying application files
Copying application files
+ install -p -m 644 dist/amazon-efs-mount-watchdog.conf /home/ray/efs-utils/build/debbuild/etc/init
+ install -p -m 644 dist/amazon-efs-mount-watchdog.service /home/ray/efs-utils/build/debbuild/etc/systemd/system
+ install -p -m 444 dist/efs-utils.crt /home/ray/efs-utils/build/debbuild/etc/amazon/efs
+ install -p -m 644 dist/efs-utils.conf /home/ray/efs-utils/build/debbuild/etc/amazon/efs
+ install -p -m 755 src/mount_efs/__init__.py /home/ray/efs-utils/build/debbuild/sbin/mount.efs
+ install -p -m 755 src/watchdog/__init__.py /home/ray/efs-utils/build/debbuild/usr/bin/amazon-efs-mount-watchdog
+ echo Copying install scripts
Copying install scripts
+ install -p -m 755 dist/scriptlets/after-install-upgrade /home/ray/efs-utils/build/debbuild/postinst
+ install -p -m 755 dist/scriptlets/before-remove /home/ray/efs-utils/build/debbuild/prerm
+ install -p -m 755 dist/scriptlets/after-remove /home/ray/efs-utils/build/debbuild/postrm
+ echo Copying control file
Copying control file
+ install -p -m 644 dist/amazon-efs-utils.control /home/ray/efs-utils/build/debbuild/control
+ echo Copying conffiles
Copying conffiles
+ install -p -m 644 dist/amazon-efs-utils.conffiles /home/ray/efs-utils/build/debbuild/conffiles
+ echo Copying manpages
Copying manpages
+ install -p -m 644 man/mount.efs.8 /home/ray/efs-utils/build/debbuild/usr/share/man/man8/mount.efs.8
+ echo Creating deb binary file
Creating deb binary file
+ echo 2.0
+ echo Setting permissions
Setting permissions
+ find /home/ray/efs-utils/build/debbuild -type d
+ xargs chmod 755
+ echo Creating tar
Creating tar
+ cd /home/ray/efs-utils/build/debbuild
+ tar czf control.tar.gz control conffiles postinst prerm postrm --owner=0 --group=0
+ tar czf data.tar.gz etc sbin usr var --owner=0 --group=0
+ cd /home/ray/efs-utils
+ echo Building deb
Building deb
+ DEB=/home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb
+ ar r /home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb /home/ray/efs-utils/build/debbuild/debian-binary
ar: creating /home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb
+ ar r /home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb /home/ray/efs-utils/build/debbuild/control.tar.gz
+ ar r /home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb /home/ray/efs-utils/build/debbuild/data.tar.gz
+ echo Copying deb to output directory
Copying deb to output directory
+ cp /home/ray/efs-utils/build/debbuild/amazon-efs-utils-1.33.3-1_all.deb build/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'amazon-efs-utils' instead of './build/amazon-efs-utils-1.33.3-1_all.deb'
The following additional packages will be installed:
  dmsetup keyutils libcap2 libdevmapper1.02.1 libnfsidmap2 libtirpc1 libwrap0 netbase nfs-common rpcbind stunnel4
Suggested packages:
  open-iscsi watchdog logcheck-database
The following NEW packages will be installed:
  amazon-efs-utils dmsetup keyutils libcap2 libdevmapper1.02.1 libnfsidmap2 libtirpc1 libwrap0 netbase nfs-common rpcbind stunnel4
0 upgraded, 12 newly installed, 0 to remove and 8 not upgraded.
Need to get 823 kB/879 kB of archives.
After this operation, 2761 kB of additional disk space will be used.
Get:1 /home/ray/efs-utils/build/amazon-efs-utils-1.33.3-1_all.deb amazon-efs-utils all 1.33.3 [56.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libwrap0 amd64 7.6.q-27 [46.3 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 netbase all 5.4 [12.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 stunnel4 amd64 3:5.44-1ubuntu3 [151 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libdevmapper1.02.1 amd64 2:1.02.145-4.1ubuntu3.18.04.3 [127 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 dmsetup amd64 2:1.02.145-4.1ubuntu3.18.04.3 [74.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic/main amd64 libcap2 amd64 1:2.25-1.2 [13.0 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libnfsidmap2 amd64 0.25-5.1 [27.2 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libtirpc1 amd64 0.2.5-1.2ubuntu0.1 [75.7 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 rpcbind amd64 0.2.3-0.6ubuntu0.18.04.4 [42.1 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 keyutils amd64 1.5.9-9.2ubuntu2.1 [48.1 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 nfs-common amd64 1:1.3.4-2.1ubuntu5.5 [206 kB]
Fetched 823 kB in 3s (320 kB/s)      
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libwrap0:amd64.
(Reading database ... 40663 files and directories currently installed.)
Preparing to unpack .../00-libwrap0_7.6.q-27_amd64.deb ...
Unpacking libwrap0:amd64 (7.6.q-27) ...
Selecting previously unselected package netbase.
Preparing to unpack .../01-netbase_5.4_all.deb ...
Unpacking netbase (5.4) ...
Selecting previously unselected package stunnel4.
Preparing to unpack .../02-stunnel4_3%3a5.44-1ubuntu3_amd64.deb ...
Unpacking stunnel4 (3:5.44-1ubuntu3) ...
Selecting previously unselected package libdevmapper1.02.1:amd64.
Preparing to unpack .../03-libdevmapper1.02.1_2%3a1.02.145-4.1ubuntu3.18.04.3_amd64.deb ...
Unpacking libdevmapper1.02.1:amd64 (2:1.02.145-4.1ubuntu3.18.04.3) ...
Selecting previously unselected package dmsetup.
Preparing to unpack .../04-dmsetup_2%3a1.02.145-4.1ubuntu3.18.04.3_amd64.deb ...
Unpacking dmsetup (2:1.02.145-4.1ubuntu3.18.04.3) ...
Selecting previously unselected package libcap2:amd64.
Preparing to unpack .../05-libcap2_1%3a2.25-1.2_amd64.deb ...
Unpacking libcap2:amd64 (1:2.25-1.2) ...
Selecting previously unselected package libnfsidmap2:amd64.
Preparing to unpack .../06-libnfsidmap2_0.25-5.1_amd64.deb ...
Unpacking libnfsidmap2:amd64 (0.25-5.1) ...
Selecting previously unselected package libtirpc1:amd64.
Preparing to unpack .../07-libtirpc1_0.2.5-1.2ubuntu0.1_amd64.deb ...
Unpacking libtirpc1:amd64 (0.2.5-1.2ubuntu0.1) ...
Selecting previously unselected package rpcbind.
Preparing to unpack .../08-rpcbind_0.2.3-0.6ubuntu0.18.04.4_amd64.deb ...
Unpacking rpcbind (0.2.3-0.6ubuntu0.18.04.4) ...
Selecting previously unselected package keyutils.
Preparing to unpack .../09-keyutils_1.5.9-9.2ubuntu2.1_amd64.deb ...
Unpacking keyutils (1.5.9-9.2ubuntu2.1) ...
Selecting previously unselected package nfs-common.
Preparing to unpack .../10-nfs-common_1%3a1.3.4-2.1ubuntu5.5_amd64.deb ...
Unpacking nfs-common (1:1.3.4-2.1ubuntu5.5) ...
Selecting previously unselected package amazon-efs-utils.
Preparing to unpack .../11-amazon-efs-utils-1.33.3-1_all.deb ...
Unpacking amazon-efs-utils (1.33.3) ...
Setting up libnfsidmap2:amd64 (0.25-5.1) ...
Setting up libcap2:amd64 (1:2.25-1.2) ...
Setting up keyutils (1.5.9-9.2ubuntu2.1) ...
Setting up libdevmapper1.02.1:amd64 (2:1.02.145-4.1ubuntu3.18.04.3) ...
Setting up libtirpc1:amd64 (0.2.5-1.2ubuntu0.1) ...
Setting up dmsetup (2:1.02.145-4.1ubuntu3.18.04.3) ...
Setting up libwrap0:amd64 (7.6.q-27) ...
Setting up rpcbind (0.2.3-0.6ubuntu0.18.04.4) ...
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up netbase (5.4) ...
Setting up nfs-common (1:1.3.4-2.1ubuntu5.5) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)
debconf: falling back to frontend: Readline

Creating config file /etc/idmapd.conf with new version
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)
debconf: falling back to frontend: Readline
Adding system user `statd' (UID 102) ...
Adding new user `statd' (UID 102) with group `nogroup' ...
Not creating home directory `/var/lib/nfs'.
/var/lib/dpkg/info/nfs-common.postinst: 49: /var/lib/dpkg/info/nfs-common.postinst: systemctl: not found
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up stunnel4 (3:5.44-1ubuntu3) ...
Warning: The home dir /var/run/stunnel4 you specified can't be accessed: No such file or directory
Adding system user `stunnel4' (UID 103) ...
Adding new group `stunnel4' (GID 103) ...
Adding new user `stunnel4' (UID 103) with group `stunnel4' ...
Not creating home directory `/var/run/stunnel4'.
invoke-rc.d: could not determine current runlevel
invoke-rc.d: policy-rc.d denied execution of start.
Setting up amazon-efs-utils (1.33.3) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for libc-bin (2.27-3ubuntu1.6) ...
W: Target Packages (main/binary-amd64/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
W: Target Packages (main/binary-all/Packages) is configured multiple times in /etc/apt/sources.list.d/kubernetes.list:1 and /etc/apt/sources.list.d/kubernetes.list:2
mount.nfs4: Operation not permitted

Versions / Dependencies

Ray 1.13.0 Python 3.7

Reproduction script

Running ray up maskrcnn-efs-mount.yaml with the YAML file attached above should reproduce the issue.

Issue Severity

High: It blocks me from completing my task.

davejscott commented 2 years ago

@DmitriGekhtman tagging you here for this particular bug.

barrettje commented 1 year ago

I was able to get an EFS share mounted in the Docker container with the following modifications to my YAML file.

To get the EFS mounted on the EC2 host, I moved the EFS setup to initialization_commands. I also had to install a newer version of Python 3: the image's default python3 is 3.6, and the deprecation of Python 3.6 causes the mount command to fail.

initialization_commands: 
  - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
      sudo pkill -9 apt-get;
      sudo pkill -9 dpkg;
      sudo dpkg --configure -a;
      sudo apt-get -y install python3.8;
      sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1;
      sudo apt-get -y install binutils;
      cd $HOME;
      git clone https://github.com/aws/efs-utils;
      cd $HOME/efs-utils;
      ./build-deb.sh;
      sudo apt-get -y install ./build/amazon-efs-utils*deb;
      cd $HOME;
      sudo mkdir /mnt/efs;
      sudo mount -t efs { EFS File System ID }:/ /mnt/efs;
      sudo chmod 777 /mnt/efs;

The following will bind-mount the /mnt/efs directory on the EC2 host to /mnt/efs in the container:

docker:
  ....
  run_options:
    - --volume /mnt/efs:/mnt/efs
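
With both changes applied, a quick way to verify the mount from outside the cluster (a sketch reusing the cluster YAML name from this thread; ray exec runs the command inside the container):

ray exec maskrcnn-efs-mount.yaml 'df -h /mnt/efs'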

yang0110 commented 1 year ago

Many thanks for the solution. I managed to mount EFS to the EC2 instance and the container.

yang0110 commented 1 year ago

One comment: if Amazon Linux 2 is used, apt-get has to be replaced with yum.
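
For reference, a rough sketch of the equivalent initialization commands on Amazon Linux 2 (untested here; on Amazon Linux 2, amazon-efs-utils is packaged in the yum repositories, so cloning and building the deb is unnecessary):

initialization_commands:
  - sudo yum install -y amazon-efs-utils;
      sudo mkdir -p /mnt/efs;
      sudo mount -t efs { EFS File System ID }:/ /mnt/efs;
      sudo chmod 777 /mnt/efs;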