kislyuk / aegea

Amazon Web Services Operator Interface
Apache License 2.0

How does aegea launch batch jobs? #52

Closed MrOlm closed 4 years ago

MrOlm commented 4 years ago

Hello,

I'm trying to do some local testing of my AWS Batch jobs as submitted through aegea, and I often get different results when running commands through aegea vs. running them locally. Here's an example of how I run the commands locally for testing:

docker run DOCKER_IMAGE_LOCATION /bin/bash -c "COMMAND_PASSED TO AEGEA HERE"

Given a Docker image location and a command, how does aegea run them through AWS Batch?

Thank you in advance, Matt

MrOlm commented 4 years ago

In batch.py I see a lot of preamble stuff:


env_mgr_shellcode = """
set -a
if [ -f /etc/environment ]; then source /etc/environment; fi
if [ -f /etc/default/locale ]; then source /etc/default/locale; else export LC_ALL=C.UTF-8 LANG=C.UTF-8; fi
export AWS_DEFAULT_REGION={region}
set +a
if [ -f /etc/profile ]; then source /etc/profile; fi
set -euo pipefail
"""

apt_mgr_shellcode = """
sed -i -e "s|/archive.ubuntu.com|/{region}.ec2.archive.ubuntu.com|g" /etc/apt/sources.list
apt-get update -qq"""

ebs_vol_mgr_shellcode = apt_mgr_shellcode + """
apt-get install -qqy --no-install-suggests --no-install-recommends httpie awscli jq lsof python3-virtualenv > /dev/null
python3 -m virtualenv -q --python=python3 /opt/aegea-venv
/opt/aegea-venv/bin/pip install -q argcomplete requests boto3 tweak pyyaml
/opt/aegea-venv/bin/pip install -q --no-deps aegea=={aegea_version}
aegea_ebs_cleanup() {{ echo Detaching EBS volume $aegea_ebs_vol_id; cd /; /opt/aegea-venv/bin/aegea ebs detach --unmount --force --delete $aegea_ebs_vol_id; }}
trap aegea_ebs_cleanup EXIT
aegea_ebs_vol_id=$(/opt/aegea-venv/bin/aegea ebs create --size-gb {size_gb} --volume-type {volume_type} --tags managedBy=aegea batchJobId=$AWS_BATCH_JOB_ID --attach --format ext4 --mount {mountpoint} | jq -r .VolumeId)
"""  # noqa

efs_vol_shellcode = """mkdir -p {efs_mountpoint}
MAC=$(curl http://169.254.169.254/latest/meta-data/mac)
export SUBNET_ID=$(curl http://169.254.169.254/latest/meta-data/network/interfaces/macs/$MAC/subnet-id)
NFS_ENDPOINT=$(echo "$AEGEA_EFS_DESC" | jq -r ".[] | select(.SubnetId == env.SUBNET_ID) | .IpAddress")
mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 $NFS_ENDPOINT:/ {efs_mountpoint}"""

instance_storage_mgr_shellcode = apt_mgr_shellcode + """
apt-get install -qqy --no-install-suggests --no-install-recommends mdadm""" + instance_storage_shellcode

Is all of this run in the Docker container before the command passed to aegea? The specific problem that I'm facing is being unable to activate a conda environment in Docker before running my command (it works locally but not when run through aegea), and I'm wondering if all this virtual environment stuff could be the issue.

Thanks, -Matt

MrOlm commented 4 years ago

Never mind! I was able to sort this out myself by adjusting the dry-run parameters, as described in pull request #53.

-Matt

MrOlm commented 4 years ago

I have a related question: is there a reason that aegea attaches storage within the job itself, rather than specifying the storage in the job definition file and mounting it that way? Let me know if my question doesn't make sense and I can elaborate!

Thanks, Matt

MrOlm commented 4 years ago

The reason I ask is that I believe that, done this way, the queue you submit to cannot see the actual amount of resources the job is going to take (including hard drive space). This means that if the queue is near its space limit, the job will still try to go into the queue, but it will fail when it attempts to attach more space. I believe this is what's happening with my jobs:

$ aegea batch watch 690399ad-7f81-4df6-9655-dfbe169e488e
INFO:aegea:Watching job 690399ad-7f81-4df6-9655-dfbe169e488e (aegea_util_aws_batch_jd_a05498b1_1)
INFO:aegea:Job 690399ad-7f81-4df6-9655-dfbe169e488e FAILED
INFO:aegea:Job 690399ad-7f81-4df6-9655-dfbe169e488e log stream: aegea_util_aws_batch_jd_a05498b1/default/909d9485-13d1-43a4-a5af-4afddd66cac3
2020-03-13 23:15:17+00:00 debconf: delaying package configuration, since apt-utils is not installed
2020-03-13 23:15:34+00:00 Already using interpreter /usr/bin/python3
2020-03-13 23:15:34+00:00 /usr/lib/python3/dist-packages/virtualenv.py:1090: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
2020-03-13 23:15:34+00:00   import imp
2020-03-13 23:16:03+00:00 Traceback (most recent call last):
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/bin/aegea", line 23, in <module>
2020-03-13 23:16:03+00:00     aegea.main()
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/__init__.py", line 89, in main
2020-03-13 23:16:03+00:00     result = parsed_args.entry_point(parsed_args)
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/ebs.py", line 65, in create
2020-03-13 23:16:03+00:00     attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/ebs.py", line 128, in attach
2020-03-13 23:16:03+00:00     res = attach_volume(args)
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/ebs.py", line 89, in attach_volume
2020-03-13 23:16:03+00:00     Device=args.device)
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
2020-03-13 23:16:03+00:00     return self._make_api_call(operation_name, kwargs)
2020-03-13 23:16:03+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/botocore/client.py", line 626, in _make_api_call
2020-03-13 23:16:03+00:00     raise error_class(parsed_response, operation_name)
2020-03-13 23:16:03+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the AttachVolume operation (reached max retries: 4): Request limit exceeded.
2020-03-13 23:16:03+00:00 Detaching EBS volume vol-00e35b137e3aeee8f
2020-03-13 23:16:04+00:00 Traceback (most recent call last):
2020-03-13 23:16:04+00:00   File "/opt/aegea-venv/bin/aegea", line 23, in <module>
2020-03-13 23:16:04+00:00     aegea.main()
2020-03-13 23:16:04+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/__init__.py", line 89, in main
2020-03-13 23:16:04+00:00     result = parsed_args.entry_point(parsed_args)
2020-03-13 23:16:04+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/ebs.py", line 181, in detach
2020-03-13 23:16:04+00:00     subprocess.call(cmd.format(devnode=find_devnode(volume_id)), shell=True)
2020-03-13 23:16:04+00:00   File "/opt/aegea-venv/lib/python3.7/site-packages/aegea/ebs.py", line 113, in find_devnode
2020-03-13 23:16:04+00:00     attachment = resources.ec2.Volume(volume_id).attachments[0]
2020-03-13 23:16:04+00:00 IndexError: list index out of range
INFO:aegea:Job 690399ad-7f81-4df6-9655-dfbe169e488e: Essential container in task exited
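
The first traceback above shows `AttachVolume` giving up with `RequestLimitExceeded` after botocore's retries ran out. One common mitigation for throttling errors is to retry with capped exponential backoff and jitter. The sketch below is generic and is not aegea's code: `retry_with_backoff`, `ThrottledError`, and `flaky_attach` are hypothetical names, and the attempt count and delays are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as RequestLimitExceeded."""


def retry_with_backoff(func, max_attempts=8, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func(), retrying on ThrottledError with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller see the error
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage example with a fake call that is throttled twice, then succeeds.
calls = {"count": 0}


def flaky_attach():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottledError("Request limit exceeded")
    return "attached"


# Pass a no-op sleep so the demo does not actually wait.
result = retry_with_backoff(flaky_attach, sleep=lambda seconds: None)
print(result, calls["count"])
```

Random jitter spreads out retries from many jobs starting at once, which is the situation where a shared API rate limit is most likely to be hit.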

kislyuk commented 4 years ago

Right, great question. There are some complexities here that you have to take into account.

MrOlm commented 4 years ago

Awesome - thanks for all of this info Andrey. This is really helpful

-Matt

kislyuk commented 4 years ago

One more thing - when this happens, you will have orphan volumes left over from Batch jobs. Make sure to check your AWS account for orphan EBS volumes. You can do this at https://us-west-2.console.aws.amazon.com/ec2/v2/home#Volumes:state=available;tag:managedBy=aegea;sort=desc:createTime or by running aegea ebs ls --tag managedBy=aegea. You can delete them by selecting and deleting in the console or by running aegea rm $(aegea ebs ls --tag managedBy=aegea --json | jq -r '.[] | select(.state=="available") | .id') (it will ask you to confirm before deleting).
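
For readers less familiar with `jq`, the filter in the command above keeps only volumes whose `state` is `available` and prints their `id`. A rough Python equivalent, assuming `aegea ebs ls --json` emits a JSON array of objects with `id` and `state` keys (which is what the jq expression implies). The sample records and the `orphan_volume_ids` helper are invented for illustration:

```python
import json

# Invented sample standing in for the output of
# `aegea ebs ls --tag managedBy=aegea --json`.
sample_output = """
[
  {"id": "vol-0aaaaaaaaaaaaaaaa", "state": "available"},
  {"id": "vol-0bbbbbbbbbbbbbbbb", "state": "in-use"},
  {"id": "vol-0cccccccccccccccc", "state": "available"}
]
"""


def orphan_volume_ids(ls_json: str) -> list:
    """Return the IDs of volumes in the 'available' state (not attached to an instance)."""
    return [vol["id"] for vol in json.loads(ls_json) if vol.get("state") == "available"]


orphans = orphan_volume_ids(sample_output)
print(orphans)
```

Reviewing the resulting list before deleting anything is a cheap safeguard, since an `available` volume may still hold data from a job that failed partway through.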