kislyuk / aegea

Amazon Web Services Operator Interface
Apache License 2.0

aws batch submission leaves orphan volumes behind due to an uncaught exception #47

Closed · sunitj closed this 4 years ago

sunitj commented 5 years ago

I tried launching over 1500 jobs using aegea batch submit. In some cases, it resulted in the following error:

...
...
Processing triggers for libc-bin (2.24-11+deb9u4) ...
Creating volume
An error occurred (RequestLimitExceeded) when calling the CreateTags operation (reached max retries: 4): Request limit exceeded.

Since this error occurs right after the volume has been created, an uncaught tagging failure simply kills the instance but leaves an orphaned volume (or in my case, 90 of them!) behind.

kislyuk commented 5 years ago

Hi @sunitj, thank you for reporting this. I am working on a fix for this issue, which was also reported in the idseq project. Your report will cause me to allocate more of my time to the fix to accelerate it. I'll update here ASAP.

kislyuk commented 5 years ago

In March 2017, EC2 gained the ability to tag resources at creation time.

In the shellcode for self-managed EBS volume attachment, the volume is created and tags are applied before the cleanup exit trap is set: https://github.com/kislyuk/aegea/blob/master/aegea/util/aws/batch.py#L20.
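For illustration, here is a minimal boto3 sketch (not the shell implementation linked above) of tagging the volume atomically at creation time, which removes the separate CreateTags call that hit the request limit; the availability zone, size, and tag values are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Tags are applied as part of CreateVolume itself, so there is no separate
# CreateTags call that can fail with RequestLimitExceeded after the volume
# already exists.
res = ec2.create_volume(
    AvailabilityZone="us-west-2a",  # placeholder
    Size=500,                       # GiB, placeholder
    VolumeType="gp2",
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "managedBy", "Value": "aegea-batch"}],  # placeholder tags
    }],
)
print(res["VolumeId"])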

Plan:

kislyuk commented 5 years ago

This should be fixed in the latest release. The job should detach and destroy the volume upon any failure. I also recommend that instead of using EBS volumes, you look into using NVMe instance-attached storage. Automatic utilization of instance-attached storage is now available using the aegea batch submit --mount-instance-storage [mountpoint] option (where mountpoint is /mnt by default).

If you want to switch from EBS to instance storage, the steps are as follows:

kislyuk commented 5 years ago

This issue should be resolved now, please reopen it if you still encounter difficulties with orphaned volumes.

sunitj commented 5 years ago

Hi Andrey! Thank you for the quick turnaround! I updated to v2.6.6 today and retried my test job. However, the job failed again due to a (possibly) related error/bug. Here is the snippet of the error:

2019-09-16 18:10:08+00:00   Found existing installation: awscli 1.11.13
2019-09-16 18:10:08+00:00     Not uninstalling awscli at /usr/lib/python3/dist-packages, outside environment /usr
2019-09-16 18:10:10+00:00 Successfully installed aegea-2.6.6 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.238 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.228 botocore-1.12.228 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-16 18:10:56+00:00 Traceback (most recent call last):
2019-09-16 18:10:56+00:00   File "/usr/local/bin/aegea", line 23, in <module>
2019-09-16 18:10:56+00:00     aegea.main()
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-16 18:10:56+00:00     result = parsed_args.entry_point(parsed_args)
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-16 18:10:56+00:00     return attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 139, in attach
2019-09-16 18:10:56+00:00     logger.info("Formatting %s (%s)", args.volume_id, find_devnode(args.volume_id))
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 109, in find_devnode
2019-09-16 18:10:56+00:00     raise Exception("Could not find devnode for {}".format(volume_id))
2019-09-16 18:10:56+00:00 Exception: Could not find devnode for vol-00db2784fc73d40a3
2019-09-16 18:10:56+00:00 Detaching EBS volume
2019-09-16 18:10:57+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-16 18:10:57+00:00                         [--log-level {CRITICAL,DEBUG,WARNING,ERROR,INFO}]
2019-09-16 18:10:57+00:00                         [--unmount] [--delete] [--force] [--dry-run]
2019-09-16 18:10:57+00:00                         volume_id
2019-09-16 18:10:57+00:00 aegea ebs detach: error: the following arguments are required: volume_id

The complete error log is attached: job.log. I checked my volumes and vol-00db2784fc73d40a3 was in fact created (screenshot attached).

kislyuk commented 5 years ago

This feature is currently only compatible with m5/c5/r5 instance type families. I will look into making it compatible with other instance types, but I can't promise it. Please take a look at the suggested list of steps above.
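For context on why device naming matters here: on Nitro-based instance families such as m5/c5/r5, EBS volumes attach as NVMe block devices (/dev/nvme*) rather than /dev/xvd*, and the volume ID (minus its dash) is exposed as the NVMe controller serial in sysfs. A purely illustrative sketch of locating the device node that way (this is not aegea's find_devnode implementation):

import glob
import os

def find_nvme_devnode(volume_id):
    # On Nitro instances, EBS reports the volume ID without the dash
    # ("vol00db...") as the NVMe controller serial number.
    target = volume_id.replace("-", "")
    for serial_path in glob.glob("/sys/class/nvme/nvme*/serial"):
        with open(serial_path) as fh:
            serial = fh.read().strip()
        if serial == target:
            controller = os.path.basename(os.path.dirname(serial_path))  # e.g. "nvme1"
            return "/dev/{}n1".format(controller)  # first namespace of that controller
    raise Exception("Could not find devnode for {}".format(volume_id))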

kislyuk commented 5 years ago

OK, I tracked down the problem and released a fix in v2.6.8. Please try again - sorry about the breakage.

jamestwebber commented 5 years ago

Just tried it and got the same error.

kislyuk commented 5 years ago

Can you double check the output of aegea --version to make sure it says v2.6.8?

Are you running this command? aegea batch submit --queue aegea_batch --vcpus 16 --memory 64000 --ecr-image aligner --storage /mnt=500 --command 'PATH=$HOME/anaconda/bin:$PATH; cd utilities; git pull; git checkout master; python setup.py install; python -m utilities.alignment.run_star_and_htseq --taxon mm10-plus --num_partitions 100 --partition_id 0 --s3_input_path s3://czb-seqbot/fastqs/190906_A00111_0366_AHNKGFDSXX/ --s3_output_path s3://czb-maca/Plate_seq/parabiosis/190906_A00111_0366_AHNKGFDSXX/mm10/'

And your instance types are m3/c3/r3?

jamestwebber commented 5 years ago

aegea is version 2.6.8, but this job queue is not m3; it's set to optimal.

kislyuk commented 5 years ago

I'm sorry @jamestwebber, I did not properly test my fix, and managed to have a bug in the fix as well. Please try again with v2.6.9, which I think I tested properly and with success.

sunitj commented 5 years ago

Just tried v2.6.9 with the same commands as before. Instances were launched successfully. Thanks!

sunitj commented 5 years ago

Hi @kislyuk, it looks like the orphan volumes issue wasn't resolved completely in v2.6.9. Some of my jobs were killed due to the following error, and the volumes associated with them were left behind in the available state:

...
...
Successfully installed aegea-2.6.9 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.240 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.230 botocore-1.12.230 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-18 05:29:34+00:00 WARNING:aegea:BDM node xvdz is already in use, looking for next available node
2019-09-18 05:29:35+00:00 Traceback (most recent call last):
2019-09-18 05:29:35+00:00   File "/usr/local/bin/aegea", line 23, in <module>
2019-09-18 05:29:35+00:00     aegea.main()
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-18 05:29:35+00:00     result = parsed_args.entry_point(parsed_args)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-18 05:29:35+00:00     attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 126, in attach
2019-09-18 05:29:35+00:00     res = attach_volume(args)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 87, in attach_volume
2019-09-18 05:29:35+00:00     Device=args.device)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 357, in _api_call
2019-09-18 05:29:35+00:00     return self._make_api_call(operation_name, kwargs)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 661, in _make_api_call
2019-09-18 05:29:35+00:00     raise error_class(parsed_response, operation_name)
2019-09-18 05:29:35+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the AttachVolume operation (reached max retries: 4): Request limit exceeded.
2019-09-18 05:29:35+00:00 Detaching EBS volume
2019-09-18 05:29:36+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-18 05:29:36+00:00                         [--log-level {ERROR,CRITICAL,WARNING,INFO,DEBUG}]
2019-09-18 05:29:36+00:00                         [--unmount] [--delete] [--force] [--dry-run]
2019-09-18 05:29:36+00:00                         volume_id
2019-09-18 05:29:36+00:00 aegea ebs detach: error: the following arguments are required: volume_id
INFO:aegea:Job a0ac748d-da20-4b73-b63e-47330224c12d: Essential container in task exited

Here is the detailed log: a0ac748d-da20-4b73-b63e-47330224c12d.log

In this case, is it possible that the warning about the selected device (xvdz) not being available has something to do with it?

kislyuk commented 5 years ago

Thanks, I did not anticipate this error condition. The issue is not with xvdz itself (the script automatically chose the next available letter, xvdy), but that the failure to attach (due to too many concurrent requests) causes the volume ID not to be properly printed and saved for cleanup. I'll have a fix for you to try later today.

kislyuk commented 5 years ago

Released v2.6.11, which should address this issue by always saving the volume ID and trying to detach it even if errors are encountered in the volume management process. Please try again.
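For reference, a sketch of the cleanup pattern described here (the general idea, not aegea's exact code): record the volume ID as soon as CreateVolume returns, and detach/delete it whenever any later step fails, so no orphan is left behind. The availability zone is a placeholder:

import boto3

ec2 = boto3.client("ec2")

def create_and_attach_scratch_volume(instance_id, size_gb, device="/dev/xvdz"):
    # Save the volume ID immediately so cleanup can always find it.
    volume_id = ec2.create_volume(AvailabilityZone="us-west-2a",  # placeholder AZ
                                  Size=size_gb)["VolumeId"]
    try:
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
        # ... format and mount the volume here ...
        return volume_id
    except Exception:
        # Best-effort cleanup: detach and delete so the volume is not orphaned.
        # Note that the cleanup calls themselves can still hit API rate limits.
        try:
            ec2.detach_volume(VolumeId=volume_id, Force=True)
            ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
            ec2.delete_volume(VolumeId=volume_id)
        except Exception:
            pass
        raise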

kislyuk commented 5 years ago

Closing due to inactivity.

sunitj commented 5 years ago

Hello @kislyuk, apologies for the delay; I haven't been running many jobs since my last comment and hence could not test the updates. However, I recently started noticing an increase in orphan volumes again. It seems that if there is an error during execution, aegea catches the exception and tries to detach the correct volume, but is unable to, and eventually leaves an orphaned volume behind with the following message:

2019-10-09 22:36:57+00:00 Detaching EBS volume vol-0fc487d57716a4ac2
2019-10-09 22:37:07+00:00 Traceback (most recent call last):
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/bin/aegea", line 23, in <module>
2019-10-09 22:37:07+00:00     aegea.main()
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/__init__.py", line 89, in main
2019-10-09 22:37:07+00:00     result = parsed_args.entry_point(parsed_args)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/ebs.py", line 186, in detach
2019-10-09 22:37:07+00:00     Force=args.force)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 357, in _api_call
2019-10-09 22:37:07+00:00     return self._make_api_call(operation_name, kwargs)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 661, in _make_api_call
2019-10-09 22:37:07+00:00     raise error_class(parsed_response, operation_name)
2019-10-09 22:37:07+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DetachVolume operation (reached max retries: 4): Request limit exceeded.

Here is the detailed log and the command passed to the job (let me know if you need an unedited version). aegea version: 2.7.9

For comparison, here is a similar example in v2.6.11 where it seems to have worked: command, log

kislyuk commented 5 years ago

I suspect this has to do with concurrently launching many jobs, but I'll need to look at the logs more closely or experiment to confirm that.

Short of staggering the job launch, the main workaround for RequestLimitExceeded that I can think of is to change the botocore config to increase retries.
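A minimal sketch of that retry-configuration workaround with boto3/botocore (the retry count is a placeholder, and newer botocore releases also accept a retry "mode" such as "adaptive"):

import boto3
from botocore.config import Config

# Raise the client-side retry budget so transient RequestLimitExceeded
# responses are retried with backoff instead of surfacing as errors.
retry_config = Config(retries={"max_attempts": 10})  # placeholder value

ec2 = boto3.client("ec2", config=retry_config)

This only illustrates the botocore mechanism; the clients that actually issue the EC2 calls would need to be constructed with such a config.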

With that said, I encourage you to stop using EBS volumes for your jobs, and to use instance attached storage instead. Going forward I intend to focus on only supporting instance storage as scratch space. Is there a specific reason preventing you from using instance storage for scratch space using the options I described above?

sunitj commented 5 years ago

@kislyuk I haven't tested instance-attached storage yet. I guess my concern was the limited instance family support and the need to set up another compute environment with these instance families (we are nearing our hard limit of environments).

Neither of these are problems that we can't overcome. So, if that's where aegea is headed, then I'm happy to update my environment.

kislyuk commented 4 years ago

Closing this - please reopen if you still need support.