Hi @sunitj, thank you for reporting this. I am working on a fix for this issue, which was also reported in the idseq project. Your report prompts me to prioritize the fix and allocate more time to it. I'll update here ASAP.
In March 2017, EC2 gained the ability to tag resources at creation time.
In the shell code for self-managed EBS volume attachment, the volume is created and tags are applied before the cleanup exit trap is set: https://github.com/kislyuk/aegea/blob/master/aegea/util/aws/batch.py#L20.
Plan: tag the volume at creation time and set the cleanup exit trap before the volume is created.
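For illustration, a minimal sketch of that ordering (hypothetical bash using the AWS CLI, not the actual aegea shell code; the availability zone, size, and tag values are placeholders):

```bash
# Hypothetical sketch: register the cleanup trap first and tag the volume at
# creation time, so that any later failure (attaching, formatting, mounting)
# still triggers cleanup.
set -euo pipefail

vol_id=""
cleanup() {
    if [[ -n "$vol_id" ]]; then
        aws ec2 detach-volume --volume-id "$vol_id" || true
        # A real implementation would wait for the volume to reach the
        # "available" state before deleting it.
        aws ec2 delete-volume --volume-id "$vol_id" || true
    fi
}
trap cleanup EXIT

# AZ, size, and the tag key/value are placeholders.
vol_id=$(aws ec2 create-volume \
    --availability-zone "$AZ" --size 500 --volume-type gp2 \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=managedBy,Value=aegea-batch}]' \
    --query VolumeId --output text)
```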
This should be fixed in the latest release. The job should now detach and destroy the volume upon any failure. I also recommend that instead of using EBS volumes, you look into using NVMe instance-attached storage. Automatic utilization of instance-attached storage is now available using the aegea batch submit --mount-instance-storage [mountpoint] option (where mountpoint is /mnt by default).
If you want to switch from EBS to instance storage, the steps are as follows:
Locate suitable instance types at https://ec2instances.info. Enter "nvme" into the filter above the "Instance Storage" column. I recommend using instances from the m5d, r5d, or c5d family.
Create a new compute environment, or delete and re-create an existing compute environment, to utilize the instance types you selected (Batch compute environments don't support updating instance types in place).
If you've never worried about compute environments, and just ran aegea batch submit with the default settings, then aegea created a compute environment called aegea_batch for you. The new release of aegea uses spot m5d/r5d/c5d instances by default. To delete and re-create aegea_batch, run:
aegea batch delete-queue aegea_batch
aegea batch delete-compute-environment aegea_batch
aegea batch submit --command "df -h" --mount-instance-storage --watch
That last command will re-create the aegea_batch compute environment, launch a test job, format and mount your instance storage, and print this free space report as the last log line, which you can use to verify success:
2019-09-09 05:05:52+00:00 /dev/md0 92G 61M 87G 1% /mnt
This issue should be resolved now, please reopen it if you still encounter difficulties with orphaned volumes.
Hi Andrey! Thank you for the quick turnaround! I updated to v2.6.6 today and retried my test job. However, the job failed again due to a (possibly) related error/bug. Here is the snippet of the error:
2019-09-16 18:10:08+00:00 Found existing installation: awscli 1.11.13
2019-09-16 18:10:08+00:00 Not uninstalling awscli at /usr/lib/python3/dist-packages, outside environment /usr
2019-09-16 18:10:10+00:00 Successfully installed aegea-2.6.6 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.238 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.228 botocore-1.12.228 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-16 18:10:56+00:00 Traceback (most recent call last):
2019-09-16 18:10:56+00:00 File "/usr/local/bin/aegea", line 23, in <module>
2019-09-16 18:10:56+00:00 aegea.main()
2019-09-16 18:10:56+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-16 18:10:56+00:00 result = parsed_args.entry_point(parsed_args)
2019-09-16 18:10:56+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-16 18:10:56+00:00 return attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-16 18:10:56+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 139, in attach
2019-09-16 18:10:56+00:00 logger.info("Formatting %s (%s)", args.volume_id, find_devnode(args.volume_id))
2019-09-16 18:10:56+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 109, in find_devnode
2019-09-16 18:10:56+00:00 raise Exception("Could not find devnode for {}".format(volume_id))
2019-09-16 18:10:56+00:00 Exception: Could not find devnode for vol-00db2784fc73d40a3
2019-09-16 18:10:56+00:00 Detaching EBS volume
2019-09-16 18:10:57+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-16 18:10:57+00:00 [--log-level {CRITICAL,DEBUG,WARNING,ERROR,INFO}]
2019-09-16 18:10:57+00:00 [--unmount] [--delete] [--force] [--dry-run]
2019-09-16 18:10:57+00:00 volume_id
2019-09-16 18:10:57+00:00 aegea ebs detach: error: the following arguments are required: volume_id
The complete error log is attached.
job.log
I checked my volumes and vol-00db2784fc73d40a3 was in fact created.
This feature is currently only compatible with m5/c5/r5 instance type families. I will look into making it compatible with other instance types, but I can't promise it. Please take a look at the suggested list of steps above.
OK, I tracked down the problem and released a fix in v2.6.8. Please try again - sorry about the breakage.
Just tried it and got the same error.
Can you double check the output of aegea --version to make sure it says v2.6.8?
Are you running this command? aegea batch submit --queue aegea_batch --vcpus 16 --memory 64000 --ecr-image aligner --storage /mnt=500 --command 'PATH=$HOME/anaconda/bin:$PATH; cd utilities; git pull; git checkout master; python setup.py install; python -m utilities.alignment.run_star_and_htseq --taxon mm10-plus --num_partitions 100 --partition_id 0 --s3_input_path s3://czb-seqbot/fastqs/190906_A00111_0366_AHNKGFDSXX/ --s3_output_path s3://czb-maca/Plate_seq/parabiosis/190906_A00111_0366_AHNKGFDSXX/mm10/'
And your instance types are m3/c3/r3?
aegea is version 2.6.8, but this job queue is not m3, it's set to optimal.
I'm sorry @jamestwebber, I did not properly test my fix and managed to introduce a bug in the fix as well. Please try again with v2.6.9, which I believe I have now tested properly and successfully.
Just tried v2.6.9 with the same commands as before. Instances were launched successfully. Thanks!
Hi @kislyuk
It looks like the orphan volumes issue wasn't resolved completely in v2.6.9. Some of my jobs were killed due to the following error, and the volumes associated with them were left behind as available:
...
...
Successfully installed aegea-2.6.9 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.240 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.230 botocore-1.12.230 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-18 05:29:34+00:00 WARNING:aegea:BDM node xvdz is already in use, looking for next available node
2019-09-18 05:29:35+00:00 Traceback (most recent call last):
2019-09-18 05:29:35+00:00 File "/usr/local/bin/aegea", line 23, in <module>
2019-09-18 05:29:35+00:00 aegea.main()
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-18 05:29:35+00:00 result = parsed_args.entry_point(parsed_args)
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-18 05:29:35+00:00 attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 126, in attach
2019-09-18 05:29:35+00:00 res = attach_volume(args)
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 87, in attach_volume
2019-09-18 05:29:35+00:00 Device=args.device)
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 357, in _api_call
2019-09-18 05:29:35+00:00 return self._make_api_call(operation_name, kwargs)
2019-09-18 05:29:35+00:00 File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 661, in _make_api_call
2019-09-18 05:29:35+00:00 raise error_class(parsed_response, operation_name)
2019-09-18 05:29:35+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the AttachVolume operation (reached max retries: 4): Request limit exceeded.
2019-09-18 05:29:35+00:00 Detaching EBS volume
2019-09-18 05:29:36+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-18 05:29:36+00:00 [--log-level {ERROR,CRITICAL,WARNING,INFO,DEBUG}]
2019-09-18 05:29:36+00:00 [--unmount] [--delete] [--force] [--dry-run]
2019-09-18 05:29:36+00:00 volume_id
2019-09-18 05:29:36+00:00 aegea ebs detach: error: the following arguments are required: volume_id
INFO:aegea:Job a0ac748d-da20-4b73-b63e-47330224c12d: Essential container in task exited
Here is the detailed log: a0ac748d-da20-4b73-b63e-47330224c12d.log
In this case, is it possible that the warning about the selected device (xvdz) not being available has something to do with it?
Thanks, I did not anticipate this error condition. The issue is not about xvdz (the script automatically chose the next available letter, xvdy), but that the failure to attach (due to too many concurrent requests) causes the volume ID not to be properly printed and saved for cleanup. I'll have a fix for you to try later today.
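For reference, the "next available node" selection described above amounts to something like the following (a hypothetical Python sketch, not the aegea implementation; the function name is made up):

```python
# Hypothetical sketch: walk device letters from xvdz downward and pick the
# first node not already present in the instance's block device mapping.
import string

def next_available_devnode(used_nodes):
    """used_nodes: device names already in use, e.g. {"/dev/xvda", "/dev/xvdz"}."""
    for letter in reversed(string.ascii_lowercase):
        candidate = "/dev/xvd" + letter
        if candidate not in used_nodes:
            return candidate
    raise RuntimeError("no free xvd* device nodes")

print(next_available_devnode({"/dev/xvda", "/dev/xvdz"}))  # -> /dev/xvdy
```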
Released v2.6.11, which should address this issue by always saving the volume ID and trying to detach it even if errors are encountered in the volume management process. Please try again.
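Conceptually, the cleanup behavior described here follows a try/finally pattern along these lines (a hedged boto3 sketch, not the actual aegea code; run_with_scratch_volume is a made-up name, and the real tool drives cleanup through the aegea ebs detach command shown in the logs above):

```python
# Hypothetical sketch: record the volume ID as soon as the volume exists, and
# always attempt cleanup in a finally block, even if attach/format/mount or
# the job itself fails.
import boto3

ec2 = boto3.client("ec2")

def run_with_scratch_volume(availability_zone, size_gb, work):
    volume_id = None
    try:
        res = ec2.create_volume(AvailabilityZone=availability_zone, Size=size_gb,
                                VolumeType="gp2")
        volume_id = res["VolumeId"]
        # ... attach, format, mount, then run the actual job ...
        work(volume_id)
    finally:
        if volume_id is not None:
            try:
                ec2.detach_volume(VolumeId=volume_id, Force=True)
            except Exception as exc:
                print("Detach failed (volume may not be attached):", exc)
            # A real implementation would wait for the volume to reach the
            # "available" state before deleting it.
            try:
                ec2.delete_volume(VolumeId=volume_id)
            except Exception as exc:
                print("Delete failed for", volume_id, ":", exc)
```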
Closing due to inactivity.
Hello @kislyuk, apologies for the delay; I haven't been running many jobs since my last comment and hence could not test the updates. Recently, however, I started noticing an increase in orphaned volumes again. It seems that if there is an error during execution, aegea catches the exception and tries to detach the correct volume, but is unable to, and eventually leaves an orphaned volume behind with the message:
2019-10-09 22:36:57+00:00 Detaching EBS volume vol-0fc487d57716a4ac2
2019-10-09 22:37:07+00:00 Traceback (most recent call last):
2019-10-09 22:37:07+00:00 File "/opt/aegea-venv/bin/aegea", line 23, in <module>
2019-10-09 22:37:07+00:00 aegea.main()
2019-10-09 22:37:07+00:00 File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/__init__.py", line 89, in main
2019-10-09 22:37:07+00:00 result = parsed_args.entry_point(parsed_args)
2019-10-09 22:37:07+00:00 File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/ebs.py", line 186, in detach
2019-10-09 22:37:07+00:00 Force=args.force)
2019-10-09 22:37:07+00:00 File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 357, in _api_call
2019-10-09 22:37:07+00:00 return self._make_api_call(operation_name, kwargs)
2019-10-09 22:37:07+00:00 File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 661, in _make_api_call
2019-10-09 22:37:07+00:00 raise error_class(parsed_response, operation_name)
2019-10-09 22:37:07+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DetachVolume operation (reached max retries: 4): Request limit exceeded.
Here is the detailed log and the command passed to the job (let me know if you need an unedited version). aegea version: 2.7.9
For comparison, here is a similar example in v2.6.11 where it seems to have worked: command, log
I suspect this has to do with concurrently launching many jobs, but I'll need to look at the logs more closely or experiment to confirm that.
Short of staggering the job launch, the main workaround for RequestLimitExceeded that I can think of is to change the botocore config to increase retries.
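For example (a sketch assuming boto3/botocore; the retry count is arbitrary):

```python
# Hypothetical sketch: raise the client-side retry budget so bursts of
# concurrent Attach/DetachVolume calls get retried with backoff instead of
# surfacing RequestLimitExceeded after 4 attempts.
import boto3
from botocore.config import Config

ec2 = boto3.client("ec2", config=Config(retries={"max_attempts": 10}))
```

Increasing retries only papers over the throttling; staggering job launches reduces the underlying request rate.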
With that said, I encourage you to stop using EBS volumes for your jobs and to use instance-attached storage instead. Going forward, I intend to focus on supporting only instance storage as scratch space. Is there a specific reason preventing you from using instance storage for scratch space via the options I described above?
@kislyuk I haven't tested instance-attached storage yet. I guess my concern was the limited instance family support and the need to set up another compute environment with these instance families (we are nearing our hard limit of environments).
Neither of these is a problem we can't overcome, so if that's where aegea is headed, then I'm happy to update my environment.
Closing this - please reopen if you still need support.
I tried launching over 1500 jobs using aegea batch submit. In some cases, it resulted in the following error:

Since this error occurs right after the volume has been created, a tagging failure left uncaught simply kills the instance but leaves an orphaned volume (or, in my case, 90 of them!) behind.