trel opened this issue 4 years ago
I think the `get()` at the end of `create()` is getting tripped up by being inside the AWS fabric and firing before the earlier `COLL_CREATE_AN` has a chance to 'settle'.
We can work around this scenario by catching the rare, but real, `CollectionDoesNotExist` exception and retrying a small number of times, perhaps with backoff. The lambda would then not proceed until it got the 'go-ahead' that the collection of interest already exists in the iRODS catalog.
Something like...
```python
import time

from irods.exception import CollectionDoesNotExist

try:
    session.collections.create(irods_collection_name, recurse=True)
except CollectionDoesNotExist as e:
    print('caught CollectionDoesNotExist, retrying...')
    retries = 4
    sleep_time = 1.0           # first delay, in seconds
    backoff_multiplier = 1.2   # each later delay is 20% longer
    collection_created = None  # stays None if every retry fails
    for retry_number in range(1, retries + 1):
        print('retry [{}] ... sleeping for [{}]'.format(retry_number, sleep_time))
        time.sleep(sleep_time)
        try:
            # only repeat the lookup; assumes the create request itself went through
            collection_created = session.collections.get(irods_collection_name)
            break
        except CollectionDoesNotExist:
            sleep_time *= backoff_multiplier
    if collection_created is None:
        print('session.collections.create retried and still failed...')
        raise e
except Exception as e:
    # any other error is only logged here, as in the original handler
    print(e)
```
I will create a PR for easier commenting and review.
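The retry loop above could also be factored into a small reusable helper. This is only a sketch, not part of the proposed PR: the name `retry_with_backoff` and its parameters are hypothetical, and `sleep` is injectable so the backoff schedule can be checked without actually waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    retries: int = 4,
    delay: float = 1.0,
    multiplier: float = 1.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on an exception listed in retry_on, sleep and call it again.

    The first retry waits `delay` seconds and each later retry waits
    `multiplier` times longer than the previous one. If the last of the
    `retries` attempts still fails, its exception is re-raised.
    Exceptions not listed in retry_on propagate immediately.
    """
    try:
        return fn()
    except retry_on:
        if retries <= 0:
            raise
    wait = delay
    for attempt in range(1, retries + 1):
        sleep(wait)
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise
            wait *= multiplier
```

In the Lambda this would wrap only the lookup after a `create()` that raised, e.g. `retry_with_backoff(lambda: session.collections.get(irods_collection_name), (CollectionDoesNotExist,))`. With the defaults the waits are 1.0, 1.2, 1.44 and 1.728 seconds, about 5.4 seconds in total, which is longer than Lambda's 3-second default timeout, so the function's timeout may need raising.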
> I think the `get()` at the end of `create()` is getting tripped up by being inside the AWS fabric and firing before the earlier `COLL_CREATE_AN` has a chance to 'settle'.
But wouldn't that make it impossible to trust Python code in AWS Lambda? The API request for the create operation should be complete by the time its response is received.
Agreed. And yet… here we are, seeing failures in the `get()`.
And remember, there is a network call to iRODS in there, serviced by who knows what fabric in the middle.
In fact, there are two network calls: the mkdir API call, and then the query API call to 'see' the newly created collection.
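To make the suspected race concrete, here is a toy model of `create()` as the two round trips described above: the create request, then a query to read the new collection back. `LaggyCatalog` and its `lag` parameter are hypothetical, invented only to simulate a catalog view that has not yet 'settled', and the exception class is a local stand-in for the client library's `CollectionDoesNotExist`. Nothing here claims iRODS is actually implemented this way.

```python
from typing import Dict, Set


class CollectionDoesNotExist(Exception):
    """Local stand-in for the client library's exception of the same name."""


class LaggyCatalog:
    """Hypothetical catalog whose query view lags behind its writes.

    A path created with mkdir() only becomes visible to query() after
    `lag` further queries have been made for it.
    """

    def __init__(self, lag: int) -> None:
        self.lag = lag
        self._pending: Dict[str, int] = {}  # path -> queries left before visible
        self._visible: Set[str] = set()

    def mkdir(self, path: str) -> None:
        # first round trip: the create request itself succeeds
        self._pending[path] = self.lag

    def query(self, path: str) -> bool:
        # second round trip: try to read the new collection back
        if path in self._pending:
            if self._pending[path] == 0:
                del self._pending[path]
                self._visible.add(path)
            else:
                self._pending[path] -= 1
        return path in self._visible


def create(catalog: LaggyCatalog, path: str) -> str:
    """create() modelled as mkdir followed by a lookup of the result."""
    catalog.mkdir(path)
    if not catalog.query(path):
        raise CollectionDoesNotExist(path)
    return path
```

In this model a failed `create()` has still created the collection, and simply re-running the query succeeds once the view catches up, which is why the workaround retries the lookup rather than the create.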
Sometimes the recursive call that makes sure the parent collection of the about-to-be-registered S3 data object exists raises `CollectionDoesNotExist` from its `get()` call.
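For reference, the parent collection in question is the data object's logical path with its last component removed, and a recursive create has to make sure every ancestor of that collection exists, much like `mkdir -p`. A small sketch under that assumption; the helper names `parent_collection` and `ancestors` are hypothetical and are not taken from the linked Lambda code.

```python
import posixpath
from typing import List


def parent_collection(logical_path: str) -> str:
    """Collection that must exist before a data object can be registered at logical_path."""
    return posixpath.dirname(logical_path.rstrip("/"))


def ancestors(collection: str) -> List[str]:
    """Every collection from the top of the path down to `collection`,
    i.e. what a recursive (mkdir -p style) create must ensure exists."""
    parts = [p for p in collection.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
```

For example, with an illustrative path, `parent_collection('/tempZone/home/rods/bucket/a/b.txt')` gives `'/tempZone/home/rods/bucket/a'`, and it is a `get()` on a collection like that one which intermittently fails.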
I have not been able to reproduce this behavior outside the Lambda environment.
https://github.com/irods/irods_client_aws_lambda_s3/blob/9f6706443f02019c2796c041c8648d62b13d6ec8/irods_client_aws_lambda_s3.py#L96-L100