irods / irods_client_aws_lambda_s3


investigate intermittent failure of session.collections.create() #7

Open trel opened 4 years ago

trel commented 4 years ago

Sometimes the recursive create() call, which makes sure the parent collection of the about-to-be-registered S3 data object exists, raises CollectionDoesNotExist in its get() call.

Have not been able to replicate this behavior outside the Lambda environment.

https://github.com/irods/irods_client_aws_lambda_s3/blob/9f6706443f02019c2796c041c8648d62b13d6ec8/irods_client_aws_lambda_s3.py#L96-L100

trel commented 2 months ago

I think the get() at the end of create() is getting tripped up by being inside the AWS fabric and firing before the earlier COLL_CREATE_AN has a chance to 'settle'.

https://github.com/irods/python-irodsclient/blob/86a8f11a9399db29774c4096d83bba733a024ab6/irods/manager/collection_manager.py#L46

We can work around this scenario by catching the rare, but real, CollectionDoesNotExist exception and performing a small number of retries, perhaps with backoff.

Then, the lambda would not proceed until it got the 'go-ahead' that the Collection of interest does already exist in the iRODS catalog.

Something like...

                    # requires: import time; from irods.exception import CollectionDoesNotExist
                    try:
                        session.collections.create(irods_collection_name, recurse=True)
                    except CollectionDoesNotExist as e:
                        # create() ends with a get(); retry that get() with backoff
                        print('caught CollectionDoesNotExist, retrying...')
                        retries = 4
                        delay_in_seconds = 1.0
                        backoff_multiplier = 1.2
                        collection_created = None  # stays None if every retry fails
                        sleep_time = delay_in_seconds
                        for retry_number in range(1, retries + 1):
                            print('retry [{}] ... sleeping for [{}]'.format(retry_number, sleep_time))
                            time.sleep(sleep_time)
                            try:
                                collection_created = session.collections.get(irods_collection_name)
                                break
                            except CollectionDoesNotExist:
                                # still not visible; wait a little longer next time
                                sleep_time = sleep_time * backoff_multiplier
                        if collection_created is None:
                            print('session.collections.create retried and still failed...')
                            raise e
                    except Exception as e:
                        print(e)

Will create a PR for easier commenting/review.
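
For review purposes, the same retry-with-backoff idea can also be pulled out into a small helper. This is only a sketch: `get_with_backoff` and its parameters are hypothetical names, not part of python-irodsclient, and the injectable `sleep` argument is an assumption added so the logic can be exercised without real waits.

```python
import time


def get_with_backoff(get, retries=4, delay_in_seconds=1.0,
                     backoff_multiplier=1.2, retry_on=(Exception,),
                     sleep=time.sleep):
    """Call get() until it succeeds or the retries are exhausted.

    Sleeps before every attempt, starting at delay_in_seconds and
    multiplying the delay by backoff_multiplier after each failure.
    Re-raises the last caught exception if no attempt succeeds.
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    sleep_time = delay_in_seconds
    last_error = None
    for retry_number in range(1, retries + 1):
        sleep(sleep_time)
        try:
            return get()
        except retry_on as e:
            last_error = e
            sleep_time *= backoff_multiplier
    raise last_error


# Hypothetical usage inside the Lambda handler (assumes an existing
# session and irods_collection_name, as in the snippet above):
#
#   from irods.exception import CollectionDoesNotExist
#   coll = get_with_backoff(
#       lambda: session.collections.get(irods_collection_name),
#       retry_on=(CollectionDoesNotExist,))
```

With the defaults above the helper waits 1.0, 1.2, 1.44, and 1.728 seconds before its four attempts, matching the inline version.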

korydraughn commented 1 month ago

> I think the get() at the end of create() is getting tripped up by being inside the AWS fabric and firing before the earlier COLL_CREATE_AN has a chance to 'settle'.

But wouldn't that make it impossible to trust Python code in AWS Lambda? The API request for the create op is supposed to be complete by the time the response is captured.

trel commented 1 month ago

Agreed. And yet… here we are seeing failures for the get().

trel commented 1 month ago

and remember, there is a network call to iRODS in there... serviced by... who knows what fabric in the middle...

in fact, two network calls - the mkdir API call, and then the query API to 'see' the newly created collection.
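
To make the two-call sequence concrete, here is a toy model of the hypothesis. It is not iRODS code: `LaggyCatalog`, `mkdir`, `query`, and `lag_in_reads` are all made-up names, and the visibility lag is purely an assumption. Whether anything like it actually happens between Lambda and the iRODS catalog is exactly what remains unconfirmed in this issue.

```python
class LaggyCatalog:
    """Toy stand-in for a catalog whose writes may become visible late."""

    def __init__(self, lag_in_reads):
        # Number of queries that miss a new path before it becomes visible.
        self.lag_in_reads = lag_in_reads
        self._misses_left = {}  # path -> remaining queries that will miss

    def mkdir(self, path):
        # First "network call": record the new collection.
        self._misses_left[path] = self.lag_in_reads

    def query(self, path):
        # Second "network call": look the collection up again.
        remaining = self._misses_left.get(path)
        if remaining is None:
            raise LookupError(path)  # never created
        if remaining > 0:
            self._misses_left[path] = remaining - 1
            raise LookupError(path)  # created, but not visible yet
        return path


def create(catalog, path):
    """Mimic the create-then-get shape: write, then immediately read back."""
    catalog.mkdir(path)
    return catalog.query(path)
```

With `lag_in_reads=0` the read-back in `create()` succeeds immediately. With `lag_in_reads=1` it raises on the first read and a later `query()` succeeds, which is the pattern the retry workaround is meant to absorb.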