irods / irods_resource_plugin_s3

S3-compatible storage resource plugin for iRODS

Possible issue with retries on a put to S3 when using the iput -X flag. #2203

Open JustinKyleJames opened 4 months ago

JustinKyleJames commented 4 months ago

I saw that there were S3 errors when someone did an iput -r -X to an S3 resource.

I did a little testing where I created a directory of ten 100 MiB files, then recursively put that directory to an S3 resource. If I left it alone, it passed.

However, if I pressed control-C during the first iput and then restarted it (with the -f flag), I would sometimes get an S3 error. If I kept trying, it would eventually pass.
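For anyone who wants to try this, the setup described above can be sketched roughly as follows. This is only a sketch under assumptions: the helper names, the resource name, and the restart file name are placeholders and not taken from the original test, and the files are created with `truncate()`, so they are zero-filled rather than real data. Note that `-X` is passed a restart file name.

```python
import os


def make_test_tree(root, n_files=10, size_bytes=100 * 1024 * 1024):
    """Create n_files files of size_bytes each under root; return their paths."""
    os.makedirs(root, exist_ok=True)
    paths = []
    for i in range(n_files):
        path = os.path.join(root, f"file_{i:02d}.bin")
        with open(path, "wb") as f:
            # Extends the empty file to the requested size (zero-filled).
            f.truncate(size_bytes)
        paths.append(path)
    return paths


def iput_command(local_dir, resource, restart_file, force=False):
    """Build the argument list for a recursive iput with a restart file.

    'resource' and 'restart_file' are placeholder values supplied by the caller.
    """
    cmd = ["iput", "-r", "-X", restart_file, "-R", resource]
    if force:
        # -f is only added for the second attempt, after the interrupted one.
        cmd.append("-f")
    cmd.append(local_dir)
    return cmd
```

Running the command built by `iput_command("testdir", "s3_resc", "restart.txt")`, interrupting it with control-C, and then running the `force=True` variant approximates the sequence described above.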

My working theory is that the S3 plugin must store some state in shared memory, that this shared memory didn't get cleaned up on exit, and that the retry then saw an inconsistent state.

The shared memory logic does have a timeout so once that timeout expires the memory will be flushed. That could be why it eventually passes.
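Here is a toy model of that theory, to show why a retry could fail at first and then pass once the timeout has elapsed. This is not the plugin's code: the class, its methods, and the per-key bookkeeping are invented for illustration, and the real shared memory logic may look quite different.

```python
class SharedStateTable:
    """Hypothetical stand-in for per-object state kept in shared memory."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._entries = {}  # key -> time the entry was created

    def begin(self, key, now):
        """Start a transfer. Return False if unexpired leftover state exists."""
        created = self._entries.get(key)
        if created is not None and now - created <= self.timeout_seconds:
            # An earlier transfer never cleaned up and the timeout has not
            # expired yet, so the new transfer sees inconsistent state.
            return False
        # No entry, or the old entry is stale: start from a clean slate.
        self._entries[key] = now
        return True

    def end(self, key):
        """Normal completion removes the entry."""
        self._entries.pop(key, None)


table = SharedStateTable(timeout_seconds=60)
first = table.begin("obj", now=0)         # transfer starts, then is interrupted
early_retry = table.begin("obj", now=10)  # retry before the timeout
late_retry = table.begin("obj", now=61)   # retry after the timeout
```

In this model the interrupted transfer never calls `end()`, so `early_retry` is rejected while `late_retry` succeeds, which matches the "eventually passes" behavior.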

Some notes:

  1. This is just an educated guess at this point. I never actually reproduced the exact error (timeout reading circular buffer on part upload) that the user was seeing.
  2. I also saw failures when retrying after a control-C with iput -X to a unixfilesystem resource. One such failure was due to a file being left in intermediate state.

alanking commented 4 months ago

On note 2, did the data object in question eventually come out of the intermediate state due to the agent timing out and tearing itself down?

JustinKyleJames commented 4 months ago

> On note 2, did the data object in question eventually come out of the intermediate state due to the agent timing out and tearing itself down?

It didn't but maybe I didn't wait long enough?

alanking commented 4 months ago

Okay, if possible, please confirm that the object is not stuck. If it is stuck, this is a situation we need to examine for the core code because that should not happen no matter what.

JustinKyleJames commented 4 months ago

> Okay, if possible, please confirm that the object is not stuck. If it is stuck, this is a situation we need to examine for the core code because that should not happen no matter what.

Well now I can't reproduce it. ;-)

trel commented 4 months ago

Pressing ctrl-c during an iput should send the signal to the server, which should then 'finalize'... but if the client didn't get a chance to send anything, the server will wait for the timeout before finalizing.

or we have a bug, like alan said.