irods / irods_client_globus_connector

The iRODS Globus Connector

Status info on the globus interface for transfers to/from irods becomes completed quite later than actual complete #105

Open mstfdkmn opened 1 month ago

mstfdkmn commented 1 month ago

When we transfer data to iRODS (or from iRODS to a non-iRODS endpoint), the status information (Condition) in the Globus interface is updated quite late. The total duration reported in the Activity tab is also inconsistent across transfers of the same file: for example, a 3 GB file transfer completed in 2.5 minutes on one try and in 9.5 minutes on another. This does not happen for transfers between non-iRODS endpoints (which do not go through the Globus connector). Does the connection have some kind of sticky session, or is the connector unable to close it immediately? And could this affect iRODS performance or add CPU load on the iRODS server?

An example for a 1 GB file transfer to iRODS:

The file is written and the status of the object becomes "good" in less than one minute in iRODS:

image

But the Globus interface shows something different:

image

trel commented 1 month ago

So you're suggesting that the iRODS/Globus connection is 'holding on' too long - and making Globus think the transfer hasn't yet completed, and then its 'completed' timestamp is 'later' than it needs to be...

But then, you also mention inconsistency in this 'later' amount? Or rather that a 3GB file itself takes varying amounts of time to transfer? Is that a different issue than the offset 'completed' time?

mstfdkmn commented 1 month ago

> So you're suggesting that the iRODS/Globus connection is 'holding on' too long - and making Globus think the transfer hasn't yet completed, and then its 'completed' timestamp is 'later' than it needs to be...

Yes, it looks exactly like that.

> But then, you also mention inconsistency in this 'later' amount? Or rather that a 3GB file itself takes varying amounts of time to transfer? Is that a different issue than the offset 'completed' time?

The inconsistency is in the task duration shown in the Globus interface: the 'completed' timestamps reported by Globus differ from run to run. It says nothing about the actual transfer time, which is always more or less the same. So I think it is part of the same issue, but I have no idea under which conditions the reported duration varies (I ran the tests consecutively and got different results).

mstfdkmn commented 1 month ago

Following a suggestion from one of my colleagues, I tested again with checksum verification disabled (the option to verify file integrity turned off), and the completion time looked normal. So checksum calculation seems to be a reason for this late close.

trel commented 1 month ago

> So checksum calculation seems to be reason for this late close.

Oh! So Globus is doing its own checksumming and it takes some amount of time, and that time is the 'offset' that you're seeing between the iRODS completion time and the Globus completion time.

This makes sense, and seems to suggest that there is no action necessary? If no code changes... would it be helpful to have this behavior documented somewhere?

mstfdkmn commented 1 month ago

Related to #102? That could probably help to mitigate the problem (unless there are also other factors at play)?

trel commented 1 month ago

I would guess that #102 is very related. The iRODS client, the Globus connector, is the one doing the client-side calculation... and that would explain the delay you are seeing in the 'completed' timestamp.

I would assume that making the calculation more efficient/fast would reduce the delay.
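
As a back-of-the-envelope illustration of that point (not a measurement from this issue): if the post-upload checksum pass is dominated by reading the file back, its duration is roughly the file size divided by the read throughput. The function name and the throughput figure below are hypothetical and only serve to show the relationship.

```python
def estimated_checksum_delay(size_bytes, read_bytes_per_second):
    """Rough estimate, in seconds, of a read-back checksum pass.

    Assumes the pass is limited by read throughput and ignores hashing
    cost, network latency, and caching. This is a simplification.
    """
    if read_bytes_per_second <= 0:
        raise ValueError("read_bytes_per_second must be positive")
    return size_bytes / read_bytes_per_second


# Hypothetical numbers: a 1 GB file read back at 50 MB/s.
print(estimated_checksum_delay(1_000_000_000, 50_000_000))  # 20.0
```

Under this simple model, doubling the effective read or hashing rate halves the gap between the iRODS completion time and the Globus 'completed' timestamp.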

trel commented 1 month ago

> would it be helpful to have this behavior documented somewhere?

Where do you think would be the best place to document this behavior?

mstfdkmn commented 1 month ago

> This makes sense, and seems to suggest that there is no action necessary? If no code changes... would it be helpful to have this behavior documented somewhere?

Yes. The only other thing would be to get an idea of the possible impact of this on iRODS.

> Where would you think a good/best place to document this behavior?

The README of the connector repo?

JustinKyleJames commented 1 month ago

> > So checksum calculation seems to be reason for this late close.

> Oh! So Globus is doing its own checksumming and it takes some amount of time, and that time is the 'offset' that you're seeing between the iRODS completion time and the Globus completion time.

> This makes sense, and seems to suggest that there is no action necessary? If no code changes... would it be helpful to have this behavior documented somewhere?

This is true. We have to read the entire file after upload to do the checksum. Maybe at some point we could add a checksum API to iRODS so that this can be done on the server side.

The reason we can't use iRODS's own checksum feature is the following:

  1. Globus wanted us to support checksums other than the two iRODS currently supports.
  2. The checksum in iRODS is a global setting. Globus clients choose the checksum that they want to use.
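
To illustrate why this costs time, here is a minimal Python sketch of a client-side checksum where the caller picks the algorithm and the whole file is read back in chunks. It is not the connector's actual code; the function name, the chunk size, and the algorithm names (whatever Python's `hashlib` accepts) are assumptions made for the example.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`.

    The algorithm is chosen per call (as a Globus client would choose it),
    and the file is read in full, one chunk at a time, so the cost grows
    with the file size.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because every byte has to be read again after the upload finishes, the time spent here shows up as the gap between the object being written in iRODS and the task being reported as completed in Globus.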

JustinKyleJames commented 1 month ago

> > This makes sense, and seems to suggest that there is no action necessary? If no code changes... would it be helpful to have this behavior documented somewhere?

> Yes, the only other thing is to have an idea of possible impacts of this on irods.

> > Where would you think a good/best place to document this behavior?

> Readme of the connector repo?

I could update the README to state this.

trel commented 1 month ago

Thanks Justin.

JustinKyleJames commented 1 month ago

I updated the README. See #107.