irods / irods_client_globus_connector

The iRODS Globus Connector

`pep_api_data_obj_close_post` invoked more than once #106

Open mstfdkmn opened 3 months ago

mstfdkmn commented 3 months ago

We have a PEP rule that should be triggered by Globus writes in order to implement a policy (extract metadata and attach it to the object in a specific schema). What I observe is that pep_api_data_obj_close_post is triggered twice when a 3.2 GB file is transferred to iRODS. We have a flat resource hierarchy, by the way.

It looks related to this. So the close PEP is somehow called twice.

trel commented 3 months ago

The original logs in that thread (https://github.com/irods/irods_client_globus_connector/issues/84#issuecomment-2043145273) do show two CLOSEs for the same PID... which is surprising.

mstfdkmn commented 3 months ago

Tested this again... I observe that pep_api_data_obj_close_post is invoked only once for transfers done without checksum enabled. For transfers with checksum enabled, pep_api_data_obj_close_post fires first after the data is written and closed in iRODS, and a second time when the Globus connection completes (checksum completed). So, hypothetically: could it be that checksumming opens and closes the object again?

alanking commented 3 months ago

Ah, good observation. I think your assessment is correct. Checksums are calculated in this project via open/read/close, as seen here: https://github.com/irods/irods_client_globus_connector/blob/6be085cad8530f8ec5c29ca1006c90a3d1633917/DSI/globus_gridftp_server_iRODS.cpp#L1647-L1683

Took that snippet from this issue, which is related: https://github.com/irods/irods_client_globus_connector/issues/102

Does that fully explain this, then?
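To illustrate the open/read/close explanation above, here is a minimal toy model in Python. It is not the connector's code: `FakeServer`, `transfer_to_server`, and the `close_post_calls` list are hypothetical names that only mimic "every close fires a close PEP". SHA-256 is used purely for illustration.

```python
import hashlib


class FakeServer:
    """Toy stand-in for a server that fires a hook after every close."""

    def __init__(self):
        self.objects = {}            # logical path -> stored bytes
        self.close_post_calls = []   # one (path, mode) entry per simulated close PEP

    def write_object(self, path, data):
        # Simulates open(write) / write / close.
        self.objects[path] = bytes(data)
        self._close_post(path, "write")

    def read_object(self, path):
        # Simulates open(read) / read / close.
        data = self.objects[path]
        self._close_post(path, "read")
        return data

    def _close_post(self, path, mode):
        # Stand-in for the close PEP: it fires on every close, read or write.
        self.close_post_calls.append((path, mode))


def transfer_to_server(server, path, data, verify_checksum):
    """Upload data; optionally re-read it afterwards to compute a checksum."""
    server.write_object(path, data)
    if verify_checksum:
        # Checksumming via a separate open/read/close triggers a second close.
        return hashlib.sha256(server.read_object(path)).hexdigest()
    return None
```

In this model, a transfer without checksum records one close, while a transfer with checksum records two (one for the write, one for the checksum read), which matches the behavior reported in this issue.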

trel commented 3 months ago

This explains the additional OPEN as well. Yes?

alanking commented 3 months ago

I think it does, yes. I defer to @JustinKyleJames

JustinKyleJames commented 3 months ago

pep_api_data_obj_close_post

Yes, that would explain it.

alanking commented 3 months ago

Excellent. Should we link this issue in the README update in #107 as well? Feels like it is explained in there.

trel commented 3 months ago

I think so.

mstfdkmn commented 3 months ago

That is, if I followed everything correctly, it is not possible to "read" for checksumming during the open/write/close when checksum is enabled. Is this correct? I am asking because, if it were possible, the policy would always fire only once; with checksum enabled it would just fire a bit later (after the whole upload completed).

JustinKyleJames commented 3 months ago

> That is, if I followed everything correctly, it is not possible to "read" for checksumming during the open/write/close when checksum is enabled. Is this correct? I am asking because, if it were possible, the policy would always fire only once; with checksum enabled it would just fire a bit later (after the whole upload completed).

It might be possible, but I am not sure we get the bytes in exact sequence from the client. In addition, the writing is fanned out to multiple threads. I would have to investigate whether it is possible to calculate the checksum on the fly under those circumstances.
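One way to picture the difficulty: a sequential hash must see bytes in offset order, so out-of-order chunks would need to be held back until the gap before them is filled. The sketch below is a hypothetical illustration in Python, not something the connector does; `InOrderHasher` and its methods are made-up names, and it assumes chunks arrive as non-overlapping `(offset, data)` pairs.

```python
import hashlib
import threading


class InOrderHasher:
    """Buffer out-of-order chunks and feed them to a sequential hash in offset order."""

    def __init__(self, algorithm="sha256"):
        self._hash = hashlib.new(algorithm)
        self._next_offset = 0       # offset of the next byte the hash expects
        self._pending = {}          # offset -> chunk waiting for earlier bytes
        self._lock = threading.Lock()

    def add(self, offset, data):
        # Safe to call from several writer threads.
        with self._lock:
            self._pending[offset] = bytes(data)
            # Drain every chunk that is now contiguous with what was already hashed.
            while self._next_offset in self._pending:
                chunk = self._pending.pop(self._next_offset)
                self._hash.update(chunk)
                self._next_offset += len(chunk)

    def hexdigest(self):
        with self._lock:
            if self._pending:
                raise ValueError("gaps remain; not all chunks were contiguous")
            return self._hash.hexdigest()
```

The result equals the ordinary digest of the whole file, but in the worst case (the first chunk arriving last) nearly the entire file sits in memory, which is one reason this may not be practical for large transfers.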

korydraughn commented 3 months ago

Should we investigate whether there's a checksum algorithm that works for out-of-order reads/writes?
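As one possible direction for the question above, a "hash of block hashes" scheme tolerates out-of-order arrival: each fixed-size block is hashed independently (by whichever thread handles it), and the block digests are combined in index order at the end. This is only a sketch under assumptions: `BlockChecksum` is a hypothetical name, the block size is arbitrary, and the resulting value is not the same as a plain SHA-256 of the file, so it would not be comparable with checksums computed sequentially elsewhere.

```python
import hashlib


class BlockChecksum:
    """Order-independent checksum: hash each block, then hash the ordered digests."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._digests = {}  # block index -> SHA-256 digest of that block

    def add_block(self, index, data):
        # Blocks may be added in any order; only the index matters.
        if len(data) > self.block_size:
            raise ValueError("block larger than block_size")
        self._digests[index] = hashlib.sha256(data).digest()

    def hexdigest(self):
        # Combine the per-block digests in index order, binding each to its index.
        top = hashlib.sha256()
        for index in sorted(self._digests):
            top.update(index.to_bytes(8, "big"))
            top.update(self._digests[index])
        return top.hexdigest()
```

For example, adding blocks 2, 0, 1 yields the same final value as adding 0, 1, 2, because only the final combination step depends on order. The trade-off is that both ends of a transfer would need to agree on the scheme and the block size.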