Computation of checksums can be improved

peterverraedt commented 3 months ago

Checksum computation for large files is currently implemented using a single iRODS connection and a read buffer of 1M. See https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1647-L1683 for the relevant code.

For a 5G file, computing the checksum with a 1M read buffer takes about 25 seconds in our setup. With a 32M read buffer, the same file takes about 16 seconds.

For a 107G file, we get 8m29s for 1M and 5m48s for 32M.
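To put the four timings on a common scale, they can be converted to average throughput. This is only a back-of-the-envelope sketch: it assumes "5G" and "107G" mean GiB, which the report does not state, and `throughput_mib_s` is a helper name made up for this example.

```python
# Rough throughput implied by the reported timings.
# Assumption: "5G" and "107G" mean GiB; the report does not say.
MIB_PER_GIB = 1024


def throughput_mib_s(size_gib, seconds):
    """Average read-and-hash rate in MiB/s."""
    return size_gib * MIB_PER_GIB / seconds


timings = {
    # (file size in GiB, read buffer): elapsed seconds
    (5, "1M"): 25,
    (5, "32M"): 16,
    (107, "1M"): 8 * 60 + 29,
    (107, "32M"): 5 * 60 + 48,
}

for (size_gib, buf), secs in timings.items():
    rate = throughput_mib_s(size_gib, secs)
    print(f"{size_gib:>4} GiB, {buf:>3} buffer: {rate:6.1f} MiB/s")
```

Under that assumption, both file sizes go from roughly 205 to 215 MiB/s with the 1M buffer to roughly 315 to 320 MiB/s with the 32M buffer, a speedup of about 1.5x in each case.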

Our setup is a MySQL-backed iRODS server with a GPFS file system, Python rulesets, the audit plugin enabled, and TLS enforced.

I suggest at least increasing the read buffer (https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1647) to 32M. Higher values no longer speed up a single-stream read.
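For illustration, here is a minimal Python sketch of a chunked checksum loop with a tunable read buffer. It is not the connector's C++ code: `stream_checksum` and its parameters are hypothetical, and an in-memory stream stands in for the iRODS data object read. It shows that the buffer size changes only how many reads are issued, not the resulting digest.

```python
import hashlib
import io

MIB = 1024 * 1024


def stream_checksum(stream, algorithm="sha256", buffer_size=32 * MIB):
    """Hash a readable binary stream in chunks of buffer_size bytes."""
    digest = hashlib.new(algorithm)
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:  # empty bytes object signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in payload; a real caller would read from the data object instead.
data = b"x" * (3 * MIB + 17)
small_buffer = stream_checksum(io.BytesIO(data), buffer_size=1 * MIB)
large_buffer = stream_checksum(io.BytesIO(data), buffer_size=32 * MIB)
print(small_buffer == large_buffer)
```

With a larger buffer the loop makes fewer round trips per file, which is the effect the timings in this report are measuring.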

Alternatively, checksum support could be limited to SHA256 (https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1026), with the checksum computed server side using the built-in mechanisms. This would avoid having to transfer the complete contents of the file over, in this case, a TLS connection.

trel commented 3 months ago

Perhaps we define a sliding scale / heuristic where we allocate a larger read buffer for larger files.

JustinKyleJames commented 3 months ago

> Perhaps we define a sliding scale / heuristic where we allocate a larger read buffer for larger files.

We could make this configurable with a better default.

As for computing server side and limiting to SHA256, it was my understanding that the community wanted to keep the checksums we now support, which requires client-side calculation at the moment.

korydraughn commented 3 months ago

Or we expose options and let the admin pick the buffer size(s). If one buffer size doesn't work for all cases, then we expose buffer sizes for various thresholds.

trel commented 3 months ago

exposed options, with a default set/scale if not defined, seems the most flexible.
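One way to picture "exposed options with a default scale" is a lookup from file size to read buffer size. The sketch below is hypothetical throughout: `pick_buffer_size`, `DEFAULT_SCALE`, and the intermediate 8 MiB tier and size thresholds are invented for illustration. Only the 1M starting point and the 32M upper value come from the numbers reported in this issue.

```python
import bisect

MIB = 1024 * 1024
GIB = 1024 * MIB

# Hypothetical default scale: (minimum file size, read buffer size).
# Thresholds and the 8 MiB tier are placeholders, not measured values.
DEFAULT_SCALE = [
    (0, 1 * MIB),
    (1 * GIB, 8 * MIB),
    (4 * GIB, 32 * MIB),
]


def pick_buffer_size(file_size, scale=None):
    """Return the buffer size of the highest threshold <= file_size.

    scale is an optional admin-supplied list of (threshold, buffer_size)
    pairs; when it is missing or empty, DEFAULT_SCALE is used.
    """
    entries = sorted(scale) if scale else DEFAULT_SCALE
    thresholds = [threshold for threshold, _ in entries]
    index = bisect.bisect_right(thresholds, file_size) - 1
    # Files smaller than the first threshold use the first entry.
    return entries[max(index, 0)][1]


print(pick_buffer_size(5 * GIB) // MIB)
```

A single admin-chosen buffer size is then just a one-entry scale, for example `pick_buffer_size(size, scale=[(0, 32 * MIB)])`.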

JustinKyleJames commented 2 months ago

One possibility is that we could offer a flag that indicates that we will use the checksum configured in iRODS. The flag would default to disabled. If the flag is set, the iRODS checksum (MD5 or SHA256) is returned and is not calculated on the client.

If the client requests a different checksum than is in iRODS, we would either reject the request or simply return the checksum that is in iRODS. Thoughts?
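The decision being proposed can be sketched as follows. Everything here is hypothetical: the names `resolve_checksum`, `OnMismatch`, and `ChecksumMismatch` are made up, and falling back to client-side calculation when the catalog holds no checksum is an assumption rather than something decided in this thread.

```python
from enum import Enum


class OnMismatch(Enum):
    REJECT = "reject"
    RETURN_STORED = "return_stored"


class ChecksumMismatch(Exception):
    """Requested algorithm differs from the one stored in the catalog."""


def resolve_checksum(requested, stored, compute_on_client,
                     use_irods_checksum=False,
                     on_mismatch=OnMismatch.REJECT):
    """Return an (algorithm, value) pair.

    stored is an (algorithm, value) tuple from the catalog, or None.
    compute_on_client is a callable taking the algorithm name.
    """
    requested = requested.lower()
    # Flag disabled (the default) or nothing stored: calculate on the client.
    # The "nothing stored" fallback is an assumption of this sketch.
    if not use_irods_checksum or stored is None:
        return requested, compute_on_client(requested)
    stored_algo, stored_value = stored[0].lower(), stored[1]
    if stored_algo == requested or on_mismatch is OnMismatch.RETURN_STORED:
        return stored_algo, stored_value
    raise ChecksumMismatch(
        f"requested {requested}, catalog has {stored_algo}")
```

Returning the algorithm alongside the value lets the caller see when the stored checksum was substituted for the requested one.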

korydraughn commented 2 months ago

You're saying allow the admin to tell the globus connector to use the checksum stored in the catalog if available, correct?

Is that in addition to allowing the admin to adjust buffer sizes used for calculating checksums?

JustinKyleJames commented 2 months ago

> You're saying allow the admin to tell the globus connector to use the checksum stored in the catalog if available, correct?
>
> Is that in addition to allowing the admin to adjust buffer sizes used for calculating checksums?

Possibly yes on both.

korydraughn commented 2 months ago

Sounds like a good optimization to me.

Are globus users allowed to pick the type of checksum they want to use? If so, is their choice visible in the globus connector?

irods / irods_client_globus_connector

Computation of checksums can be improved #102