Open peterverraedt opened 3 months ago
Perhaps we define a sliding scale / heuristic where we allocate a larger read buffer for larger files.
We could make this configurable with a better default.
As for the server side and limiting to SHA256: my understanding was that the community wanted to keep supporting the full set of checksums we support now, which requires client-side calculation at the moment.
Or we expose options and let the admin pick the buffer size(s). If one buffer size doesn't work for all cases, then we expose buffer sizes for various thresholds.
Exposed options, with a default set/scale when not defined, seem the most flexible.
One possibility is that we could offer a flag that indicates that we will use the checksum configured in iRODS. The flag would default to disabled. If the flag is set, the iRODS checksum (MD5 or SHA256) is returned and is not calculated on the client.
If the client requests a different checksum than is in iRODS, we would either reject the request or simply return the checksum that is in iRODS. Thoughts?
You're saying allow the admin to tell the globus connector to use the checksum stored in the catalog if available, correct?
Is that in addition to allowing the admin to adjust buffer sizes used for calculating checksums?
Possibly yes on both.
Sounds like a good optimization to me.
Are globus users allowed to pick the type of checksum they want to use? If so, is their choice visible in the globus connector?
Computation of checksum of large files is currently implemented using a single irods connection and a buffer size of 1M. See https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1647-L1683 for the relevant code.
For a 5G file, computing the checksum with a 1M read buffer takes about 25 seconds in our setup; with a 32M read buffer, the same file takes about 16 seconds.
For a 107G file, we get 8m29s for 1M and 5m48s for 32M.
Our setup is a mysql-backed irods server with a gpfs file system, with python rulesets, audit plugin enabled, and TLS enforced.
I suggest at least increasing the read buffer https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1647 to 32M. Higher values no longer speed up a single-stream read.
Alternatively, checksum support could be limited to SHA256 https://github.com/irods/irods_client_globus_connector/blob/main/DSI/globus_gridftp_server_iRODS.cpp#L1026 so that the checksum is computed server side with the built-in mechanisms. This would avoid transferring the complete contents of the file over, in this case, a TLS connection.