irods / irods_client_globus_connector

The iRODS Globus Connector
2 stars 4 forks source link

Data upload/transfers from endpoints calls create/open/write/close #84

Open mstfdkmn opened 3 months ago

mstfdkmn commented 3 months ago

As far as I see, the globus connector first creates a data object/file in irods (create/close), and later writes in said data object (open/write) and lastly closes the data object. This seems not the same as other clients (prc, httapi) behaviors which call the same APIs.

Also, if the upload happens through the personnel connect/https, then open/create order seems changing, that is, first the open is called and later somewhere create is called.

And there is no way to create an empty data object in the globus interfaces. I was always supposing that the create operation exists to create an empty object.

Custom rules that have logic with pep_api_data_obj_open_*, pep_api_data_obj_create_*, pep_api_data_obj_close_* don't easily fit how globus works. But this seems not a big deal for now.

I am wondering why the globus connector each time first creates and later opens a data object instead of opening/writing while transferring data to irods? Also, would not this have a negative impact on the speed of transfer?

JustinKyleJames commented 3 months ago

I think I understand what you are seeing. The globus connector uses the numberOfIrodsReadWriteThreads environment variable to determine the number of threads that will be used to transfer data. You can set this in /etc/gridftp.conf as follows:

...
$numberOfIrodsReadWriteThreads 1
...

The default number of threads is 3 so each upload will spawn two additional threads making three total. Each of these open the file so that it can write.

As far as I can tell we do not have the file length so we can't make a decision based on the length of the file.

For file reads, there is another environment variable called irodsParallelFileSizeThresholdBytes. If that is set the globus connector will do a GenQuery to determine the file size and then make a decision on how many read threads are used. If it is not set it just uses numberOfIrodsReadWriteThreads.

JustinKyleJames commented 3 months ago

All that said, we don't do a create/close/open. We do perform open/create if the initial data object does not exist.

Please let me know if what you are seeing does not align with either this comment or the previous comment.

If the issue is the multiple threads I'm not sure what can be done about if you still want to have multiple threads for large files.

If the issue is the open/create when the file does not exist, we might be able to change that to a single open with the O_CREAT flag.

mstfdkmn commented 2 months ago

I will try to explain my observations:

We do perform open/create if the initial data object does not exist.

If the issue is the open/create when the file does not exist, we might be able to change that to a single open with the O_CREAT flag.

Maybe I didn't get why the open/create here but I checked other clients, e.g the prc does only open (not open/create) if the initial data object does not exist. My simple impression is that I think it could be better if follows the same like other clients using the same apis in terms of always seeing same results. But no idea about the performance point of view.

I am sorry if I didn't get what is meant, also I am not an expert but I should say I checked with the following pep and it contains the file size. I mean the first thread knows already the file size I guess.

def pep_api_data_obj_open_post(rule_args, callback, rei):
    iprint(f"{rule_args[2].dataSize}", callback)
alanking commented 2 months ago

@JustinKyleJames - Close if complete. Thanks

JustinKyleJames commented 2 months ago

We have fixed the create/open to only be open with the O_CREAT flag.

As for the multiple threads and file size, I am not sure how you are getting the file size. I do not see it. Here is my dataObjInp for pep_api_data_obj_open_post operation:

DATA_OBJ_INP=create_mode=0++++data_size=0++++num_threads=0++++obj_path=/tempZone/home/user1/abc.txt++++offset=0++++open_flags=1++++opr_type=0++++replicaToken=da989057-e162-48ae-9eea-6fc38e65e03a++++resc_hier=demoResc++++selected_hierarchy=demoResc

Here is my rule:

pep_api_data_obj_open_post(*INSTANCE_NAME, *COMM, *DATA_OBJ_INP) {
    writeLine("serverLog", "DATA_OBJ_INP=*DATA_OBJ_INP")
}

@mstfdkmn, are you sure the write where you got the file size was via Globus?

As for the create, yes, we've always done only one create. Now we will be doing one open with O_CREAT flag set and then other opens for the other threads.

I'll keep this open for now until we determine whether or not Globus is providing the file size. I don't see any way it is provided and the plugin was not setting it. (It was setting it to zero.)

mstfdkmn commented 2 months ago

Himm a bit strange!. Although the PRC and globus trigger the same PEP, the amount of the serialized parameters (keys) is not the same in both calls done by the PRC and globus. Globus seems to behave differently than how the PRC does. I mean the globus call for example doesn't contain dataSize and data_size whereas the PRC does. Please have a look at my logs below for more.

And can we think this will change with your O_CREAT flag fix ?

My rule:

pep_api_data_obj_open_post(*INSTANCE_NAME, *COMM, *DATA_OBJ_INP) {
    writeLine("serverLog", "all inputs: *DATA_OBJ_INP")
    *size = *DATA_OBJ_INP.dataSize
    writeLine("serverLog", "data object size is: *size")
    *client = *COMM.option
    writeLine("serverLog", "client name is: *client")
    5000000
}

PRC call result:

[2024-04-11T07:51:41.869Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = all inputs: create_mode=0++++dataSize=28++++data_size=28++++destRescName=default++++num_threads=0++++obj_path=/ghum/home/u0137480/abc.txt++++offset=0++++openType=1++++open_flags=577++++opr_type=0++++resc_hier=default;netapp++++selObjType=dataObj++++selected_hierarchy=default;netapp\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"u0137480","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890523,"server_timestamp":"2024-04-11T07:51:41.869Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:51:41.869Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = client name is: python-irodsclient\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"u0137480","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890523,"server_timestamp":"2024-04-11T07:51:41.869Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:51:41.869Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = data object size is: 28\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"u0137480","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890523,"server_timestamp":"2024-04-11T07:51:41.869Z","server_type":"agent","server_zone":"ghum"}

globus call result:

[2024-04-11T07:55:30.926Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"unmatched key dataSize\nline 2, col 12, rule base rules-global\n    *size = *DATA_OBJ_INP.dataSize\n            ^\n\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890793,"server_timestamp":"2024-04-11T07:55:30.926Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:30.926Z][icts-p-cloud-rdm-hev-1] irods stdout | line 2, col 12, rule base rules-global
[2024-04-11T07:55:30.986Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = all inputs: create_mode=384++++data_size=0++++num_threads=0++++obj_path=/ghum/home/u0137480/alice.txt++++offset=0++++open_flags=0++++opr_type=0++++resc_hier=default;netapp++++selected_hierarchy=default;netapp\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890781,"server_timestamp":"2024-04-11T07:55:30.986Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:30.987Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"execRuleNodeRes: applyRule Failed: pep_api_data_obj_open_post with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890781,"server_timestamp":"2024-04-11T07:55:30.986Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:30.987Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"executeRuleBody: Microservice or Action assign Failed with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890781,"server_timestamp":"2024-04-11T07:55:30.986Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:31.044Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"readAndProcClientMsg:readMsgHeader error. status = -4000 status = 0 Unknown iRODS error","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890793,"server_timestamp":"2024-04-11T07:55:31.044Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:31.044Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"readAndProcClientMsg:readMsgHeader error. status = -4000 status = 0 Unknown iRODS error","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890787,"server_timestamp":"2024-04-11T07:55:31.043Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:31.732Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"agent","log_level":"error","log_message":"Agent [890787] exiting with status = -4000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890787,"server_timestamp":"2024-04-11T07:55:31.731Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:55:31.889Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"agent_factory","log_level":"error","log_message":"Agent process [890793] exited with status [1].","server_host":"ghum.irods.icts.kuleuven.be","server_pid":115,"server_timestamp":"2024-04-11T07:55:31.889Z","server_type":"agent_factory","server_zone":"ghum"}
[2024-04-11T07:55:31.738Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"agent_factory","log_level":"error","log_message":"Agent process [890787] exited with status [1].","server_host":"ghum.irods.icts.kuleuven.be","server_pid":115,"server_timestamp":"2024-04-11T07:55:31.738Z","server_type":"agent_factory","server_zone":"ghum"}
[2024-04-11T07:55:31.862Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"agent","log_level":"error","log_message":"Agent [890793] exiting with status = -4000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890793,"server_timestamp":"2024-04-11T07:55:31.862Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:06.822Z][icts-p-cloud-rdm-hev-1] rsyslogd stdout | 2024-04-11T09:56:06.822286+02:00 ghum runuser: pam_unix(runuser:session): session opened for user irods by (uid=0)
[2024-04-11T07:56:07.816Z][icts-p-cloud-rdm-hev-1] rsyslogd stdout | 2024-04-11T09:56:07.816078+02:00 ghum runuser: pam_unix(runuser:session): session closed for user irods
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | line 2, col 12, rule base rules-global
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | Apr 11 09:56:08 pid:890858 DEBUG: unmatched key dataSize
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = all inputs: create_mode=0++++data_size=0++++num_threads=0++++obj_path=/ghum/home/u0137480/alice.txt++++offset=0++++open_flags=513++++opr_type=0++++rescName=default++++resc_hier=default;netapp++++selected_hierarchy=default;netapp\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.946Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout |             ^
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout |     *size = *DATA_OBJ_INP.dataSize
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"executeRuleBody: Microservice or Action assign Failed with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.946Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"unmatched key dataSize\nline 2, col 12, rule base rules-global\n    *size = *DATA_OBJ_INP.dataSize\n            ^\n\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.947Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"execRuleNodeRes: applyRule Failed: pep_api_data_obj_open_post with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.946Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.947Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"executeRuleAction Failed for assign status = -313000 UNMATCHED_KEY_OR_INDEX","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.946Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.972Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"[rsDataObjOpen_impl:920] - [HIERARCHY_ERROR: no valid resource found for data object\n\n] [error_code=[-1803000], path=[/ghum/home/u0137480/alice.txt], hierarchy=[]","request_api_name":"DATA_OBJ_CREATE_AN","request_api_number":601,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.972Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:56:08.991Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"[rsDataObjOpen_impl:920] - [HIERARCHY_ERROR: no valid resource found for data object\n\n] [error_code=[-1803000], path=[/ghum/home/u0137480/alice.txt], hierarchy=[]","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890858,"server_timestamp":"2024-04-11T07:56:08.991Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout |             ^
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout |     *size = *DATA_OBJ_INP.dataSize
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | line 2, col 12, rule base rules-global
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | Apr 11 09:57:01 pid:890920 DEBUG: unmatched key dataSize
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"execRuleNodeRes: applyRule Failed: pep_api_data_obj_open_post with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.253Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"executeRuleAction Failed for assign status = -313000 UNMATCHED_KEY_OR_INDEX","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.253Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"unmatched key dataSize\nline 2, col 12, rule base rules-global\n    *size = *DATA_OBJ_INP.dataSize\n            ^\n\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.253Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"executeRuleBody: Microservice or Action assign Failed with status -313000","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.253Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.253Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"writeLine: inString = all inputs: create_mode=0++++data_size=0++++num_threads=0++++obj_path=/ghum/home/u0137480/alice.txt++++offset=0++++open_flags=513++++opr_type=0++++rescName=default++++resc_hier=default;netapp++++selected_hierarchy=default;netapp\n","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.252Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.270Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"[rsDataObjOpen_impl:920] - [HIERARCHY_ERROR: no valid resource found for data object\n\n] [error_code=[-1803000], path=[/ghum/home/u0137480/alice.txt], hierarchy=[]","request_api_name":"DATA_OBJ_CREATE_AN","request_api_number":601,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.270Z","server_type":"agent","server_zone":"ghum"}
[2024-04-11T07:57:01.288Z][icts-p-cloud-rdm-hev-1] irods stdout | {"log_category":"legacy","log_level":"error","log_message":"[rsDataObjOpen_impl:920] - [HIERARCHY_ERROR: no valid resource found for data object\n\n] [error_code=[-1803000], path=[/ghum/home/u0137480/alice.txt], hierarchy=[]","request_api_name":"DATA_OBJ_OPEN_AN","request_api_number":602,"request_api_version":"d","request_client_user":"u0137480","request_host":"127.0.0.1","request_proxy_user":"globus","request_release_version":"rods4.3.1","server_host":"ghum.irods.icts.kuleuven.be","server_pid":890920,"server_timestamp":"2024-04-11T07:57:01.288Z","server_type":"agent","server_zone":"ghum"}
JustinKyleJames commented 2 months ago

That is because the PRC sets the file size because it knows the file size. Unfortunately, as far as I can tell the file size is not provided from Globus so we don't set it.

trel commented 2 months ago

Can we ... ask Globus about this?

JustinKyleJames commented 2 months ago

Can we ... ask Globus about this?

Yes, I will do that.

JustinKyleJames commented 2 months ago

I did find out that if the client sets it up properly, the transfer_info->alloc_size is set to the uploaded file size. I verified that this is the case with globus-url-transfer. It is not the case if using generic ftp commands.

I have made the following changes to the code. If the following conditions are met we will only use one thread for uploads:

  1. The transfer_info->alloc_size is set (non-zero).
  2. The irodsParallelFileSizeThresholdBytes is set (non-zero).
  3. The alloc_size < irodsParallelFileSizeThresholdBytes.

In any other case we will use the numberOfIrodsReadWriteThreads setting since we either don't know the file size or have not set a threshold.

@trel @korydraughn - Does that sound acceptable or should we have a default threshold? In either case, if alloc_size is not set I think we need to keep the number of transfer threads as the file could be large - we just don't know.

So here is an example configuration in /etc/gridftp.conf:

port 2811
$LD_LIBRARY_PATH "$LD_LIBRARY_PATH:/iRODS_DSI"
$irodsConnectAsAdmin "rods"
$numberOfIrodsReadWriteThreads 3
$irodsParallelFileSizeThresholdBytes 33554432
$spOption irods_client_globus_connector
#$irodsResourceMap "/etc/resource_mapfile"
load_dsi_module iRODS
auth_level 4

allow_anonymous 1
anonymous_names_allowed user1
anonymous_user user1

In this case we use 3 threads for any transfer above 32 MiB or if we do not know the file size.

JustinKyleJames commented 2 months ago

@trel @korydraughn - One additional note to the question above. The reason we don't currently have a default threshold is because this threshold was originally only for downloads. In the download case we have to do a query to iRODS to determine the file size. An administrator may choose to not set $irodsParallelFileSizeThresholdBytes as they may feel using multiple threads on small files is better than always doing a query to the iRODS database.

This isn't really a consideration for uploads as we don't do a query for the file size. It is either provided or not. We could have a default threshold that only applies to uploads.

korydraughn commented 2 months ago

It sounds like irodsParallelFileSizeThresholdBytes has a defined purpose, which is specific to the download case. If that's true, I think we have to consider whether uploads should have a dedicated option or not.

As for the conditions you proposed in https://github.com/irods/irods_client_globus_connector/issues/84#issuecomment-2057226127, those sound fine.

Have you received a response from Globus about whether the file size is provided (and when it's provided)?

JustinKyleJames commented 2 months ago

Have you received a response from Globus about whether the file size is provided (and when it's provided)?

Yes, the info in the first paragraph here https://github.com/irods/irods_client_globus_connector/issues/84#issuecomment-2057226127 was from Globus.

JustinKyleJames commented 2 months ago

It sounds like irodsParallelFileSizeThresholdBytes has a defined purpose, which is specific to the download case. If that's true, I think we have to consider whether uploads should have a dedicated option or not.

As for the conditions you proposed in #84 (comment), those sound fine.

The reason I didn't consider making a new setting is because the existing one has a generic name and 1) not sure if we can/should rename it at this point and 2) if it isn't renamed it might be confusing.

One thought is to use the same parameter but have a default value (say, 32 MiB) in the case of uploads.

trel commented 2 months ago

One thought is to use the same parameter but have a default value (say, 32 MiB) in the case of uploads.

That might be nice. And just a documentation exercise for those who might go looking.

korydraughn commented 2 months ago

Agreed.

JustinKyleJames commented 2 months ago

One thought is to use the same parameter but have a default value (say, 32 MiB) in the case of uploads.

That might be nice. And just a documentation exercise for those who might go looking.

I made this change to the code and README.