irods / python-irodsclient

A Python API for iRODS
Other
62 stars 73 forks source link

Different operations(calls) in the same session results in the same pid #603

Closed mstfdkmn closed 3 months ago

mstfdkmn commented 3 months ago

python-irodsclient 2.0.1 irods 4.3.2 https://github.com/irods/irods_rule_engine_plugin_audit_amqp

I am not sure I am reporting on the correct repository (this might go to the irods or the audit plugin repo).

When I am interacting with irods through the PRC, in the audit plugin log I am observing the followings:

![image](https://github.com/user-attachments/assets/b381d6ef-53c5-4f6f-b590-0a76330ea2e8)

- If actions are taken in separate sessions, we see different PIDs for messages in the output of the audit plugin.

with iRODSSession(irods_env_file=env_file) as session: session.data_objects.create("/u0137480/home/mustafa/myfile_prc_new.txt") with iRODSSession(irods_env_file=env_file) as session: session.data_objects.put("test.txt", "/u0137480/home/mustafa/test.txt") with iRODSSession(irods_env_file=env_file) as session: session.data_objects.unlink("/u0137480/home/mustafa/myfile_prc_new.txt", force=True)

![image](https://github.com/user-attachments/assets/0d76c55f-3a37-4d4f-a9cf-26a2c53316b0)

- If an action that requires multiple thread is taken in a (single) PRC session, we see we see different PIDs for messages in the output of the audit plugin.

with iRODSSession(irods_env_file=env_file) as session: session.data_objects.put("100MB.bin", "/u0137480/home/mustafa/100MB.bin")


![image](https://github.com/user-attachments/assets/247ae9c4-f95e-4610-9ab9-d77e04016c95)

Are these expect behaviors? I would expect to see the same PID for multiple threads in the same action/api call in the same session, or different PIDs for different actions in the same action/call. I am a bit confused. What does the PID in the audit plugin refer - a process identifier or a thread identifier?
korydraughn commented 3 months ago

Every PRC session object manages its own pool of connections. Connections are reused to improve performance. Anytime a connection is reused, you'll see that same PID on the server side.

korydraughn commented 3 months ago

@mstfdkmn Does that answer your question?

mstfdkmn commented 3 months ago

Yes, thanks.

mstfdkmn commented 3 months ago

Hi @korydraughn, one more topic that I need your verification on. while downloading large files (size is bigger than 32MB), I am seeing pep_api_data_obj_open_post is invoked as number of times as existing threads. In the download of a 100MB file, generally three threads exist and because of that pep_api_data_obj_open_post is triggered three times. However regardless of how many times the object is opened or how many times pep_api_data_obj_open_post is triggered pep_api_data_obj_close_post is invoked only one time.

Is this related to your explanation above? That is, it is normal to see the "close pep" is triggered only one time because there is only one connection/process set but since the same connection is reused for threads to read the object we see "open pep" is triggered more than one.

korydraughn commented 3 months ago

Not quite.

If I want to download a data object in parallel, I'd use N threads, each with their own iRODS connection. Each of those threads would call open, seek, read, close. So I'd expect the number of calls to open and close to be identical.

Perhaps pep_api_replica_close_post is being invoked instead of pep_api_data_obj_close_post for some of the close operations?

but since the same connection is reused for threads to read the object we see "open pep" is triggered more than one.

So you've identified that a single iRODS connection is being reused across the same download threads?

If that's true, then there must be a lock involved to synchronize access to the iRODS connection.

@d-w-moore What do you think?

mstfdkmn commented 3 months ago

Perhaps pep_api_replica_close_post is being invoked instead of pep_api_data_obj_close_post for some of the close operations?

Yes, I can verify that pep_api_replica_close_post is triggered for the threads other than main thread that is part of the process. So, if N3 threads are used, N1 invokes pep_api_data_obj_open_post and pep_api_data_obj_close_post, and N2 and N3 invoke pep_api_data_obj_open_post and pep_api_replica_close_post. And N1 triggers pep_api_data_obj_close_post as the last one. If I try to reflect my observation, that is;

N1 obj_open/obj_read/obj_read...
N2 obj_open/obj_read/obj_read.../replica_close
N3 obj_open/obj_read/obj_read.../replica_close
N1 obj_close

So you've identified that a single iRODS connection is being reused across the same download threads?

Actually that is how I understood from your reply above. -->Every PRC session object manages its own pool of connections. Connections are reused to improve performance

korydraughn commented 3 months ago

Ah, everything makes sense now.

The calls to replica_close are to prevent the close operations from attempting to update the catalog. The normal original close operation will always attempt to update the catalog. This only matters when writing in parallel.

For reading, the use of replica_close is unnecessary, but not incorrect by any means.

mstfdkmn commented 3 months ago

Btw, in order to document the difference between clients that use the same apis here I am sharing my findings that shows how gocommands (go-irodsclient) work:

korydraughn commented 3 months ago

Yep, that's fine. That will work as long as everything is coordinated correctly. However, choosing to NOT use the rc_replica_open and rc_replica_close APIs does mean some performance is lost because the iRODS server has to do more work for the rcDataObjClose API calls and there's more network traffic.