irods / python-irodsclient

A Python API for iRODS
Other
63 stars 73 forks source link

Renaming a heavy collection fails #532

Closed mstfdkmn closed 2 months ago

mstfdkmn commented 5 months ago

Hi, when I wanted to rename (not move - copy/remove) a heavy collection (large in size and so many nested sub items), the PRC failed whereas I could rename this by imv.

Here is some information about the collection that I wanted to rename:

(venv) u0137480@CRD-L-07856:~$ iquest "select sum(DATA_SIZE) where COLL_NAME like '/kuleuven_tier1_pilot/home/datateam/testdata%'"
DATA_SIZE = 5344488114396
------------------------------------------------------------
(venv) u0137480@CRD-L-07856:~$ iquest "select count(DATA_ID) where COLL_NAME like '/kuleuven_tier1_pilot/home/datateam/t
estdata%'"
DATA_ID = 36602
------------------------------------------------------------
(venv) u0137480@CRD-L-07856:~$ iquest "select count(COLL_ID) where COLL_NAME like '/kuleuven_tier1_pilot/home/datateam/t
estdata%'"
COLL_ID = 2456
------------------------------------------------------------

Here is what I got as client:

my first attempt:

(venv) u0137480@CRD-L-07856:~$ python3 move.py
CRITICAL:irods.connection:The read operation timed out
ERROR:irods.connection:Could not receive server response
Traceback (most recent call last):
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/message/__init__.py", line 204, in _recv_message_in_len
    buf = sock.recv(size_left, socket.MSG_WAITALL)
  File "/usr/lib/python3.10/ssl.py", line 1256, in recv
    raise ValueError(
ValueError: non-zero flags not allowed in calls to recv() on <class 'ssl.SSLSocket'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 132, in recv
    msg = iRODSMessage.recv(self.socket)
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/message/__init__.py", line 294, in recv
    rsp_header_size = _recv_message_in_len(sock, 4)
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/message/__init__.py", line 206, in _recv_message_in_len
    buf = sock.recv(size_left)
  File "/usr/lib/python3.10/ssl.py", line 1259, in recv
    return self.read(buflen)
  File "/usr/lib/python3.10/ssl.py", line 1132, in read
    return self._sslobj.read(len)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/u0137480/move.py", line 12, in <module>
    session.collections.move("/kuleuven_tier1_pilot/home/datateam/testdata_newname", "/kuleuven_tier1_pilot/home/datateam/testdata")
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/manager/collection_manager.py", line 130, in move
    response = conn.recv()
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 143, in recv
    raise NetworkException("Could not receive server response")
irods.exception.NetworkException: Could not receive server response
ERROR:irods.connection:Unable to send message. Connection to remote host may have closed. Releasing connection from pool.
Exception ignored in: <function Connection.__del__ at 0x7f9930b80280>
Traceback (most recent call last):
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 108, in __del__
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 321, in disconnect
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 124, in send
irods.exception.NetworkException: Unable to send message

my next attempts:

(venv) u0137480@CRD-L-07856:~$ python3 move.py
Traceback (most recent call last):
  File "/home/u0137480/move.py", line 12, in <module>
    session.collections.move("/kuleuven_tier1_pilot/home/datateam/testdata_newname", "/kuleuven_tier1_pilot/home/datateam/testdata")
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/manager/collection_manager.py", line 130, in move
    response = conn.recv()
  File "/home/u0137480/projects/developments/mango/venv/lib/python3.10/site-packages/irods/connection.py", line 151, in recv
    raise get_exception_by_code(msg.int_info, err_msg)
irods.exception.CAT_SQL_ERR: None

Here is the iRODS logs:

[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"SQL Error message: [unixODBC][Driver Manager]Function sequence error","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"SQLCODE: 0","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"SQLSTATE: S1010","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"_cllExecSqlNoResult: SQLExecDirect error: -1 sql:update R_COLL_MAIN set coll_name = substr(coll_name,1,?) || ? || substr(coll_name, ?) where substr(parent_coll_name,1,?) = ? or parent_coll_name  = ?","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[4]=53","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[2]=/testdata","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[5]=/kuleuven_tier1_pilot/home/datateam/testdata_newname/","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"database","log_level":"info","log_message":"chlRenameObject cmlExecuteNoAnswerSql update failure -806000","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[6]=/kuleuven_tier1_pilot/home/datateam/testdata_newname","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[3]=53","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.063Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"legacy","log_level":"info","log_message":"bindVar[1]=35","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.063Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}
[2024-04-11T15:16:59.064Z][icts-p-cloud-rdm-leu-2] irods stdout | {"log_category":"database","log_level":"info","log_message":"chlRenameObject cmlExecuteNoAnswerSql(rollback) succeeded","request_api_name":"DATA_OBJ_RENAME_AN","request_api_number":627,"request_api_version":"d","request_client_user":"vsc33586","request_host":"127.0.0.1","request_proxy_user":"vsc33586","request_release_version":"rods4.3.1","server_host":"irods.hpc.kuleuven.be","server_pid":10265,"server_timestamp":"2024-04-11T15:16:59.064Z","server_type":"agent","server_zone":"kuleuven_tier1_pilot"}

This is not a move (copy/remove from source to target) but a rename (change of name - change of records in data base). Like here https://github.com/irods/python-irodsclient/blob/main/irods/test/collection_test.py#L95

Is this something expected? If yes can we know thresholds? Does the size of a collection matters or the amount of sub items in it?

Btw, our irods catalog is Mysql. Thanks.

alanking commented 5 months ago

We'll look into this soon.

Putting this in 2.1.0, but might pull it in to 2.0.1 if time allows.

d-w-moore commented 4 months ago

Hi @mstfdkmn, I know it's been a little while; but if you've still got access to this test data or can re-create these conditions on the system in question, I'd be interested in:

To answer your other question,I don't think there is an "expected" threshold for number, and I wouldn't expect total data size to matter either. It's just the coll_name and parent_coll_name being updated for each DB row describing a sub-collection. In fact with fewer than 37000 data objects and 2500 subcollections, I wouldn't have expected the server-side socket to time out either, so I'm definitely confused.

korydraughn commented 3 months ago

Please include the types of resources touched during the rename.

mstfdkmn commented 3 months ago

Please include the types of resources touched during the rename.

I am not sure I understood exactly what you meant here, but if you meant I need to specify a resource name, I guess no option/parameter available to pass the resource name to the move() method.

mstfdkmn commented 3 months ago

Hi, I reproduced this with the same data in our pilot zone (in the past this zone had the replication resource, so some of data objects used in this test have multiple replicas).

Seeing the script /home/u0137480/move.py or at very least the operative part where the error occurred.

root@33d6bc1286de:/# cat move.py
import os, os.path
from irods.session import iRODSSession

env_file = os.getenv('IRODS_ENVIRONMENT_FILE', os.path.expanduser('~/.irods/irods_environment.json'))
with iRODSSession(irods_env_file=env_file) as session:
    session.collections.move("/kuleuven_tier1_pilot/home/datateam/testdata", "/kuleuven_tier1_pilot/home/datateam/testdata_rename")

Knowing iRODS server version.

>>> session.server_version
(4, 3, 1)

Knowing the total time imv took to (successfully) rename the collection. Took about 35 minutes.

Another input: I was able to rename quickly a collection that contains a single large file (2.5TB) in another zone that doenst have the replication resource.

Unfortunately I have not been able to transfer somehow the dataset I used in our pilot zone to another zone which doesnt have replication. After I transfer this data, I will be able to test with the same data in my test to exclude the potential effect of replication.

korydraughn commented 3 months ago

Please include the types of resources touched during the rename.

I am not sure I understood exactly what you meant here, but if you meant I need to specify a resource name, I guess no option/parameter available to pass the resource name to the move() method.

I see how that can be confusing. I'll try and explain what I was thinking ...

If you rename a data object which has replicas in a unixfilesystem resource, the default behavior is to update the catalog and the replicas in storage. The s3 resource may do something similar.

So, the behavior of the rename operation depends on the storage resources involved and the server's configuration. That's why I was curious to know what resource types may have been involved in the rename.

After re-reading the issue, I don't think it's important to know about the resources due to imv succeeding.

Leaving it to @d-w-moore.

d-w-moore commented 3 months ago

We'll look into this soon.

Putting this in 2.1.0, but might pull it in to 2.0.1 if time allows.

Since we have not been able to reproduce it, I'm thinking the milestone should be 2.1.1

d-w-moore commented 3 months ago

We'll look into this soon. Putting this in 2.1.0, but might pull it in to 2.0.1 if time allows.

Since we have not been able to reproduce it, I'm thinking the milestone should be 2.1.1

564 may be related.

trel commented 3 months ago

Let's keep this here for now - it may be confirmed to be fixed / good by release time.

As 2.1.0 approaches... we can bump to 2.1.1 if we have no new information.

d-w-moore commented 3 months ago

Let's keep this here for now - it may be confirmed to be fixed / good by release time.

As 2.1.0 approaches... we can bump to 2.1.1 if we have no new information.

Agreed ... it leaves room for more progress.

d-w-moore commented 2 months ago

@mstfdkmn - though I wasn't able to reproduce, I'm pretty confident this issue has the same fix as #564. With a patch from pull request #570 (which guarantees connection_timeout applies to all connections whether new or cached) I believe all you would need to do is modify your renaming script, in your reply I've quoted below, to do this:

session.connection_timeout = None

just prior to the session.collections.move( ... )call.

If you'd like to try it on your end, I'd welcome any further input and/or news on results from this proposed experiment (especially since it's likely that I personally don't have access to any iRODS collections heavy enough to evoke the NetworkException on renaming) .

Hi, I reproduced this with the same data in our pilot zone (in the past this zone had the replication resource, so some of data objects used in this test have multiple replicas).

Seeing the script /home/u0137480/move.py or at very least the operative part where the error occurred.

root@33d6bc1286de:/# cat move.py
import os, os.path
from irods.session import iRODSSession

env_file = os.getenv('IRODS_ENVIRONMENT_FILE', os.path.expanduser('~/.irods/irods_environment.json'))
with iRODSSession(irods_env_file=env_file) as session:
    session.collections.move("/kuleuven_tier1_pilot/home/datateam/testdata", "/kuleuven_tier1_pilot/home/datateam/testdata_rename")

Knowing iRODS server version.

>>> session.server_version
(4, 3, 1)

Knowing the total time imv took to (successfully) rename the collection. Took about 35 minutes.

Another input: I was able to rename quickly a collection that contains a single large file (2.5TB) in another zone that doenst have the replication resource.

Unfortunately I have not been able to transfer somehow the dataset I used in our pilot zone to another zone which doesnt have replication. After I transfer this data, I will be able to test with the same data in my test to exclude the potential effect of replication.

mstfdkmn commented 2 months ago

Dear @d-w-moore , indeed it worked, I could rename the same collection with the fix in your branch. Thanks.

d-w-moore commented 2 months ago

@mstfdkmn Excellent ! Closing this issue.

alanking commented 2 months ago

Resolved by changes in https://github.com/irods/python-irodsclient/pull/570.