irods / irods_capability_storage_tiering

BSD 3-Clause "New" or "Revised" License

plugin doesn't trim migrated replicas after upgrade #211

Closed kript closed 10 months ago

kript commented 1 year ago

After upgrading from 4.2.7 to 4.2.11 on Ubuntu 18.04, we found that the tiering plugin creates a replica at the target tier but does not remove the replicas from the source tier. Trimming worked correctly on 4.2.7.

Packages:

$ dpkg -l | grep irod
rc  irods-database-plugin-oracle                     4.2.7                                           amd64        The integrated Rule-Oriented Data System
ii  irods-database-plugin-postgres                   4.2.11-1~xenial                                 amd64        The integrated Rule-Oriented Data System
ii  irods-externals-autoconf5ad3567c-0               1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-avro1.7.7-0                      1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-avro1.9.0-0                      1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-aws-sdk-cpp1.4.89-1              1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-boost1.60.0-0                    1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-boost1.67.0-0                    1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-catch22.3.0-0                    1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-clang-runtime3.8-0               1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-clang-runtime6.0-0               1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-clang6.0-0                       1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-cmake3.11.4-0                    1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-cppzmq4.2.3-0                    1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-cpr1.3.0-0                       1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-cpr1.3.0-1                       1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-elasticlient0.1.0-1              1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-elasticlient0.1.0-2              1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-epm4.2-0                         1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-fmt6.1.2-1                       1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-imagemagick7.0.8-0               1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-jansson2.7-0                     1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-json3.7.3-0                      1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-libarchive3.3.2-0                1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-libarchive3.3.2-1                1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-libs359b62371-0                  1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-mungefs1.0.3-0                   1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-nanodbc2.13.0-0                  1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-nanodbc2.13.0-1                  1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-qpid-with-proton0.34-0           1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-qpid-with-proton0.34-1           1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-qpid-with-proton0.34-2           1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-redis4.0.10-0                    1.0~bionic                                      amd64        iRODS Build Dependency
ii  irods-externals-zeromq4-14.1.3-0                 1.0~xenial                                      amd64        iRODS Build Dependency
ii  irods-externals-zeromq4-14.1.6-0                 1.0~xenial                                      amd64        iRODS Build Dependency
rc  irods-icat                                       4.1.12                                          amd64        The integrated Rule-Oriented Data System (iRODS)
hi  irods-icommands                                  4.2.11-1~xenial                                 amd64        The integrated Rule-Oriented Data System
ii  irods-rule-engine-plugin-audit-amqp              4.2.11.0-1~xenial                               amd64        The integrated Rule-Oriented Data System
ii  irods-rule-engine-plugin-document-type           4.2.11.0-1~xenial                               amd64        The integrated Rule-Oriented Data System
ii  irods-rule-engine-plugin-elasticsearch           4.2.11.0-1~xenial                               amd64        The integrated Rule-Oriented Data System
ii  irods-rule-engine-plugin-indexing                4.2.11.0-1~xenial                               amd64        The integrated Rule-Oriented Data System
ii  irods-rule-engine-plugin-unified-storage-tiering 4.2.11.0-1~xenial                               amd64        The integrated Rule-Oriented Data System
hi  irods-runtime                                    4.2.11-1~xenial                                 amd64        The integrated Rule-Oriented Data System
hi  irods-server                                     4.2.11-1~xenial                                 amd64        The integrated Rule-Oriented Data System

Metadata set:

# source tree 
$ imeta ls -R root
AVUs defined for resource root:
attribute: irods::storage_tiering::group
value: ega_transfer_group
units: 0
----
attribute: irods::storage_tiering::query
value: SELECT DATA_NAME, COLL_NAME, USER_NAME, DATA_REPL_NUM  WHERE META_DATA_ATTR_NAME = 'tier:single-copy' AND META_DATA_ATTR_VALUE = '1' AND DATA_RESC_HIER like 'root;replicate%' AND DATA_ACCESS_TYPE >= '1120' AND USER_ZONE = 'seq-dev' AND  USER_TYPE = 'rodsadmin' and USER_NAME like 'irods%'
units:
----
attribute: irods::storage_tiering::time
value: 18000000000
units:

# destination tree
$ imeta ls -R noReplRoot
AVUs defined for resource noReplRoot:
attribute: irods::storage_tiering::group
value: ega_transfer_group
units: 1
----
attribute: irods::storage_tiering::maximum_delay_time_in_seconds
value: 30
units:
----
attribute: irods::storage_tiering::minimum_delay_time_in_seconds
value: 1
units:
----
attribute: irods::storage_tiering::minimum_restage_tier
value: true
units:
----
attribute: irods::storage_tiering::object_limit
value: 500
units:
----
attribute: irods::storage_tiering::verification
value: checksum
units:

server_config.json rule engine stanza:

                {
                 "instance_name": "irods_rule_engine_plugin-storage_tiering-instance",
                 "plugin_name": "irods_rule_engine_plugin-unified_storage_tiering",
                 "plugin_specific_configuration": {
                    "access_time_attribute" : "irods::access_time",
                    "group_attribute" : "irods::storage_tiering::group",
                    "time_attribute" : "irods::storage_tiering::time",
                    "query_attribute" : "irods::storage_tiering::query",
                    "verification_attribute" : "irods::storage_tiering::verification",
                    "data_movement_parameters_attribute" : "irods::storage_tiering::restage_delay",
                    "minimum_restage_tier" : "irods::storage_tiering::minimum_restage_tier",
                    "preserve_replicas" : "irods::storage_tiering::preserve_replicas",
                    "object_limit" : "irods::storage_tiering::object_limit",
                    "default_data_movement_parameters" : "<EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF>",
                    "minumum_delay_time" : "irods::storage_tiering::minimum_delay_time_in_seconds",
                    "maximum_delay_time" : "irods::storage_tiering::maximum_delay_time_in_seconds",
                    "time_check_string" : "TIME_CHECK_STRING",
                    "data_transfer_log_level" : "LOG_NOTICE"
                        }
                },

Note that the root tree has a replication passthru, and so any objects in the tree should have two replicas, whereas noReplRoot, as its name implies, has no such structure.

$ ilsresc root
root:passthru
└── replicate:replication
    ├── remote:random
...
$ ilsresc noReplRoot
noReplRoot:passthru
└── blueNoRepl:random
    └── blue:passthru
        └── irods-g2-dev-sdb:unixfilesystem

We see the rules appear in the queue to migrate the objects once the metadata has been set on the object, and they run without obvious related errors in the rule engine, server, or rodsLog logs. Once the rules have run, however, we are left with three replicas rather than one:

$ ils -l pathogens_irods_test.sh
  jc18              0 root;replicate;remote;red1;irods-seq-i05-fg          838 2023-02-17.15:27 & pathogens_irods_test.sh
  jc18              1 root;replicate;sanger;green1;irods-seq-sr01-ddn-ra08-18-19-20          838 2023-02-17.15:27 & pathogens_irods_test.sh
  jc18              2 noReplRoot;blueNoRepl;blue;irods-g2-dev-sdb          838 2023-02-17.15:35 & pathogens_irods_test.sh
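Until the plugin trims correctly, the stale source-tier replicas can be removed by hand. A sketch, assuming the object and replica numbers shown above (verify the destination replica's checksum before trimming):

```shell
# Sketch of a manual workaround, not the plugin's own behaviour:
ichksum -a pathogens_irods_test.sh          # checksum all replicas for comparison
itrim -M -n 0 -N 1 pathogens_irods_test.sh  # admin-trim replica 0 (source tier)
itrim -M -n 1 -N 1 pathogens_irods_test.sh  # admin-trim replica 1 (source tier)
```

`-N 1` tells itrim to keep at least one good replica, so the destination-tier copy survives.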
kript commented 1 year ago

I should note that I get this behaviour both when I add the metadata with imeta and when I add it with iput --metadata.

alanking commented 1 year ago

We will look into it. My initial thought is that this is related to some changes which occurred in the trim API as a result of the logical locking effort, so that is where I would suggest we start looking.

trel commented 1 year ago

Just noting for future readers - "root tree has a replication passthru" is a confusing phrase... root only has a replication resource.

A passthru is different.

trel commented 1 year ago

I have configured 4.2.11 on an Ubuntu 18.04 container with the following settings...

Set up hierarchies...

irods@1b97fdcaad93:~$ ilsresc
demoResc:unixfilesystem
noReplRoot:passthru
└── blue:random
    ├── ufs3:unixfilesystem
    └── ufs4:unixfilesystem
root:passthru
└── replicate:replication
    ├── ufs1:unixfilesystem
    └── ufs2:unixfilesystem

Configured tiers...

irods@1b97fdcaad93:~$ imeta ls -R root
AVUs defined for resource root:
attribute: irods::storage_tiering::group
value: ega_transfer_group
units: 0
----
attribute: irods::storage_tiering::query
value: SELECT DATA_NAME, COLL_NAME, USER_NAME, DATA_REPL_NUM  WHERE META_DATA_ATTR_NAME = 'tier:single-copy' AND META_DATA_ATTR_VALUE = '1' AND DATA_RESC_HIER like 'root;replicate%' AND DATA_ACCESS_TYPE >= '1120' AND USER_ZONE = 'tempZone' AND  USER_TYPE = 'rodsadmin' and USER_NAME like 'irods%'
units: 
----
attribute: irods::storage_tiering::time
value: 18000000000
units: 

irods@1b97fdcaad93:~$ imeta ls -R noReplRoot
AVUs defined for resource noReplRoot:
attribute: irods::storage_tiering::group
value: ega_transfer_group
units: 1
----
attribute: irods::storage_tiering::maximum_delay_time_in_seconds
value: 30
units: 
----
attribute: irods::storage_tiering::minimum_delay_time_in_seconds
value: 1
units: 
----
attribute: irods::storage_tiering::minimum_restage_tier
value: true
units: 
----
attribute: irods::storage_tiering::object_limit
value: 500
units: 
----
attribute: irods::storage_tiering::verification
value: checksum
units: 

Installed the periodic rule...

irods@1b97fdcaad93:~$ cat sanger.r        
{
   "rule-engine-instance-name": "irods_rule_engine_plugin-unified_storage_tiering-instance",
   "rule-engine-operation": "irods_policy_schedule_storage_tiering",
   "delay-parameters": "<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><PLUSET>1s</PLUSET><EF>1h DOUBLE UNTIL SUCCESS OR 6 TIMES</EF>",
   "storage-tier-groups": [
       "ega_transfer_group"
   ]
}
INPUT null
OUTPUT ruleExecOut
irods@1b97fdcaad93:~$ irule -F sanger.r 
irods@1b97fdcaad93:~$ iqstat
id     name
10042 {"rule-engine-operation":"irods_policy_storage_tiering","storage-tier-groups":["ega_transfer_group"]} 

Then ... testing as a new irodstester rodsadmin account...

root@1b97fdcaad93:/# echo "hola" > hola                                                                                                                                                                                                    
root@1b97fdcaad93:/# iput -R root hola                                                                                                                                                                                                    
root@1b97fdcaad93:/# imeta add -d hola tier:single-copy 1
root@1b97fdcaad93:/# ils -L hola                                                                              
  irodstester       0 root;replicate;ufs1            5 2023-03-15.20:36 & hola
        generic    /tmp/ufs1vault/home/irodstester/hola
  irodstester       1 root;replicate;ufs2            5 2023-03-15.20:36 & hola
        generic    /tmp/ufs2vault/home/irodstester/hola

root@1b97fdcaad93:/# ils -L hola; iqstat -a
  irodstester       2 noReplRoot;blue;ufs3            5 2023-03-15.20:42 & hola
    sha2:Ez7piSk/knNjASgMbxTInVISAMF9zc7MowzSBwUzLUQ=    generic    /tmp/ufs3vault/home/irodstester/hola
No delayed rules pending

It worked! single replica. checksum. on the noReplRoot tier.

But the log was a bit noisy with a caught error due to the attempted migration of the second replica (duplicate key error when trying to move replica 2 since replica 3 was already created on the target resource)...

Not sure, but I think we've already cleaned this up since 4.2.11... can investigate separately...

Mar 15 20:42:47 pid:7627 NOTICE: bindVar[1]=10040
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[2]=10038
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[3]=hola
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[4]=2
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[5]=
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[6]=generic
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[7]=5
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[8]=EMPTY_RESC_GROUP_NAME
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[9]=EMPTY_RESC_NAME
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[10]=EMPTY_RESC_HIER
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[11]=10023
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[12]=/tmp/ufs4vault/home/irodstester/hola
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[13]=irodstester
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[14]=tempZone
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[15]=2
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[16]=
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[17]=
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[18]=00000000000
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[19]=0
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[20]=0
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[21]=
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[22]=01678927367
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[23]=01678927367
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[24]=10023
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[25]=/tmp/ufs4vault/home/irodstester/hola
Mar 15 20:42:47 pid:7627 NOTICE: bindVar[26]=10040
Mar 15 20:42:47 pid:7627 NOTICE: _cllExecSqlNoResult: SQLExecDirect error: -1 sql:insert into R_DATA_MAIN ( data_id,                        coll_id,                         data_name,                        data_repl_num,                        data_version,                        data_type_name,                        data_size,                        resc_group_name,                        resc_name,                        resc_hier,                        resc_id,                        data_path,                        data_owner_name,                        data_owner_zone,                        data_is_dirty,                        data_status,                        data_checksum,                        data_expiry_ts,                        data_map_id,                        data_mode,                        r_comment,                        create_ts,                        modify_ts ) select ?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,? where not exists (select data_id from R_DATA_MAIN where resc_id=? and data_path=? and data_id=?)
Mar 15 20:42:47 pid:7627 NOTICE: SQLSTATE: S1010
Mar 15 20:42:47 pid:7627 NOTICE: SQLCODE: 0
Mar 15 20:42:47 pid:7627 NOTICE: SQL Error message: [unixODBC][Driver Manager]Function sequence error
Mar 15 20:42:47 pid:7627 NOTICE: SQLSTATE: 23505
Mar 15 20:42:47 pid:7627 NOTICE: SQLCODE: 1
Mar 15 20:42:47 pid:7627 NOTICE: SQL Error message: ERROR: duplicate key value violates unique constraint "idx_data_main2"; Error while executing the query
Mar 15 20:42:47 pid:7627 NOTICE: chlRegReplica cmlExecuteNoAnswerSql(insert) failure -809000
Mar 15 20:42:47 pid:7627 NOTICE: chlRegReplica cmlExecuteNoAnswerSql(rollback) succeeded
Mar 15 20:42:47 pid:7627 remote addresses: 10.15.0.6, 127.0.0.1 ERROR: [filePathRegRepl] - failed to register replica for [/tempZone/home/irodstester/hola], status:[-809000]
Mar 15 20:42:47 pid:7627 remote addresses: 10.15.0.6, 127.0.0.1 ERROR: [create_new_replica:362] - failed to register physical path [error_code=[-809000], path=[/tempZone/home/irodstester/hola], hierarchy=[noReplRoot;blue;ufs4]
Mar 15 20:42:48 pid:7627 remote addresses: 10.15.0.6, 127.0.0.1 ERROR: [replicate_data_object:748] - failed to replicate [/tempZone/home/irodstester/hola]
Mar 15 20:42:48 pid:7627 NOTICE: rsDataObjRepl - Failed to replicate data object. status:[-809000]
Mar 15 20:42:48 pid:7620 remote addresses: 10.15.0.6, 127.0.0.1 ERROR: rsExecRuleExpression : -809000, [-]      /repos/irods_capability_storage_tiering/libirods_rule_engine_plugin-unified_storage_tiering.cpp:750:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [CATALOG_ALREADY_HAS_ITEM_BY_THAT_NAME]  errno [] -- message [iRODS Exception:
    file: /repos/irods_capability_storage_tiering/libirods_rule_engine_plugin-unified_storage_tiering.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 110
    code: -809000 (CATALOG_ALREADY_HAS_ITEM_BY_THAT_NAME)
    message:
        failed to migrate [/tempZone/home/irodstester/hola] to [noReplRoot]

But all in all, it did migrate, and it did trim.

Note this was a single container - no networking weirdness, no 'wrong-host', no separate client machine.

trel commented 10 months ago

Sanger has now seen this behave on 4.2.11 and 4.2.12 in testing. Probably a candidate for closing...

korydraughn commented 10 months ago

Agreed. Please close.