irods / irods_capability_storage_tiering

Tier-out migration fails for all replicas except 1 when multiple replicas exist in a tier #238

Open alanking opened 10 months ago

alanking commented 10 months ago

Bug Report

iRODS Version, OS and Version

iRODS server: 4.3.1
Storage tiering plugin: 4.3.1
OS: CentOS 7

What did you try to do?

Set up a tier group with 3 resource hierarchies:

$ ilsresc
repl1:replication
├── ufs1:unixfilesystem
├── ufs2:unixfilesystem
└── ufs3:unixfilesystem
repl2:replication
├── ufs4:unixfilesystem
├── ufs5:unixfilesystem
└── ufs6:unixfilesystem
repl3:replication
├── ufs7:unixfilesystem
├── ufs8:unixfilesystem
└── ufs9:unixfilesystem
$ imeta ls -R repl1
AVUs defined for resource repl1:
attribute: irods::storage_tiering::group
value: example_group
units: 0
----
attribute: irods::storage_tiering::time
value: 5
units: 
$ imeta ls -R repl2
AVUs defined for resource repl2:
attribute: irods::storage_tiering::group
value: example_group
units: 1
----
attribute: irods::storage_tiering::time
value: 10
units: 
$ imeta ls -R repl3
AVUs defined for resource repl3:
attribute: irods::storage_tiering::group
value: example_group
units: 2

Then I put an object in...

$ iput -R repl1 irodsctl foo
$ ils -l foo
  rods              0 repl1;ufs1          284 2023-12-12.20:10 & foo
  rods              1 repl1;ufs2          284 2023-12-12.20:10 & foo
  rods              2 repl1;ufs3          284 2023-12-12.20:10 & foo
$ imeta ls -d foo
AVUs defined for dataObj /tempZone/home/rods/foo:
attribute: irods::access_time
value: 1702411848
units: 

Then I tried to tier it out:

$ irule -r irods_rule_engine_plugin-unified_storage_tiering-instance -F example_unified_tiering_invocation.r 
$ iqstat
id     name
10041 {"rule-engine-operation":"irods_policy_storage_tiering","storage-tier-groups":["example_group_g2","example_group"]} 

Expected behavior

I expected the tier out to occur with no errors or issues.

Observed behavior (including steps to reproduce, if applicable)

After a while, 3 migrations are scheduled:

$ iqstat
id     name
10042 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>16s</PLUSET>","destination-resource":"repl2","group-name":"example_group","md5":"130409f4c26e9054a7e99870855e9b8b","object-path":"/tempZone/home/rods/foo","preserve-replicas":false,"rule-engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"0","source-resource":"repl1","user-name":"rods","verification-type":"catalog"} 
10043 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>22s</PLUSET>","destination-resource":"repl2","group-name":"example_group","md5":"4274ed5655c75052412bc44e691bdedd","object-path":"/tempZone/home/rods/foo","preserve-replicas":false,"rule-engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"2","source-resource":"repl1","user-name":"rods","verification-type":"catalog"} 
10044 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>21s</PLUSET>","destination-resource":"repl2","group-name":"example_group","md5":"a55f702cdaa627c7d7a6b294cc04d0fb","object-path":"/tempZone/home/rods/foo","preserve-replicas":false,"rule-engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"1","source-resource":"repl1","user-name":"rods","verification-type":"catalog"} 
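To make the overlap between these three tasks easier to see, here is a minimal sketch that parses the queued JSON payloads and groups them by the move they request. The payloads are abbreviated copies of the ones above (only four fields are kept), and `group_by_move` is a hypothetical helper for illustration, not part of the plugin.

```python
import json
from collections import defaultdict

# Abbreviated copies of the three payloads reported by iqstat above.
# Only the fields needed to compare the requested moves are kept.
queued = {
    10042: '{"object-path":"/tempZone/home/rods/foo","source-resource":"repl1",'
           '"destination-resource":"repl2","source-replica-number":"0"}',
    10043: '{"object-path":"/tempZone/home/rods/foo","source-resource":"repl1",'
           '"destination-resource":"repl2","source-replica-number":"2"}',
    10044: '{"object-path":"/tempZone/home/rods/foo","source-resource":"repl1",'
           '"destination-resource":"repl2","source-replica-number":"1"}',
}


def group_by_move(entries):
    """Group rule IDs by (object path, source resource, destination resource)."""
    groups = defaultdict(list)
    for rule_id, payload in entries.items():
        task = json.loads(payload)
        key = (
            task["object-path"],
            task["source-resource"],
            task["destination-resource"],
        )
        groups[key].append((rule_id, task["source-replica-number"]))
    return dict(groups)


# Keep only the moves that were requested by more than one queued task.
overlapping = {k: v for k, v in group_by_move(queued).items() if len(v) > 1}
for move, tasks in overlapping.items():
    print(move, "requested by", sorted(tasks))
```

All three tasks differ only in `source-replica-number`; the object, source resource, and destination resource are identical.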

The tier-out succeeds!

$ ils -l foo
  rods              3 repl2;ufs4          284 2023-12-12.20:12 & foo
  rods              4 repl2;ufs5          284 2023-12-12.20:12 & foo
  rods              5 repl2;ufs6          284 2023-12-12.20:12 & foo
$ imeta ls -d foo
AVUs defined for dataObj /tempZone/home/rods/foo:
attribute: irods::access_time
value: 1702411944
units: 
----
attribute: irods::storage_tiering::group
value: example_group
units: 3

...but there are 2 migration tasks remaining:

$ iqstat
id     name
10042 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>16s</PLUSET>","destination-resource":"repl2","group-name":"example_group","md5":"130409f4c26e9054a7e99870855e9b8b","object-path":"/tempZone/home/rods/foo","preserve-replicas":false,"rule-engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"0","source-resource":"repl1","user-name":"rods","verification-type":"catalog"} 
10043 {"delay_conditions":"<INST_NAME>irods_rule_engine_plugin-unified_storage_tiering-instance</INST_NAME><EF>60s DOUBLE UNTIL SUCCESS OR 5 TIMES</EF><PLUSET>22s</PLUSET>","destination-resource":"repl2","group-name":"example_group","md5":"4274ed5655c75052412bc44e691bdedd","object-path":"/tempZone/home/rods/foo","preserve-replicas":false,"rule-engine-instance-name":"irods_rule_engine_plugin-unified_storage_tiering-instance","rule-engine-operation":"irods_policy_data_movement","source-replica-number":"2","source-resource":"repl1","user-name":"rods","verification-type":"catalog"} 

...and a bunch of errors in the log. The first errors have to do with duplicate entries in the catalog:

"apply_policy_for_tier_group :: no resources found for group [example_group_g2]"
"chlRegReplica cmlExecuteNoAnswerSql(insert) failure -819000"
"chlRegReplica cmlExecuteNoAnswerSql(rollback) succeeded"
"[filePathRegRepl] - failed to register replica for [/tempZone/home/rods/foo], status:[-819000]"
"[create_new_replica:361] - failed to register physical path [error_code=[-819000], path=[/tempZone/home/rods/foo], hierarchy=[repl2;ufs4]"
"[filePathRegRepl] - failed to register replica for [/tempZone/home/rods/foo], status:[-46000]"
"[create_new_replica:361] - failed to register physical path [error_code=[-46000], path=[/tempZone/home/rods/foo], hierarchy=[repl2;ufs4]"
"[replicate_data_object:777] - failed to replicate [/tempZone/home/rods/foo]"
"rsDataObjRepl - Failed to replicate data object. status:[-46000]"
"[replicate_data_object:777] - failed to replicate [/tempZone/home/rods/foo]"
"rsDataObjRepl - Failed to replicate data object. status:[-819000]"
"[-]    /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp:753:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [SYS_COPY_ALREADY_IN_RESC]  errno [] -- message [iRODS Exception:
    file: /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 110 
    code: -46000 (SYS_COPY_ALREADY_IN_RESC)
    message:
        failed to migrate [/tempZone/home/rods/foo] to [repl2]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.1
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.1
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.1
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
15# main::$_5::operator()() const in /usr/sbin/irodsServer
16# main in /usr/sbin/irodsServer
17# __libc_start_main in /lib64/libc.so.6
18# _start in /usr/sbin/irodsServer

]

"
"[-]    /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp:753:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [CAT_SUCCESS_BUT_WITH_NO_INFO]  errno [] -- message [iRODS Exception:
    file: /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 110 
    code: -819000 (CAT_SUCCESS_BUT_WITH_NO_INFO)
    message:
        failed to migrate [/tempZone/home/rods/foo] to [repl2]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.1
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.1
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.1
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
15# main::$_5::operator()() const in /usr/sbin/irodsServer
16# main in /usr/sbin/irodsServer
17# __libc_start_main in /lib64/libc.so.6
18# _start in /usr/sbin/irodsServer

]

"    

Later, other errors appear that have to do with a missing source replica:

"[rsDataObjRepl:995] - [SYS_REPLICA_INACCESSIBLE: hierarchy descending from specified source resource name [repl1] does not have a replica or the replica is inaccessible at this time

]"
"[rsDataObjRepl:995] - [SYS_REPLICA_INACCESSIBLE: hierarchy descending from specified source resource name [repl1] does not have a replica or the replica is inaccessible at this time

]"
"rsDataObjRepl - Failed to replicate data object. status:[-168000]"
"rsDataObjRepl - Failed to replicate data object. status:[-168000]"
"[-]    /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp:753:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [SYS_REPLICA_INACCESSIBLE]  errno [] -- message [iRODS Exception:
    file: /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 110
    code: -168000 (SYS_REPLICA_INACCESSIBLE)
    message:
        failed to migrate [/tempZone/home/rods/foo] to [repl2]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.1
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.1
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.1
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
15# main::$_5::operator()() const in /usr/sbin/irodsServer
16# main in /usr/sbin/irodsServer
17# __libc_start_main in /lib64/libc.so.6
18# _start in /usr/sbin/irodsServer

]

"
"[-]    /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp:753:irods::error exec_rule_expression(irods::default_re_ctx &, const std::string &, msParamArray_t *, irods::callback) :  status [SYS_REPLICA_INACCESSIBLE]  errno [] -- message [iRODS Exception:
    file: /irods_plugin_source/libirods_rule_engine_plugin-unified_storage_tiering.cpp
    function: void (anonymous namespace)::replicate_object_to_resource(rcComm_t *, const std::string &, const std::string &, const std::string &, const std::string &)
    line: 110
    code: -168000 (SYS_REPLICA_INACCESSIBLE)
    message:
        failed to migrate [/tempZone/home/rods/foo] to [repl2]
stack trace:
--------------
 0# irods::stacktrace::dump() const in /lib/libirods_common.so.4.3.1
 1# irods::exception::assemble_full_display_what() const in /lib/libirods_common.so.4.3.1
 2# irods::exception::what() const in /lib/libirods_common.so.4.3.1
 3# exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 4# std::__1::__function::__func<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback), std::__1::allocator<irods::error (*)(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>, irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*&&, irods::callback&&) in /usr/lib/irods/plugins/rule_engines/libirods_rule_engine_plugin-unified_storage_tiering.so
 5# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.1
 6# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.1
 7# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 8# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
 9# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.1
10# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.1
11# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.1
12# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.1
13# agentMain(RsComm*) in /lib/libirods_server.so.4.3.1
14# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.1
15# main::$_5::operator()() const in /usr/sbin/irodsServer
16# main in /usr/sbin/irodsServer
17# __libc_start_main in /lib64/libc.so.6
18# _start in /usr/sbin/irodsServer

]

"
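For triage, the status codes in these messages can be tallied with a few lines of Python. This is only a sketch: the sample lines are copied from the log excerpts above, the code-to-name mapping uses the three pairs that appear in the exception messages in this report, and `tally` is a hypothetical helper.

```python
import re
from collections import Counter

# Code/name pairs exactly as they appear in the exception messages above.
ERROR_NAMES = {
    -46000: "SYS_COPY_ALREADY_IN_RESC",
    -819000: "CAT_SUCCESS_BUT_WITH_NO_INFO",
    -168000: "SYS_REPLICA_INACCESSIBLE",
}

# Sample lines copied from the log excerpts in this report.
LOG_LINES = [
    "rsDataObjRepl - Failed to replicate data object. status:[-46000]",
    "rsDataObjRepl - Failed to replicate data object. status:[-819000]",
    "rsDataObjRepl - Failed to replicate data object. status:[-168000]",
    "rsDataObjRepl - Failed to replicate data object. status:[-168000]",
]

STATUS = re.compile(r"status:\[(-\d+)\]")


def tally(lines):
    """Count occurrences of each status code, keyed by name when known."""
    counts = Counter()
    for line in lines:
        match = STATUS.search(line)
        if match:
            code = int(match.group(1))
            counts[ERROR_NAMES.get(code, str(code))] += 1
    return counts


counts = tally(LOG_LINES)
for name, n in sorted(counts.items()):
    print(name, n)
```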

Eventually, the remaining migration jobs fail enough times to exhaust their retries (the delay condition shown above is "60s DOUBLE UNTIL SUCCESS OR 5 TIMES") and are removed from the queue.

It seems like all the migration jobs start at the same time and one of them wins the race, locking out the others. It is mildly concerning that it gets all the way to the point of registering the physical path before an error occurs (the "database race", I assume: https://github.com/irods/irods/issues/5742#issuecomment-905439542) but after that scare, logical locking should keep things sane.

If we consider the "tracked" replica to be the "representative" replica for the group of replicas, it is the only one that needs to be scheduled for replication. The plugin seems to take care of the trimming of the other replicas, so we don't need to worry about that.
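A minimal sketch of that idea, assuming the scheduler can see every replica in the source tier: pick one representative replica per object and schedule only that one. How the plugin actually identifies the "tracked" replica is not shown in this report, so the lowest replica number stands in for it here. `Replica` and `plan_migrations` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    path: str           # logical path of the data object
    number: int         # replica number as shown by `ils -l`
    root_resource: str  # root of the resource hierarchy, e.g. "repl1"


def plan_migrations(replicas, source, destination):
    """Return one (path, replica_number, source, destination) tuple per object.

    Assumption: the lowest replica number in the source tier stands in for
    the "tracked" replica described above.
    """
    representative = {}
    for replica in replicas:
        if replica.root_resource != source:
            continue
        current = representative.get(replica.path)
        if current is None or replica.number < current.number:
            representative[replica.path] = replica
    return [
        (r.path, r.number, source, destination)
        for r in representative.values()
    ]


# The three replicas of foo under repl1, as in the `ils -l` output above.
replicas = [Replica("/tempZone/home/rods/foo", n, "repl1") for n in (0, 1, 2)]
plan = plan_migrations(replicas, "repl1", "repl2")
print(plan)
```

With the three replicas from this report, the plan contains a single migration instead of three competing ones.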

Open to other ideas.

trel commented 10 months ago

Right, we only have to make the replication jobs each 'check' before doing any work to replicate. Then, if it's already in the desired state, return early... no errors.

Oh, the first three fire at the same time... so it still might be a little noisy. Hmm....
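One way to picture the "check first, return early" approach when three jobs fire together is the sketch below. It uses an in-memory stand-in for the catalog and a `threading.Lock` so that the check and the move happen in one critical section. None of this is iRODS API: `FakeCatalog` and `migrate` are hypothetical, and the lock only illustrates whatever serialization the real system would provide.

```python
import threading


class FakeCatalog:
    """In-memory stand-in for the catalog; not an iRODS API."""

    def __init__(self, replicas):
        self._lock = threading.Lock()
        # Maps logical path -> set of root resources holding a replica.
        self.replicas = {path: set(roots) for path, roots in replicas.items()}

    def migrate(self, path, source, destination):
        """Move the object unless it is already in the desired state."""
        with self._lock:
            located = self.replicas[path]
            # Early return: already at the destination, or the source
            # replicas are gone because another job finished the move.
            if destination in located or source not in located:
                return "skipped"
            located.add(destination)
            located.discard(source)
            return "migrated"


catalog = FakeCatalog({"/tempZone/home/rods/foo": {"repl1"}})
results = []
results_lock = threading.Lock()


def job():
    outcome = catalog.migrate("/tempZone/home/rods/foo", "repl1", "repl2")
    with results_lock:
        results.append(outcome)


# Three jobs fire at once, mirroring the three queued migrations.
threads = [threading.Thread(target=job) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Exactly one job performs the move and the other two return without raising. If the check ran outside the critical section, two jobs could both pass it before either recorded the move, which is the noisy case described above.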