irods / irods

Open Source Data Management Software
https://irods.org
BSD 3-Clause "New" or "Revised" License

irepl -a -U fails when data object doesn't have a 0 replica #7097

Open tedgin opened 1 year ago

tedgin commented 1 year ago

Bug Report

iRODS Version, OS and Version

iRODS 4.2.8 on CentOS 7

What did you try to do?

I have a data object with a stale replica that I tried to update using irepl -a -U. The data object doesn't have a replica with a replica number of 0.

ipc_admin@prod ~? ils -L /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs
  raptorslab        2 CyVerseRes;holmesBroker;holmes      7118199 2023-05-04.18:15 & nsd.obs
    7d80026eea72bcb14ce40d9c07c396e7    generic    /irods_vault/home/shared/VerSSA/XDA/MPC/nsd.obs
  raptorslab        3 taccRes;corral4      7083045 2023-04-26.09:47   nsd.obs
    508dd0055ceb748b10b0a02b8631cd8d    generic    /corral/irods/iplant/Vault/home/shared/VerSSA/XDA/MPC/nsd.obs
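
For anyone checking other data objects for the same condition, here is a minimal Python sketch that reads `ils -L` output like the listing above and reports which replica numbers exist. The regular expression and the helper names (`parse_replicas`, `has_replica_zero`) are assumptions made for illustration, not part of iRODS, and the sketch treats a missing `&` as "not marked good", which matches this listing.

```python
import re

# First line of each replica entry in `ils -L` output, as seen in this report:
#   owner  repl_num  resource_hierarchy  size  timestamp  [&]  name
REPLICA_LINE = re.compile(
    r"^\s*(?P<owner>\S+)\s+(?P<repl_num>\d+)\s+(?P<hier>\S+)\s+(?P<size>\d+)\s+"
    r"(?P<mtime>\d{4}-\d{2}-\d{2}\.\d{2}:\d{2})\s+(?P<good>&\s+)?(?P<name>.+?)\s*$"
)

# Listing copied from this report.
SAMPLE = """\
  raptorslab        2 CyVerseRes;holmesBroker;holmes      7118199 2023-05-04.18:15 & nsd.obs
    7d80026eea72bcb14ce40d9c07c396e7    generic    /irods_vault/home/shared/VerSSA/XDA/MPC/nsd.obs
  raptorslab        3 taccRes;corral4      7083045 2023-04-26.09:47   nsd.obs
    508dd0055ceb748b10b0a02b8631cd8d    generic    /corral/irods/iplant/Vault/home/shared/VerSSA/XDA/MPC/nsd.obs
"""


def parse_replicas(ils_output):
    """Return one dict per replica line; checksum/path lines are skipped."""
    replicas = []
    for line in ils_output.splitlines():
        match = REPLICA_LINE.match(line)
        if match:
            replicas.append({
                "repl_num": int(match.group("repl_num")),
                "hierarchy": match.group("hier"),
                "size": int(match.group("size")),
                "good": match.group("good") is not None,
            })
    return replicas


def has_replica_zero(replicas):
    """True if any replica carries replica number 0."""
    return any(r["repl_num"] == 0 for r in replicas)


if __name__ == "__main__":
    found = parse_replicas(SAMPLE)
    print([r["repl_num"] for r in found], has_replica_zero(found))
```

For the listing in this report, the sketch finds replicas 2 and 3 only, with replica 3 not marked good.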

Expected behavior

I expected the stale replica to be updated with the contents of the other one.

Observed behavior (including steps to reproduce, if applicable)

ipc_admin@prod ~? irepl -M -U -a -v /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs
remote addresses: 206.207.252.35 ERROR: replUtil: repl error for /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs, status = -819000 status = -819000 CAT_SUCCESS_BUT_WITH_NO_INFO

I changed the replica on CyVerseRes to have replica number 0, and the same irepl call succeeded.

ICAT=# UPDATE r_data_main SET data_repl_num = 0 WHERE data_path = '/irods_vault/home/shared/VerSSA/XDA/MPC/nsd.obs';
UPDATE 1
ipc_admin@prod ~? ils -L /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs
  raptorslab        0 CyVerseRes;holmesBroker;holmes      7118199 2023-05-04.18:15 & nsd.obs
    7d80026eea72bcb14ce40d9c07c396e7    generic    /irods_vault/home/shared/VerSSA/XDA/MPC/nsd.obs
  raptorslab        3 taccRes;corral4      7083045 2023-04-26.09:47   nsd.obs
    508dd0055ceb748b10b0a02b8631cd8d    generic    /corral/irods/iplant/Vault/home/shared/VerSSA/XDA/MPC/nsd.obs
ipc_admin@prod ~? irepl -M -U -a -v /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs
   nsd.obs                         6.788 MB | 2.356 sec | 1 thr |  2.881 MB/s
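
To gauge how many data objects might be affected, a read-only query that groups replicas by data object and keeps those whose lowest replica number is above 0 may help. The sketch below runs that query against an in-memory SQLite stand-in so it is runnable anywhere; the real ICAT here is a separate DBMS, the table is reduced to three columns, and the `data_id` values and the second object's paths are hypothetical.

```python
import sqlite3

# Query to find data objects that have no replica numbered 0.
NO_ZERO_REPLICA_QUERY = """
SELECT data_id, MIN(data_repl_num) AS lowest_repl_num
FROM r_data_main
GROUP BY data_id
HAVING MIN(data_repl_num) > 0
ORDER BY data_id
"""


def build_demo_catalog():
    """Create a tiny stand-in for r_data_main (assumption: only 3 columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE r_data_main ("
        "data_id INTEGER, data_repl_num INTEGER, data_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO r_data_main VALUES (?, ?, ?)",
        [
            # Object from this report (data_id 1 is hypothetical).
            (1, 2, "/irods_vault/home/shared/VerSSA/XDA/MPC/nsd.obs"),
            (1, 3, "/corral/irods/iplant/Vault/home/shared/VerSSA/XDA/MPC/nsd.obs"),
            # Hypothetical object that does have a replica 0.
            (2, 0, "/irods_vault/example/has_zero.dat"),
            (2, 1, "/corral/irods/iplant/Vault/example/has_zero.dat"),
        ],
    )
    return conn


def objects_without_replica_zero(conn):
    """Return (data_id, lowest_repl_num) for objects lacking replica 0."""
    return conn.execute(NO_ZERO_REPLICA_QUERY).fetchall()


if __name__ == "__main__":
    print(objects_without_replica_zero(build_demo_catalog()))
```

The query only reads from the table; it is meant for diagnosis, not as a suggestion to renumber replicas directly in the catalog.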

alanking commented 1 year ago

I was not able to reproduce this with 4.2.8 on CentOS 7. Wondering if default resources and resource hierarchies are at play here?

trel commented 1 year ago

Why did you try changing the repl_num to 0? Have you had a reason to know zero was somehow special before? I ask since that doesn’t sound familiar to me.

tedgin commented 1 year ago

Here's what was logged on the catalog service provider. Nothing was logged on the source or destination resource server, nor was anything logged by the DBMS.

May  4 15:45:30 pid:163804 NOTICE: chlModDataObjMeta cmlExecuteNoAnswerSql(rollback) succeeded
May  4 15:45:30 pid:163804 NOTICE: chlModDataObjMeta cmlModifySingleTable failure -819000
May  4 15:45:30 pid:163804 remote addresses: 129.114.60.127, 206.207.252.35, 206.207.252.42, 206.207.252.52, 206.207.252.53, 206.207.252.56, 206.207.252.60, 206.207.252.65, 206.207.252.69, 206.207.252.73, 206.207.252.74, 206.207.252.75, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: [-]  /irods_git_repo/server/api/src/rsModDataObjMeta.cpp:164:int _rsModDataObjMeta(rsComm_t *, modDataObjMeta_t *) :  status [CAT_SUCCESS_BUT_WITH_NO_INFO]  errno [] -- message [_rsModDataObjMeta - Failed to modify the database for object "/iplant/home/shared/VerSSA/XDA/MPC/nsd.obs" - CAT_SUCCESS_BUT_WITH_NO_INFO ]
May  4 15:45:30 pid:163804 NOTICE: _rsModDataObjMeta - Failed updating the database with object info.
May  4 15:45:30 pid:163804 remote addresses: 129.114.60.127, 206.207.252.35, 206.207.252.42, 206.207.252.52, 206.207.252.53, 206.207.252.56, 206.207.252.60, 206.207.252.65, 206.207.252.69, 206.207.252.73, 206.207.252.74, 206.207.252.75, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: dataOpen: Could not update size of data object [status = -819000, path = /iplant/home/shared/VerSSA/XDA/MPC/nsd.obs]
May  4 15:45:30 pid:163804 NOTICE: _rsDataObjRepl - Failed to update replica.
May  4 15:45:30 pid:163804 NOTICE: rsDataObjRepl - Failed to replicate data object.

tedgin commented 1 year ago

I didn't have a reason to believe zero was a special value for repl_num, other than a hunch. Zero is often a special value in software.

tedgin commented 1 year ago

Here is some information about our default resource logic and resource hierarchies.

Here's a truncated version of our resource hierarchy.

CyVerseRes:random
├── holmesBroker:passthru
│   └── holmes:unixfilesystem
...

taccRes:passthru
└── corral4:unixfilesystem

CyVerseRes has 37 child resources.

Here are the details of the source resource hierarchy.

resource name: CyVerseRes
id: 272212647
zone: iplant
type: random
class: cache
location: EMPTY_RESC_HOST
vault: EMPTY_RESC_PATH
free space: 
free space time: : Never
status: up
info: 
comment: 
create time: 01472831627: 2016-09-02.08:53:47
modify time: 01624387116: 2021-06-22.11:38:36
context: 
parent: 
parent context: 

resource name: holmesBroker
id: 307570185
zone: iplant
type: passthru
class: cache
location: EMPTY_RESC_HOST
vault: EMPTY_RESC_PATH
free space: 
free space time: : Never
status: up
info: 
comment: 
create time: 01496178747: 2017-05-30.14:12:27
modify time: 01675813282: 2023-02-07.16:41:22
context: write=1.0;read=1.0
parent: 272212647
parent context: 

resource name: holmes
id: 234073155
zone: iplant
type: unixfilesystem
class: archive
location: holmes.cyverse.org
vault: /irods_vault
free space: 26833101099008
free space time: 01683556251: 2023-05-08.07:30:51
status: up
info: 
comment: 
create time: 01444252820: 2015-10-07.14:20:20
modify time: 01683556251: 2023-05-08.07:30:51
context: minimum_free_space_for_create_in_bytes=26835874270003
parent: 307570185
parent context: 

Here are the details of the destination resource hierarchy.

resource name: taccRes
id: 762364370
zone: iplant
type: passthru
class: cache
location: EMPTY_RESC_HOST
vault: EMPTY_RESC_PATH
free space: 
free space time: : Never
status: up
info: 
comment: 
create time: 01629478645: 2021-08-20.09:57:25
modify time: 01674579600: 2023-01-24.10:00:00
context: read=0.9;write=0.9
parent: 
parent context:

resource name: corral4
id: 762364456
zone: iplant
type: unixfilesystem
class: cache
location: iplant-irods.tacc.utexas.edu
vault: /corral/irods/iplant/Vault
free space: 22843820982927360
free space time: 01683563607: 2023-05-08.09:33:27
status: up
info: 
comment: 
create time: 01629479048: 2021-08-20.10:04:08
modify time: 01683563607: 2023-05-08.09:33:27
context: minimum_free_space_for_create_in_bytes=4183508384953139
parent: 762364370
parent context: 

Our default resource logic is set up so that a file that is uploaded to the catalog service provider or any resource server hosting a resource in the CyVerseRes resource will have its default resource be CyVerseRes and its default replication resource be taccRes. Any file uploaded directly to the resource server at TACC will have its default resource be taccRes and its default replication resource be CyVerseRes. We do this by setting the default resource in irods_environment.json and server_config.json and by implementing the PEPs acSetRescSchemeForCreate and acSetRescSchemeForRepl.

The default resource set in irods_environment.json and server_config.json on the catalog provider and the source resource server holmes is CyVerseRes. The default resource on the destination resource server iplant is taccRes.

acSetRescSchemeForCreate executes msiSetDefaultResc(*resc, 'preferred') where *resc = 'CyVerseRes' on the catalog provider and the source resource server, and *resc = 'taccRes' on the destination resource server.

Similarly, acSetRescSchemeForRepl executes msiSetDefaultResc(*replResc, 'preferred') where *replResc = 'taccRes' on the catalog provider and the source resource server, and *replResc = 'CyVerseRes' on the destination resource server.
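
As a compact summary of the default resource logic described above, here is a small Python model of which resources each kind of server ends up with. It is only a sketch of the described behavior, not the actual rule code; the role strings and the `scheme_for` helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class RescScheme(NamedTuple):
    default_resc: str  # value passed to msiSetDefaultResc in acSetRescSchemeForCreate
    repl_resc: str     # value passed to msiSetDefaultResc in acSetRescSchemeForRepl


CYVERSE_SIDE = RescScheme(default_resc="CyVerseRes", repl_resc="taccRes")
TACC_SIDE = RescScheme(default_resc="taccRes", repl_resc="CyVerseRes")

# Hypothetical role names; the mapping itself follows the description above.
SCHEME_BY_ROLE = {
    "catalog_provider": CYVERSE_SIDE,
    "source_resource_server": CYVERSE_SIDE,
    "destination_resource_server": TACC_SIDE,
}


def scheme_for(role):
    """Return the default and replication resources for a server role."""
    try:
        return SCHEME_BY_ROLE[role]
    except KeyError:
        raise ValueError(f"unknown server role: {role!r}") from None


if __name__ == "__main__":
    for role in SCHEME_BY_ROLE:
        print(role, scheme_for(role))
```

In this model the failing irepl call, issued against the catalog provider, would target taccRes as the replication resource, which matches the stale replica on taccRes;corral4.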

alanking commented 1 year ago

Thanks for the info. We'll work on trying to reproduce this.

tedgin commented 1 year ago

This same issue prevents ibun -c from creating a bundle when a file being added to the bundle needs to be replicated to another resource and that file doesn't have a 0 replica. I'm guessing it fails because ibun is using the same logic as irepl in this case.

Here's an example.

ipc_admin@prod /i/h/chryslerherbarium? ibun -c -D gzipTar CHRB.tgz CHRB
remote addresses: 206.207.252.35 ERROR: bunUtil: opr error for /iplant/home/chryslerherbarium/CHRB, status = -310000 status = -310000 USER_FILE_DOES_NOT_EXIST

The log on the catalog provider shows the following error message.

Aug 22 10:38:50 pid:510733 remote addresses: 129.114.60.127, 206.207.252.53, 206.207.252.60, 206.207.252.64, 206.207.252.65, 206.207.252.69, 206.207.252.70, 206.207.252.74, 206.207.252.75, 206.207.252.76, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: [-]  /irods_git_repo/server/api/src/rsModDataObjMeta.cpp:164:int _rsModDataObjMeta(rsComm_t *, modDataObjMeta_t *) :  status [CAT_SUCCESS_BUT_WITH_NO_INFO]  errno [] -- message [_rsModDataObjMeta - Failed to modify the database for object "/iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg" - CAT_SUCCESS_BUT_WITH_NO_INFO ]
Aug 22 10:38:50 pid:510733 NOTICE: _rsModDataObjMeta - Failed updating the database with object info.
Aug 22 10:38:50 pid:510733 remote addresses: 129.114.60.127, 206.207.252.53, 206.207.252.60, 206.207.252.64, 206.207.252.65, 206.207.252.69, 206.207.252.70, 206.207.252.74, 206.207.252.75, 206.207.252.76, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: dataOpen: Could not update size of data object [status = -819000, path = /iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg]
Aug 22 10:38:50 pid:510733 NOTICE: chlModDataObjMeta cmlExecuteNoAnswerSql(rollback) succeeded
Aug 22 10:38:50 pid:510733 NOTICE: chlModDataObjMeta cmlModifySingleTable failure -819000
Aug 22 10:38:50 pid:510733 remote addresses: 129.114.60.127, 206.207.252.53, 206.207.252.60, 206.207.252.64, 206.207.252.65, 206.207.252.69, 206.207.252.70, 206.207.252.74, 206.207.252.75, 206.207.252.76, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: [-]  /irods_git_repo/server/api/src/rsModDataObjMeta.cpp:164:int _rsModDataObjMeta(rsComm_t *, modDataObjMeta_t *) :  status [CAT_SUCCESS_BUT_WITH_NO_INFO]  errno [] -- message [_rsModDataObjMeta - Failed to modify the database for object "/iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg" - CAT_SUCCESS_BUT_WITH_NO_INFO ]
Aug 22 10:38:50 pid:510733 NOTICE: _rsModDataObjMeta - Failed updating the database with object info.
Aug 22 10:38:50 pid:510733 remote addresses: 129.114.60.127, 206.207.252.53, 206.207.252.60, 206.207.252.64, 206.207.252.65, 206.207.252.69, 206.207.252.70, 206.207.252.74, 206.207.252.75, 206.207.252.76, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: dataOpen: Could not update size of data object [status = -819000, path = /iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg]
Aug 22 10:38:50 pid:510733 NOTICE: _rsDataObjRepl - Failed to update replica.
Aug 22 10:38:50 pid:510733 remote addresses: 129.114.60.127, 206.207.252.53, 206.207.252.60, 206.207.252.64, 206.207.252.65, 206.207.252.69, 206.207.252.70, 206.207.252.74, 206.207.252.75, 206.207.252.76, 206.207.252.77, 206.207.252.83, 206.207.252.84, 206.207.252.85, 206.207.252.86, 206.207.252.87 ERROR: chkCollForBundleOpr: /iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg no good copy in CyVerseRes [-819000]
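
Since the same failure repeats across many log lines, a short Python sketch for pulling the failing paths and status codes out of such a log may be useful. The helper names are assumptions, the code table lists only the two codes that appear in this thread, and the split into a base code and a trailing errno component reflects my understanding that iRODS statuses are multiples of 1000 with any system errno in the last three digits.

```python
import re

# Only the two codes that appear in this thread.
KNOWN_CODES = {
    -819000: "CAT_SUCCESS_BUT_WITH_NO_INFO",
    -310000: "USER_FILE_DOES_NOT_EXIST",
}

# Matches fragments such as "[status = -819000, path = /zone/home/x]".
STATUS_PATH = re.compile(r"\[status = (-?\d+), path = ([^\]]+)\]")

# One log line copied from this report, with the address list elided.
SAMPLE_LOG = (
    "Aug 22 10:38:50 pid:510733 remote addresses: ... ERROR: dataOpen: "
    "Could not update size of data object [status = -819000, "
    "path = /iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg]"
)


def split_status(status):
    """Split a status into (base_code, errno_part), e.g. -819000 -> (-819000, 0)."""
    magnitude = abs(status)
    errno_part = magnitude % 1000
    return -(magnitude - errno_part), errno_part


def failed_paths(log_text):
    """Return (path, code_name) pairs for every status/path fragment found."""
    results = []
    for status, path in STATUS_PATH.findall(log_text):
        base, _ = split_status(int(status))
        # Fall back to the numeric code when the name is not in the table.
        results.append((path, KNOWN_CODES.get(base, str(base))))
    return results


if __name__ == "__main__":
    for path, name in failed_paths(SAMPLE_LOG):
        print(name, path)
```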

Here's more info on the problem file.

[Tue 08/22 10:57:41] tedgin@ds-adm ~
ipc_admin@prod /i/h/chryslerherbarium? ils -L /iplant/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg       
  chryslerherb      1 taccRes;corral4      5644583 2020-03-20.10:00 & CHRB0065282.jpg
    afa266886ac982adc18a16d17edb1a28    generic    /corral/irods/iplant/Vault/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg
  chryslerherb      2 CyVerseRes;ds19Broker;ds19      5644583 2020-12-15.01:57 & CHRB0065282.jpg
    afa266886ac982adc18a16d17edb1a28    generic    /irods_vault/ds19/home/shared/pcc_tcn/CHRB/2019_08_29/CHRB0065282.jpg