irods / irods_client_nfsrods

An nfs4j Virtual File System implementation supporting the iRODS Data Grid
BSD 3-Clause "New" or "Revised" License

i/o error when a single replica is missing #42

Open anderbubble opened 5 years ago

anderbubble commented 5 years ago

I recently got an I/O error from NFSRODS when trying to read a file.

[janderson@fox1 ~]$ sha1sum /mnt/nfsrods/home/janderson/spore.bb
sha1sum: /mnt/nfsrods/home/janderson/spore.bb: Input/output error

This file has three replicas under a replResc.

[janderson@fox1 ~]$ ilsresc
rootResc:passthru
└── replResc:replication
    ├── fox1Resc:unixfilesystem
    ├── mybook:unixfilesystem
    └── rsync_net:unixfilesystem
www:passthru
└── ln1:unixfilesystem

[janderson@fox1 ~]$ ils -AL spore.bb
  janderson         0 rootResc;replResc;fox1Resc        11101 2018-12-03.12:32 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /srv/civilfritz/irods/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own   
  janderson         1 rootResc;replResc;mybook        11101 2018-12-12.15:08 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /media/mybook/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own   
  janderson         2 rootResc;replResc;rsync_net        11101 2019-03-25.21:13 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /media/rsync.net/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own 

One of these resources (rsync_net) was unmounted at the time. After mounting it, the read works.

[janderson@fox1 ~]$ sudo -u irods sshfs -o idmap=user 9807@usw-s009.rsync.net: /media/rsync.net
[janderson@fox1 ~]$ sha1sum /mnt/nfsrods/home/janderson/spore.bb
93b18b58ba1aa3cccdd0be0dfde67ecd73290e58  /mnt/nfsrods/home/janderson/spore.bb

I understand that this is ultimately a limitation in iRODS itself, but if other replicas are available, they should all be consulted before an I/O error is returned.
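To make the request concrete, here is a minimal sketch of the behavior being asked for. It is not NFSRODS or iRODS code; `read_first_available`, `unmounted`, and `healthy` are hypothetical names, and each reader stands in for "fetch the bytes of one replica".

```python
import errno
from typing import Callable, Iterable

# Hypothetical type: a callable that returns the bytes of one replica, or
# raises OSError if that replica's storage cannot be reached.
ReplicaReader = Callable[[], bytes]


def read_first_available(readers: Iterable[ReplicaReader]) -> bytes:
    """Try each replica in order; fail with EIO only if every one fails."""
    last_error = None
    for read in readers:
        try:
            return read()
        except OSError as exc:
            last_error = exc  # remember why this replica failed, then keep going
    raise OSError(errno.EIO, "no replica could be read") from last_error


def unmounted() -> bytes:
    # Stand-in for a replica whose vault is not mounted.
    raise OSError(errno.ENOENT, "vault path not mounted")


def healthy() -> bytes:
    # Stand-in for a replica that can be read normally.
    return b"spore"


data = read_first_available([unmounted, healthy])
```

In this sketch the unmounted replica is skipped and the read succeeds from the next one; the I/O error surfaces only when the list is exhausted.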

anderbubble commented 5 years ago

rodsLog.txt

anderbubble commented 5 years ago

nfsrods-log.txt

anderbubble commented 5 years ago

Looking at nfsrods-log.txt, it also looks like this leads to unhandled exceptions, so those should probably be caught and handled in any case. The result may still be an I/O error, but it should be returned intentionally.

trel commented 5 years ago

Yes, both logs show a -510002 UNIX_FILE_OPEN_ERR because the rsync.net replica (which happened to be unmounted) apparently won the voting and was offered up as the replica to be retrieved and sent to the client.

We will handle that exception more clearly in NFSRODS.
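As a rough illustration of what "more clearly" could mean, the sketch below catches a backend failure, logs it, and raises a deliberate EIO. It is written in Python for brevity and is not the NFSRODS implementation; `BackendError` and `guarded_read` are hypothetical names, and only the error code and name come from the logs discussed above.

```python
import errno
import logging

log = logging.getLogger("nfs_gateway_sketch")


class BackendError(Exception):
    """Hypothetical stand-in for an error raised by the iRODS client layer."""

    def __init__(self, code: int, name: str):
        super().__init__(f"{name} ({code})")
        self.code = code
        self.name = name


def guarded_read(read_fn):
    """Run read_fn and convert any backend failure into an explicit EIO."""
    try:
        return read_fn()
    except BackendError as exc:
        # Log the original cause so the I/O error can be traced afterwards.
        log.error("backend read failed: %s", exc)
        raise OSError(errno.EIO, f"backend read failed: {exc.name}") from exc


def failing_read():
    # Simulates the failure reported in this issue.
    raise BackendError(-510002, "UNIX_FILE_OPEN_ERR")
```

The client still sees an I/O error, but it is produced on purpose, and the underlying error is recorded instead of escaping as an unhandled exception.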

Separately, we are working on adding a retry mechanism to the API calls themselves, so that if this occurs, the whole system can try again. It is not clear that this would actually help here, since this is not an 'intermittent' failure: the disk is simply not there. https://github.com/irods/irods/issues/3480

A monitoring system that marked the rsync_net resource as 'down' would also make it vote 0. This can be done manually with iadmin modresc rsync_net status down.

In the meantime, you can add another passthru under your replication resource, as the parent of rsync_net, and lower its 'read' weight so that the rsync.net replica doesn't win the round of voting when the other replicas are available.

https://docs.irods.org/4.2.6/plugins/composable_resources/#passthru
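The effect of the two workarounds on voting can be shown with a toy model. This is a deliberate simplification and not the iRODS voting code: it assumes every healthy leaf votes 1.0, that a 'down' resource votes 0, and that a passthru parent multiplies its child's vote by the 'read' weight. `Leaf` and `pick_replica` are hypothetical names, and ties go to the first leaf in the list.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    name: str
    up: bool = True            # a resource marked 'down' votes 0
    read_weight: float = 1.0   # weight applied by a passthru parent, if any

    def read_vote(self) -> float:
        base = 1.0 if self.up else 0.0   # assumed base vote for this sketch
        return base * self.read_weight


def pick_replica(leaves):
    """Return the leaf with the highest read vote, or None if all vote 0."""
    best = max(leaves, key=lambda leaf: leaf.read_vote())
    return best if best.read_vote() > 0 else None


# rsync_net sits behind a passthru with a lowered read weight.
leaves = [
    Leaf("rsync_net", read_weight=0.5),
    Leaf("fox1Resc"),
    Leaf("mybook"),
]
winner = pick_replica(leaves)
```

With the lowered weight, rsync_net loses to either local replica even though it is listed first; marking it down (`up=False`) removes it from contention entirely.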