phobos-storage / phobos

This repository holds the source code for Phobos, a Parallel Heterogeneous Object Store.
GNU Lesser General Public License v2.1

phobos drive del and scsi reservation #2

Open thiell opened 1 year ago

thiell commented 1 year ago

Really minor, but reporting it so we don't forget: our LTO-9 drives are accessible from multiple hosts. When we delete a drive with phobos drive del ... on one host and add it on another with phobos drive add ..., the drive does not work on the new host and LTFS complains about an existing SCSI reservation.

When phobosd tries to use the drive from the other server, we see errors like these:

Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853830000 <VERBOSE> fdcb LTFS30250I Opened the SCSI tape device 1.0.2.0 (/dev/sg4).
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853838000 <VERBOSE> fdcb LTFS30207I Vendor ID is IBM     .
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853843000 <VERBOSE> fdcb LTFS30208I Product ID is ULTRIUM-TD9     .
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853849000 <VERBOSE> fdcb LTFS30214I Firmware revision is Q3F4.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30250I Opened the SCSI tape device 1.0.2.0 (/dev/sg4).
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853855000 <VERBOSE> fdcb LTFS30215I Drive serial is 10210057FB.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30207I Vendor ID is IBM     .
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30208I Product ID is ULTRIUM-TD9     .
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.853915000 <VERBOSE> fdcb LTFS30285I The reserved buffer size of /dev/sg4 is 1048576.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30214I Firmware revision is Q3F4.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30215I Drive serial is 10210057FB.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30285I The reserved buffer size of /dev/sg4 is 1048576.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30205I RSOC (0xa3) returns -20601.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.854879000 <VERBOSE> fdcb LTFS30205I RSOC (0xa3) returns -20601.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.854892000 <VERBOSE> fdcb LTFS30263I RSOC returns Not Ready to Ready Transition, Medium May Have Changed (-20601) /dev/sg4.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.854901000 <VERBOSE> fdcb LTFS30262I Forcing drive dump.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.854906000 <VERBOSE> fdcb LTFS39802W Unknown SCSI OP code 0x1d, use default timeout.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30263I RSOC returns Not Ready to Ready Transition, Medium May Have Changed (-20601) /dev/sg4.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30262I Forcing drive dump.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS39802W Unknown SCSI OP code 0x1d, use default timeout.
Nov 09 21:41:35 elm-ent-dm02 kernel: st 1:0:2:0: Mode parameters changed
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30205I FORCE_DUMP (0x1d) returns -20604.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.860636000 <VERBOSE> fdcb LTFS30205I FORCE_DUMP (0x1d) returns -20604.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.860655000 <VERBOSE> fdcb LTFS30263I FORCE_DUMP returns Mode Parameters Changed (-20604) /dev/sg4.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.860662000 <VERBOSE> fdcb LTFS30262I Forcing drive dump.
Nov 09 21:41:35 elm-ent-dm02 phobosd[64958]: 2023-11-09 21:41:35.860668000 <VERBOSE> fdcb LTFS39802W Unknown SCSI OP code 0x1d, use default timeout.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30263I FORCE_DUMP returns Mode Parameters Changed (-20604) /dev/sg4.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30262I Forcing drive dump.
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS39802W Unknown SCSI OP code 0x1d, use default timeout.
Nov 09 21:41:35 elm-ent-dm02 kernel: st 1:0:2:0: reservation conflict
Nov 09 21:41:35 elm-ent-dm02 ltfs[64971]: fdcb LTFS30205I FORCE_DUMP (0x1d) returns -21719.

Especially this one I guess:

Nov 09 21:41:35 elm-ent-dm02 kernel: st 1:0:2:0: reservation conflict
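A minimal sketch for spotting this message automatically, assuming the kernel log lines keep the `st H:C:T:L: reservation conflict` shape seen in the excerpt above. The function name `find_reservation_conflicts` is hypothetical and not part of phobos or LTFS.

```python
import re

# Matches kernel lines such as:
#   "... kernel: st 1:0:2:0: reservation conflict"
# The pattern is an assumption based on the log excerpt in this issue.
_CONFLICT_RE = re.compile(r"kernel: st (\d+:\d+:\d+:\d+): reservation conflict")


def find_reservation_conflicts(lines):
    """Return the SCSI addresses (H:C:T:L) that reported a reservation conflict.

    Each address is listed once, in order of first appearance.
    """
    addresses = []
    for line in lines:
        match = _CONFLICT_RE.search(line)
        if match and match.group(1) not in addresses:
            addresses.append(match.group(1))
    return addresses


if __name__ == "__main__":
    sample = [
        "Nov 09 21:41:35 elm-ent-dm02 kernel: st 1:0:2:0: Mode parameters changed",
        "Nov 09 21:41:35 elm-ent-dm02 kernel: st 1:0:2:0: reservation conflict",
    ]
    print(find_reservation_conflicts(sample))
```

The same function could be fed the output of the system journal for the kernel; how to collect those lines is left out on purpose.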

A solution is to release the SCSI reservation on the original server with the following command:

# ltfs -o release_device -o devname=/dev/sg4 
126e LTFS14000I LTFS starting, LTFS version 2.4.5.1 (Prelim), log level 2.
126e LTFS14058I LTFS Format Specification version 2.4.0.
126e LTFS14104I Launched by "ltfs -o release_device -o devname=/dev/sg4".
126e LTFS14105I This binary is built for Linux (x86_64).
126e LTFS14106I GCC version is 11.3.1 20221121 (Red Hat 11.3.1-4).
126e LTFS17087I Kernel version: Linux version 5.14.0-284.25.1.el9_2.x86_64 (mockbuild@iad1-prod-build001.bld.equ.rockylinux.org) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Wed Aug 2 14:53:30 UTC 2023 i386.
126e LTFS17089I Distribution: Rocky Linux release 9.2 (Blue Onyx).
126e LTFS17089I Distribution: NAME="Rocky Linux".
126e LTFS17089I Distribution: Rocky Linux release 9.2 (Blue Onyx).
126e LTFS17089I Distribution: Rocky Linux release 9.2 (Blue Onyx).
126e LTFS14063I Sync type is "time", Sync time is 300 sec.
126e LTFS17085I Plugin: Loading "sg" tape backend.
126e LTFS17085I Plugin: Loading "unified" iosched backend.
126e LTFS14095I Set the tape device write-anywhere mode to avoid cartridge ejection.
126e LTFS30209I Opening a device through sg-ibmtape driver (/dev/sg4).
126e LTFS30250I Opened the SCSI tape device 1.0.2.0 (/dev/sg4).
126e LTFS30207I Vendor ID is IBM     .
126e LTFS30208I Product ID is ULTRIUM-TD9     .
126e LTFS30214I Firmware revision is Q3F4.
126e LTFS30215I Drive serial is 10210057FB.
126e LTFS30285I The reserved buffer size of /dev/sg4 is 1048576.
126e LTFS30294I Setting up timeout values from RSOC.
126e LTFS17160I Maximum device block size is 1048576.
126e LTFS12022I Unloading medium.
126e LTFS30252I Logical block protection is disabled.

After that, the drive can be used from the other server by phobos.
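For scripting the workaround, here is a small sketch that wraps the exact command shown above. The helper names `build_release_cmd` and `release_device` and the 120-second timeout are assumptions made for illustration; only the `ltfs -o release_device -o devname=...` invocation comes from this report.

```python
import subprocess


def build_release_cmd(sg_device):
    """Build the ltfs command line quoted above for one SCSI generic device."""
    if not sg_device.startswith("/dev/"):
        raise ValueError("expected a device path such as /dev/sg4")
    return ["ltfs", "-o", "release_device", "-o", "devname=%s" % sg_device]


def release_device(sg_device, timeout=120):
    """Run the release command; return (exit status, combined output).

    The timeout is an arbitrary value chosen for this sketch.
    """
    result = subprocess.run(
        build_release_cmd(sg_device),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # ltfs logs its messages on stderr
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Only prints the command; call release_device() on the host that
    # currently holds the reservation.
    print(" ".join(build_release_cmd("/dev/sg4")))
```

This must run on the server that holds the reservation (the original one), not on the server that sees the conflict.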

Perhaps phobos drive del could do that automatically? Otherwise, a note in the documentation would make this less confusing.

SebaGougeaud commented 1 year ago

Hi @thiell, for now, phobos drive del only deals with the database. We may add this information to the documentation. We are currently thinking of adding a drive_release-like feature in version 2.1, which is planned for June 2024.

thiell commented 7 months ago

@SebaGougeaud We now think that when phobosd stops, the daemon should release its drives; otherwise multiple phobos instances cannot properly recover without a sysadmin stepping in to release the drives. Imagine a first data mover, dm01, running phobosd, that we stop for maintenance with tapes still mounted in its drives. If the daemon does not release the drives when stopping, the other data movers (for example dm[02-03]) will fail when they try to grab the tapes previously mounted by dm01, and that marks both those tapes and the drives on dm[02-03] as failed. Please let me know if there is a case where the daemon should not release its own drives when stopping... thanks!

patlucas commented 6 months ago

@thiell What do you mean by the daemon should "release" its drives? Do you mean removing any phobos DSS lock? Or do you mean unmounting and unloading any tape from any of its drives? Or anything else?

thiell commented 6 months ago

@patlucas: Good question indeed. I mean both the phobos DSS lock (the lock remaining in the lock table after phobosd has been stopped) and the LTFS device reservation, which can be released with ltfs -o release_device. That way, after phobosd has been stopped, the cartridge (still in the drive) can be taken over by another data mover / phobosd instance. Otherwise, this leads to a deadlock situation. I will try to provide relevant logs with the new phobos version (based on current master), but I have some compatibility issues with lhsmtool_phobos / coordinatool right now and can't make it work yet.

patlucas commented 6 months ago

As already said, we plan to add an admin command "phobos drive release" to manage the ltfs device reservation. This feature is planned in the phobos 3.0 milestone. We are currently finishing phobos 2.0.

Migrating a drive needs an admin command because drives are currently dedicated to a node, and this is registered in the DSS.

Migration of a drive from one node to another will be redesigned and handled through admin commands in phobos 3.0.

thiell commented 6 months ago

@patlucas OK, no problem for the drives and phobos 3.0, but would you also release the ltfs device reservation when the phobosd daemon stops? For now, we can add an ExecStopPost that always releases the ltfs device reservation (otherwise, the tape in the drive cannot be reclaimed by another phobosd).

@patlucas What about the DSS lock release when phobosd is stopped?

For example here we stopped phobosd on elm-ent-dm01 (this is with 1.95.1 not master):

May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995937000 <ERROR> Media '054840L9' is locked by (hostname: elm-ent-dm01, owner: 3688211): Operation already in progress (114)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995954000 <ERROR> Device '/dev/sg5' (S/N '10230057FB') is owned by host elm-ent-dm02 but contains medium '054840L9' which is locked by an other hostname elm-ent-dm01: Operation already in progress (114)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995961000 <ERROR> Fail to init device '/dev/sg5', stopping corresponding device thread: Operation already in progress (114)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995980000 <ERROR> setting medium '054840L9' to failed
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998588000 <ERROR> Request failed: PHLK2: Permission denied (13)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998594000 <ERROR> Error when releasing medium '054840L9' with current lock (hostname elm-ent-dm01, owner 3688211): Permission denied (13)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998597000 <ERROR> Error when releasing medium 054840L9 after setting it to status failed: Permission denied (13)
May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998599000 <ERROR> setting device '10230057FB' to failed

patlucas commented 6 months ago

We will indeed try to release the ltfs reservation through a phobos admin command and clean DSS locks.

thiell commented 6 months ago

Awesome, thanks @patlucas, I appreciate your quick answers!

courrierg commented 6 months ago

Quoting @thiell: "What about the DSS lock release when phobosd is stopped?" (together with the elm-ent-dm02 error log above, taken after stopping phobosd on elm-ent-dm01 with 1.95.1).

phobosd should not leave DSS locks on the media it uses unless an error occurred. It would be interesting to see the logs of phobosd when it stops. Either you have an error message that indicates that phobosd did not release the lock or there is a bug.
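When going through phobosd logs for this, a short filter can pull out the media that are locked by another host. This is only a sketch: the regular expression is derived from the single `Media '...' is locked by (hostname: ..., owner: ...)` error line quoted earlier in this thread, and the function name `foreign_media_locks` is hypothetical.

```python
import re

# Pattern derived from the phobosd error line quoted in this thread:
#   Media '054840L9' is locked by (hostname: elm-ent-dm01, owner: 3688211): ...
_LOCK_RE = re.compile(
    r"Media '([^']+)' is locked by \(hostname: ([^,]+), owner: (\d+)\)"
)


def foreign_media_locks(lines, local_host):
    """Return (medium, lock_host, owner) tuples for locks held by other hosts."""
    locks = []
    for line in lines:
        match = _LOCK_RE.search(line)
        if not match:
            continue
        medium, lock_host, owner = match.group(1), match.group(2), int(match.group(3))
        if lock_host != local_host:
            locks.append((medium, lock_host, owner))
    return locks


if __name__ == "__main__":
    sample = [
        "May  1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995937000 "
        "<ERROR> Media '054840L9' is locked by (hostname: elm-ent-dm01, "
        "owner: 3688211): Operation already in progress (114)",
    ]
    for medium, host, owner in foreign_media_locks(sample, "elm-ent-dm02"):
        print("%s is locked by %s (owner %d)" % (medium, host, owner))
```

The message wording may differ between phobos versions, so the pattern should be checked against the logs of the version in use.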

That part of the code was refactored recently, and master is in a relatively unstable state right now. The patches that should fix the bugs are partially integrated, and the remainder will follow soon. Hopefully, everything will be pushed to master by the end of the day. (A new health feature, configurable through the max_health parameter, comes with it.)

GauthierEvd commented 1 month ago

We have tested on master and noticed that when phobosd is stopped, all the DSS locks are released. All mounted tapes are also unmounted with ltfs umount, which releases the SCSI reservation on the drive. However, the tapes remain loaded in the drives.

However, we have seen that when phobosd crashes, all the DSS locks and all the SCSI reservations are still present. We also think the same problem could occur when unloading a drive fails.

We can add a phobos admin command that releases the SCSI reservation of a drive. Before using this command, the admin has to answer several questions: Is the tape still mounted? Are the DSS locks on the tape and drive still present? What is the status of the drive and tape?
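The first of these questions can be checked from the mount table. The sketch below assumes that an LTFS FUSE mount appears in /proc/mounts with a source field of the form `ltfs:<device>`; that naming, the function name `ltfs_mounts`, and the mount point in the sample are assumptions to verify on a real data mover.

```python
def ltfs_mounts(mounts_text):
    """Map device -> mount point for entries whose source looks like 'ltfs:<device>'.

    `mounts_text` is the content of /proc/mounts (or text in the same format).
    """
    prefix = "ltfs:"
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: source, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[0].startswith(prefix):
            found[fields[0][len(prefix):]] = fields[1]
    return found


if __name__ == "__main__":
    # Hypothetical sample; on a real host read /proc/mounts instead.
    sample = (
        "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
        "ltfs:/dev/sg4 /mnt/phobos-sg4 fuse rw,nosuid,nodev,relatime 0 0\n"
    )
    mounted = ltfs_mounts(sample)
    print("still mounted" if "/dev/sg4" in mounted else "not mounted")
```

The DSS lock and status questions still have to be answered through phobos itself; this only covers the local mount state.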

thiell commented 1 month ago

Hi @GauthierEvd,

To mitigate the SCSI reservation issues (when phobosd crashes, for example), we added a script that releases all SCSI reservations when a data mover boots. That way, we can tell the sysadmins that if a data mover is rebooted, phobosd should start without problems. Note that we do not move drives between data movers, so we know a data mover will only ever release its own drives. The script might be specific to our configuration, as we use SAS tape drives, but it shows how we mitigated the problem. I also added the script as ExecStopPost for normal phobosd shutdown, just in case (I've seen cases where ltfs was still mounted after phobosd stopped, or maybe timed out). Not super elegant, but this method has worked okay for us so far when rebooting a data mover or restarting phobosd. I'm sure we can do better though.

[root@elm-ent-dm01 ~]# cat /usr/lib/systemd/system/phobosd.service.d/override.conf 
[Unit]
After=phobos_release_device_local.service
[Service]
LimitNOFILE=262144
EnvironmentFile=-/etc/sysconfig/phobos_release_device_local
ExecStopPost=/usr/bin/phobos_release_device_local.py
# workaround: increase start timeout due to TLC serializing all requests
TimeoutStartSec=3600
TimeoutStopSec=900
[root@elm-ent-dm01 ~]# cat /etc/systemd/system/phobos_release_device_local.service
[Unit]
Description=Phobos Device Release
After=network-online.target

[Service]
Type=oneshot
EnvironmentFile=-/etc/sysconfig/phobos_release_device_local
ExecStart=/usr/bin/phobos_release_device_local.py 

[Install]
WantedBy=multi-user.target
[root@elm-ent-dm01 ~]# cat /etc/sysconfig/phobos_release_device_local
PHOBOS_DB_HOST="10.4.0.132"
PHOBOS_DB_PORT=5432
PHOBOS_DB_NAME="phobos"
PHOBOS_DB_USER="phobos"
PHOBOS_DB_PASS="<redacted>"

#!/usr/bin/python3
# Stanford Research Computing - Elm storage system
# Written by Stephane Thiell <sthiell@stanford.edu>
#
# Make sure to release local devices before we start Phobos

import argparse
import logging
import os
import os.path
import psycopg2
import socket

from ClusterShell.Event import EventHandler
from ClusterShell.Task import task_self
from sasutils.sas import SASTapeDevice
from sasutils.sysfs import sysfs

# PostgreSQL Phobos DB
DB_HOST=os.environ["PHOBOS_DB_HOST"]
DB_PORT=os.environ["PHOBOS_DB_PORT"]
DB_NAME=os.environ["PHOBOS_DB_NAME"]
DB_USER=os.environ["PHOBOS_DB_USER"]
DB_PASS=os.environ["PHOBOS_DB_PASS"]

############# End of phobos DB config #############

HOSTNAME = socket.gethostname().split('.')[0]

def db_connect():
    return psycopg2.connect(host=DB_HOST,
                            port=DB_PORT,
                            dbname=DB_NAME,
                            user=DB_USER,
                            password=DB_PASS)

def phobos_drive_list():
    drivelist = None
    conn = db_connect()
    try:
        cur = conn.cursor()
        try:
            # Let psycopg2 bind the hostname instead of formatting it into the SQL string
            cur.execute("select id, path from device where family='tape' and host=%s;", (HOSTNAME,))
            drivelist = cur.fetchall()
        except psycopg2.Error as err:
            logging.error(err)
        finally:
            cur.close()
    finally:
        conn.close()
    return drivelist

class LTFSHandler(EventHandler):

    def __init__(self, num_devices):
        EventHandler.__init__(self)
        self.done = 0
        self.num_devices = num_devices
        self._promptfmt = '[%d/%d] '

    @property
    def prompt(self):
        return self._promptfmt % (self.done, self.num_devices)

    def ev_read(self, worker, node, sname, msg):
        print("%s%s: %s" % (self.prompt, node, msg.decode()))

    def ev_hup(self, worker, node, rc):
        self.done += 1
        if rc > 1:
            print("%s%s: returned with error code \033[91m%s\033[0m" % (self.prompt, node, rc))
        else:
            print("%s%s: returned with error code %s" % (self.prompt, node, rc))

def _init_argparser():
    parser = argparse.ArgumentParser()
    return parser.parse_args()

def main():
    """Release the LTFS device reservation of every local Phobos tape drive."""
    pargs = _init_argparser()
    drivelist = phobos_drive_list()
    num_devices = 0
    if drivelist:
        print("Found %d drives:" % len(drivelist))
        for driveid, drivepath in drivelist:
            print("  %10s at %10s [%s]" % (driveid, drivepath, "OK" if os.path.exists(drivepath) else "PATH NOT FOUND"))
            num_devices += 1

        task = task_self()
        task.set_default("stderr", False) # merge stdout and stderr as ltfs outputs to stderr
        task.set_default("stdout_msgtree", False)
        task.set_default("stderr_msgtree", False)
        eh = LTFSHandler(num_devices)

        for driveid, drivepath in drivelist:
            if os.path.exists(drivepath):
                realst = os.path.basename(os.path.realpath(drivepath))
                tapedev = SASTapeDevice(sysfs.node('class').node('scsi_tape') \
                                        .node(realst).node('device'))

                sg_name = tapedev.scsi_device.scsi_generic.sg_name
                task.shell("ltfs -o devname=/dev/%s -o release_device" % sg_name,
                           key="%s(%s)" % (driveid, sg_name),
                           handler=eh)
        task.run()
    else:
        print("No drives found! Aborting.")

if __name__ == '__main__':
    main()

patlucas commented 1 month ago

Thanks Stéphane for all these details.

We are adding a "phobos drive release" command to the phobos cli to execute the "ltfs -o release_device" action. The corresponding patch is currently in review.

When we try to reproduce your problem, we see that when phobosd crashes (without stopping correctly and without unmounting its loaded tapes), we need to manually remove the "ltfs lock" to allow another host to use the drive. But we have not yet tested whether this ltfs lock also prevents phobosd on the same host from restarting and reusing this drive. We will test it. If that is the case, we will see whether we need to integrate this "ltfs release lock" into the startup of the phobosd daemon (as we already handle existing phobos locks in the DSS when phobosd starts).

One last detail: in your service script, do not hesitate to use the integrated phobos commands as much as possible instead of querying the phobos DSS directly. For example, to list the existing drives, you can use the "phobos drive list" command instead of querying the DSS.
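As a rough sketch of that suggestion, the script's database query could be replaced by a call to the CLI. This assumes that `phobos drive list`, run without options, prints one drive identifier per line; whether the listing is limited to the local host and which options expose the device path are not confirmed here and should be checked against the phobos CLI help. The name `list_drives` is hypothetical.

```python
import subprocess


def list_drives(run=subprocess.run):
    """Return the identifiers printed by `phobos drive list`, one per line.

    `run` is injectable so the parsing can be exercised without phobos installed.
    """
    result = run(
        ["phobos", "drive", "list"],
        stdout=subprocess.PIPE,
        text=True,
        check=True,  # raise CalledProcessError if the CLI fails
    )
    # Assumption: plain text output, one drive per non-empty line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for drive in list_drives():
        print(drive)
```

Going through the CLI keeps the script independent of the DSS schema and removes the need for database credentials in /etc/sysconfig.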