Open thiell opened 7 months ago
Hi @thiell,
For now, the phobos drive del command only deals with the database. We may add this information to the documentation.
We are currently thinking of adding a drive_release-like feature for the 2.1 version, which is planned for June 2024.
@SebaGougeaud We now think that when stopping phobosd, the daemon should release its drives; otherwise, multiple phobos instances cannot properly recover without a sysadmin stepping in to release the drives. Imagine a scenario with a first data mover, dm01, running phobosd, which we stop for maintenance with tapes still mounted in its drives. If the daemon does not release the drives when stopping, the other data movers (for example dm[02-03]) will fail when trying to grab the tapes previously mounted by dm01, and both those tapes and the drives on dm[02-03] will be set to failed.
Please let me know if there is a case where the daemon should not release its own drives when stopping... thanks!
@thiell What do you mean by the daemon should "release" its drives? Do you mean removing any phobos DSS lock? Or do you mean unmounting and unloading any tape from any of its drives? Or anything else?
@patlucas: Good question indeed, I mean both the phobos DSS lock (the lock remaining in the lock table after phobosd has been stopped) and also the LTFS device reservation, which can be released with ltfs -o release_device. That way, after phobosd has been stopped, the cartridge (still in the drive) can be taken over by another data mover / phobosd instance. Otherwise, this leads to a deadlock situation.
I will try to provide relevant logs with the new phobos version (based on current master), but I have some compatibility issues with lhsmtool_phobos / coordinatool right now and can't make it work yet.
As already said, we plan to add an admin command "phobos drive release" to manage the ltfs device reservation. This feature is planned in the phobos 3.0 milestone. We are currently finishing phobos 2.0.
Migration of a drive needs an admin command because drives are currently dedicated to a node and this is registered in the DSS.
Migration of a drive from one node to another will be redesigned and handled through admin commands in phobos 3.0.
@patlucas ok no problem for the drives and phobos 3.0, but would you also be releasing the ltfs device reservation when the phobosd daemon stops? For now, we can add an ExecStopPost that would always release the ltfs device reservation (otherwise, the tape in the drive cannot be reclaimed by another phobosd).
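The ExecStopPost workaround mentioned here could be sketched as a small helper script. This is only a sketch under assumptions, not part of phobos: the helper itself and the device list are hypothetical, and selecting the drive with `-o devname=` is an assumption to check against your LTFS build; only `ltfs -o release_device` comes from this thread.

```python
#!/usr/bin/env python3
"""Hypothetical ExecStopPost helper: release LTFS device reservations."""
import subprocess
import sys


def build_release_cmd(device):
    """Return the ltfs command line releasing the reservation on `device`.

    Assumption: the drive is selected with `-o devname=<path>`.
    """
    return ["ltfs", "-o", "release_device", "-o", f"devname={device}"]


def release_all(devices, run=subprocess.run):
    """Try to release every device; return the devices that failed."""
    failed = []
    for dev in devices:
        try:
            result = run(build_release_cmd(dev), check=False)
            ok = result.returncode == 0
        except OSError:
            # The ltfs binary is missing or not executable.
            ok = False
        if not ok:
            failed.append(dev)
    return failed


if __name__ == "__main__":
    # Devices are passed as arguments, e.g. /dev/sg5 (hypothetical list).
    for dev in release_all(sys.argv[1:]):
        print(f"could not release reservation on {dev}", file=sys.stderr)
```

A systemd drop-in for the phobosd unit could then call it with something like `ExecStopPost=/usr/local/sbin/release-ltfs-drives.py /dev/sg5`, where both the script path and the device are placeholders. Note that this only addresses the LTFS reservation, not the DSS lock.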
@patlucas What about the DSS lock release when phobosd is stopped?
For example, here we stopped phobosd on elm-ent-dm01 (this is with 1.95.1, not master):
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995937000 <ERROR> Media '054840L9' is locked by (hostname: elm-ent-dm01, owner: 3688211): Operation already in progress (114)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995954000 <ERROR> Device '/dev/sg5' (S/N '10230057FB') is owned by host elm-ent-dm02 but contains medium '054840L9' which is locked by an other hostname elm-ent-dm01: Operation already in progress (114)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995961000 <ERROR> Fail to init device '/dev/sg5', stopping corresponding device thread: Operation already in progress (114)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995980000 <ERROR> setting medium '054840L9' to failed
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998588000 <ERROR> Request failed: PHLK2: Permission denied (13)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998594000 <ERROR> Error when releasing medium '054840L9' with current lock (hostname elm-ent-dm01, owner 3688211): Permission denied (13)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998597000 <ERROR> Error when releasing medium 054840L9 after setting it to status failed: Permission denied (13)
May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.998599000 <ERROR> setting device '10230057FB' to failed
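To spot this situation quickly across data movers, the "is locked by" errors above can be extracted from the logs. The following is a minimal sketch that assumes the message format stays exactly as in the 1.95.1 log lines shown here; the function name is made up for illustration.

```python
import re

# Matches the "Media ... is locked by ..." error from the log above.
LOCK_RE = re.compile(
    r"Media '(?P<medium>[^']+)' is locked by "
    r"\(hostname: (?P<host>[^,]+), owner: (?P<owner>\d+)\)"
)


def find_foreign_locks(lines, local_host):
    """Return sorted (medium, host, owner) tuples for locks held by other hosts."""
    found = set()
    for line in lines:
        match = LOCK_RE.search(line)
        if match and match.group("host") != local_host:
            found.add(
                (match.group("medium"), match.group("host"), int(match.group("owner")))
            )
    return sorted(found)


sample = [
    "May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995937000 "
    "<ERROR> Media '054840L9' is locked by (hostname: elm-ent-dm01, "
    "owner: 3688211): Operation already in progress (114)",
    "May 1 11:24:54 elm-ent-dm02 phobosd[5081]: 2024-05-01 11:24:54.995980000 "
    "<ERROR> setting medium '054840L9' to failed",
]

print(find_foreign_locks(sample, "elm-ent-dm02"))
# prints [('054840L9', 'elm-ent-dm01', 3688211)]
```

Each tuple names a medium whose lock still belongs to another host, which is the lock that would have to be cleaned up before the tape can be reused.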
We will indeed try to release the ltfs reservation through a phobos admin command and clean DSS locks.
Awesome, thanks @patlucas, I appreciate your quick answers!
phobosd should not leave DSS locks on the media it uses unless an error occurred. It would be interesting to see the logs of phobosd when it stops. Either you have an error message that indicates that phobosd did not release the lock or there is a bug.
There was some refactoring of that part of the code. Master is in a relatively unstable state right now. The patches that should fix the bugs are partially integrated, and the remaining ones will be soon. Hopefully, by the end of the day everything will be pushed to master. (A new health feature, configurable through the max_health parameter, is coming with it.)
Really minor, but reporting so as not to forget: our LTO-9 drives are accessible from multiple hosts, and when deleting a drive with phobos drive del ... from one host and adding it to another with phobos drive add ..., this drive won't work and LTFS complains about an existing SCSI reservation.
When phobosd tries to use the drive from the other server, we can see errors like this:
Especially this one I guess:
A solution is to release the SCSI reservation on the original server with the following command:
After that, the drive can be used from the other server by phobos.
Perhaps phobos drive del could do that automatically? Otherwise, a note about this in the documentation would make it less confusing.