phobos-storage / phobos

This repository holds the source code for Phobos, a Parallel Heterogeneous Object Store.
GNU Lesser General Public License v2.1

drive lock: Request failed: PHLK1: No locks available (37) #16

Open thiell opened 2 weeks ago

thiell commented 2 weeks ago

With the latest master branch, I wanted to run ltfsck on a tape that had failed:

[root@elm-ent-dm01 ~]# phobos tape list -o all | grep fail
|           1 | failed       | True            | tape     | True         | legacy    | None            |            0 | 0                          | LTO9    | 058164L9 | True         | ['mr', 'p-test', 'n0']  | LTFS      | used        | 058164L9   |          53806 | 156.3GB               | 193.7GB               | 15.0TB                |               0 |                 0 |                 0 |
[root@elm-ent-dm01 ~]# phobos lib scan | grep 058164L9
drive: address=0x103 source_address=0x1236 device_id='10140057FB' volume='058164L9' accessible full
[root@elm-ent-dm01 ~]# phobos drive status
|   address | currently_dedicated_to   | device    | media    | mount_path       | name                            | ongoing_io   | serial     |
|-----------|--------------------------|-----------|----------|------------------|---------------------------------|--------------|------------|
|         0 | W                        | /dev/sg2  | 057655L9 | /mnt/phobos-sg2  | /dev/tape/by-id/scsi-10110057FB | True         | 10110057FB |
|         1 | W                        | /dev/sg12 | 058160L9 | /mnt/phobos-sg12 | /dev/tape/by-id/scsi-10120057FB | True         | 10120057FB |
|         2 |                          | /dev/sg1  | 057629L9 |                  | /dev/tape/by-id/scsi-10130057FB | False        | 10130057FB |
|         3 | W                        | /dev/sg11 | 058164L9 |                  | /dev/tape/by-id/scsi-10140057FB | False        | 10140057FB |

But when I locked the corresponding drive:

[root@elm-ent-dm01 ~]# phobos drive lock 10140057FB
2024-09-14 12:17:18,195 <WARNING> Device (path: '/dev/tape/by-id/scsi-10140057FB', name: '10140057FB') is in use. Administrative locking will not be effective immediately
2024-09-14 12:17:18,196 <INFO> 1 device(s) locked

I saw these phobosd errors:

Sep 14 12:17:18 elm-ent-dm01 phobosd[3197]: 2024-09-14 12:17:18.196968000 <ERROR> Request failed: PHLK1: No locks available (37)
Sep 14 12:17:18 elm-ent-dm01 phobosd[3197]: 2024-09-14 12:17:18.196996000 <ERROR> Error when releasing medium (family 'tape', name '058164L9', library 'legacy') with current lock (hostname (null), owner 0): No locks available (37)
Sep 14 12:17:18 elm-ent-dm01 phobosd[3197]: 2024-09-14 12:17:18.197000000 <ERROR> unable to release DSS lock of medium (family 'tape', name '058164L9', library 'legacy') in device (family 'tape', name '10140057FB', library 'legacy') during regular exit: No locks available (37)
Sep 14 12:17:18 elm-ent-dm01 phobosd[3197]: 2024-09-14 12:17:18.197003000 <ERROR> failed to cleanup stopping device (family 'tape', name '10140057FB', library 'legacy'): No locks available (37)
Sep 14 12:17:18 elm-ent-dm01 phobosd[3197]: 2024-09-14 12:17:18.197306000 <ERROR> device thread (family 'tape', name '10140057FB', library 'legacy') terminated with error: No locks available (37)

However, locking seems to have worked:

[root@elm-ent-dm01 ~]# phobos drive list -o all 10140057FB
| adm_status   | family   | host         | library   | lock_hostname   |   lock_owner | lock_ts                    | model       | name       | path                            |
|--------------|----------|--------------|-----------|-----------------|--------------|----------------------------|-------------|------------|---------------------------------|
| locked       | tape     | elm-ent-dm01 | legacy    | elm-ent-dm01    |         3197 | 2024-09-14 11:59:43.443118 | ULTRIUM-TD9 | 10140057FB | /dev/tape/by-id/scsi-10140057FB |

It's a bit unclear to me why phobosd complained.

courrierg commented 1 week ago

Hello @thiell,

Phobosd is complaining here because each active drive should have a lock in the lock table of the DSS. This lock is different from the adm_status value locked: it is a concurrency lock used to prevent concurrent access to resources. Phobosd is complaining that the drive has no such lock, which is not normal and is most likely a bug. In your case, since you ran phobos drive lock, phobosd was trying to remove this drive from its list, and one of the steps to do this is to remove the DSS lock, which it failed to do. This is not an issue for you since you locked the drive, so everything should be fine. But there is definitely a bug that we need to investigate. Was this on the latest master branch?

One thing you could do is check that all the tapes and drives that are in use do have a lock in the lock table. This is especially important for tapes; otherwise several phobosd instances might want to use the same tape. Unfortunately, there is no lock list command for now, so you have to run an SQL query to list the locks. select * from lock; should do the trick.
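
For illustration only, the cross-check could be sketched in Python as below. This is not part of Phobos. It assumes the rows returned by select * from lock; have already been exported as dictionaries, and that each row carries a resource type and an identifier. The column names type and id, the helper name find_unlocked, and the type values used in the example are all assumptions; adjust them to the actual schema.

```python
from typing import Iterable, List, Mapping, Tuple

Resource = Tuple[str, str]  # (resource type, resource name)


def find_unlocked(
    in_use: Iterable[Resource],
    lock_rows: Iterable[Mapping[str, str]],
    type_col: str = "type",  # assumed column name, not confirmed by the thread
    id_col: str = "id",      # assumed column name, not confirmed by the thread
) -> List[Resource]:
    """Return the in-use resources that have no matching row in the lock table."""
    locked = {(row[type_col], row[id_col]) for row in lock_rows}
    return [res for res in in_use if res not in locked]


# Made-up sample: one drive and one tape in use, but only the drive is locked.
# The type values "device" and "media" are hypothetical.
sample_in_use = [("device", "10110057FB"), ("media", "057655L9")]
sample_locks = [{"type": "device", "id": "10110057FB"}]
missing = find_unlocked(sample_in_use, sample_locks)
```

Any resource returned by find_unlocked would be worth a closer look, tapes first.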

One possible cause for the issue might be a double unlock, which we have seen in the past. To check this, you can run phobosd at the debug level; you should then see logs for all the lock and unlock operations. They are prefixed with lock: or unlock: for easier grep. If you see this error each time you run phobos drive unlock, that might indicate that this is in fact a double unlock.
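
As a rough aid for reading those debug logs, a small Python filter could flag unlock lines that have no earlier matching lock line. This is only a sketch: the thread confirms the lock: and unlock: prefixes, but the text that follows them is not shown here, so the code treats everything after the prefix as an opaque resource key. That choice, and the helper name find_unmatched_unlocks, are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List

# "unlock" is listed first so it is tried before "lock"; the \b keeps the
# "lock" alternative from matching inside the word "unlock".
LOCK_RE = re.compile(r"\b(unlock|lock): *(.+)$")


def find_unmatched_unlocks(lines: Iterable[str]) -> List[str]:
    """Return log lines whose unlock has no earlier, still-held lock."""
    held = Counter()  # resource key -> number of locks currently held
    unmatched = []
    for raw in lines:
        line = raw.rstrip()
        match = LOCK_RE.search(line)
        if match is None:
            continue
        op, resource = match.group(1), match.group(2).strip()
        if op == "lock":
            held[resource] += 1
        elif held[resource] > 0:
            held[resource] -= 1
        else:
            # Unlock with nothing held: a candidate double unlock.
            unmatched.append(line)
    return unmatched
```

A lock taken before the captured log begins would also show up as unmatched, so it is best to capture from phobosd startup. The function accepts any iterable of lines, for example an open log file.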

thiell commented 1 week ago

Hello @courrierg!

Thanks! Yes, it was with the latest master branch. I don't lock drives very often but I will report back if it happens every time.