chrboe opened this issue 1 year ago
Caveat: I am not the owner of targetcli-fb, just a contributor.
I wonder what parts of the actual code in LIO/targetcli-fb need to be mutually exclusive, i.e. what parts of the code screw up if we have multiple stacks running at the same time. But perhaps that is a much harder problem to solve? (I haven't looked at why a lock is needed, myself.)
With this approach, what is your plan for a user running the CLI when things are going fine and then the next thing they type fails because some other process got the lock and is taking a while? Would the first user see an error like "sorry, busy, please retry", or would the request hang until they get the lock? And what about batch mode, i.e. feeding a bunch of commands to targetcli-fb from stdin via a script? What if half the commands succeed, but then one of them fails? Should the script fail? Or should all scripts be rewritten to handle the lock being stolen out from under us?
Maybe you'd need a new global configuration variable that says "wait_if_lock_stolen"?
Also, what if we are waiting for a lock in the middle of our script, and the other process dies or core dumps and leaves the lock hanging? Normally we check for this condition at startup (I believe), but now we'd have to check each time the lock is held by somebody else?
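Something like the following is the kind of check I mean. This is only a sketch, and it assumes the lock file records the owning PID, which I haven't verified against the actual code:

```python
import errno
import os

LOCKFILE = "/var/lock/targetcli.lock"  # path taken from the issue description

def lock_owner_is_alive(lockfile=LOCKFILE):
    """Return True if the PID recorded in the lock file still refers to a live process."""
    try:
        with open(lockfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unreadable lock file: treat as stale
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except OSError as e:
        # EPERM means the process exists but belongs to another user
        return e.errno == errno.EPERM
    return True
```

Today that check (or whatever the real equivalent is) only has to happen once at startup; with per-command locking it would have to run every time acquisition would otherwise block.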
The pull request I linked contains some details about why the lock is supposedly necessary (including a reproducer for a genuine bug, so I do believe a lock of some sort makes sense).
> what is your plan for a user running the CLI when things are going fine and then the next thing they type fails because some other process got the lock and is taking a while?
I think some inspiration can be drawn from prior art here. The first example that comes to mind is `dnf`, which also acquires a global lock (but does not have an interactive CLI, I guess...). When a `dnf` process is already running and you start a new one, it will simply wait for the lock to be given up. Seems intuitive enough.
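For illustration, that "just wait until the other process is done" behavior falls out naturally of a blocking advisory file lock. A minimal sketch (plain `fcntl.flock`, not dnf's or targetcli-fb's actual implementation):

```python
import fcntl

def run_with_global_lock(action, lockfile="/var/lock/targetcli.lock"):
    """Run `action` while holding an exclusive advisory lock, waiting if needed."""
    with open(lockfile, "w") as f:
        # Without LOCK_NB this call blocks until the current holder releases
        # the lock -- the dnf-like "just wait" behavior described above.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            return action()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```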
In general, I would intuitively expect every command in "interactive mode" to behave exactly as it would if it were executed from the shell directly.
> And what about batch mode, i.e. feeding a bunch of commands to targetcli-fb from stdin via a script?
That is a more interesting problem. I guess for scripts I would expect the current behavior (i.e. take the lock when `targetcli` is invoked and only release it when the process exits).
Summing up, I would expect the following:
```
$ targetcli ls
# takes the lock when targetcli starts, releases on exit (== current behavior)

$ targetcli
# does not take the lock
/> ls
# takes the lock for the duration of the "ls" command

$ echo "ls\niscsi/ create ..." | targetcli
# takes the lock when targetcli starts, releases on exit (== current behavior)
```
So the only thing I am really proposing is that the lock should not be acquired when an interactive shell is idle.
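A rough sketch of the shape of the idea in Python (names are made up for illustration, this is not a patch against targetcli-fb):

```python
import fcntl
from contextlib import contextmanager

LOCKFILE = "/var/lock/targetcli.lock"

@contextmanager
def command_lock():
    """Hold the global lock only while a single command runs."""
    with open(LOCKFILE, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # wait here if another process is busy
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def interactive_loop(run_command):
    """Toy interactive loop: the lock is NOT held while sitting at the prompt."""
    while True:
        line = input("/> ")
        if line in ("exit", "quit"):
            break
        with command_lock():            # lock only for the duration of this command
            run_command(line)
```

Non-interactive and batch invocations would keep wrapping the whole process in the lock, exactly as today.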
Currently (as of #140), targetcli takes a global `/var/lock/targetcli.lock`. This poses a problem when combined with interactive mode because the lock is held for the entire time the targetcli shell is open.

When an administrator leaves a targetcli shell open (perhaps even unknowingly, in some SSH session or `screen` in the background), this effectively means that no targetcli operations can be completed, leading to disastrous consequences when a cluster manager such as Pacemaker is involved: in the event of a failover, the iSCSI target cannot start on the new primary node.

We have observed this behavior leading to user confusion and – in one instance – even an actual production outage.
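To make the failure mode concrete: an automated caller (say, a resource agent started by Pacemaker) that tries to take the same lock either hangs or bails out for as long as the idle shell holds it. A rough illustration, assuming a plain advisory file lock (I have not checked how the lock from #140 is actually implemented):

```python
import fcntl
import sys

# Attempt the lock the way an automated caller might: fail fast instead of
# hanging behind an idle interactive shell.
with open("/var/lock/targetcli.lock", "w") as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("lock held by another targetcli process (an idle shell?), giving up")
    # ... perform the failover-time target configuration here ...
```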
My proposal is: when running in interactive mode, only take the global lock when some action is actually running. When the shell is just sitting there idly, there is no need to take the lock.
Of course, this implies a potentially unwanted behavioral change (hence the request for comments from the maintainers): in the current version, it is guaranteed¹ that the state won't change while an interactive shell is open.
Please let me know your thoughts about this issue.
¹ As guaranteed as it can be, considering you could still manually mess around in the configfs.