lanoxx / tilda

A Gtk based drop down terminal for Linux and Unix
GNU General Public License v2.0

Race condition when starting up several instances of tilda #502

Open voidplayer opened 1 year ago

voidplayer commented 1 year ago

I'm running Debian testing and tilda 1.5.4-1.

I think this is the same as #82

I run 4 instances of tilda when starting up the PC, and every time something different happens. Sometimes it says "invalid key", but other times my left tilda has the width that the right tilda is supposed to have and vice versa, as if the configurations got mixed up.

Just going into the preferences and clicking any setting fixes everything.

Sometimes it even starts correctly, but that's rare :)

lanoxx commented 7 months ago

I agree that this can be annoying. The locking code is very old and has not been refactored in a long time. It seems the last bigger change to the locking mechanism was (as you mentioned) part of #82, when we introduced flock in commit 0a66b99dcff82cf90d7cb7dc526676cbadfc52f3. There is certainly more potential to clean it up; there are even still goto statements in the code :eyes:.

Our basic approach is to create a file named lock_0_0 that is locked exclusively with flock, synchronize on that, and then create instance-specific pid lock files for the individual processes. Multiple tilda processes first synchronize on the lock_0_0 lock file; once a process acquires the lock, it creates its own instance-specific lock file (all in ~/.cache/tilda/locks) by picking an instance number that is not yet taken, counting from 0 upwards. This way, every process can determine which instance number it should receive.
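Read as a simplified sketch (directory layout and function names here are illustrative, not tilda's actual code), the two-level scheme looks roughly like this:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>
#include <assert.h>

/* Step 1: open (or create) the global lock file and take an
 * exclusive flock on it; this serializes all starting processes. */
static int acquire_global_lock(const char *dir) {
    char path[512];
    snprintf(path, sizeof(path), "%s/lock_0_0", dir);
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {  /* blocks until we own the lock */
        close(fd);
        return -1;
    }
    return fd;
}

/* Step 2: inside the critical section, pick the lowest instance
 * number whose per-instance lock file does not exist yet. */
static int pick_free_instance(const char *dir, pid_t pid) {
    for (int instance = 0; ; instance++) {
        char path[512];
        snprintf(path, sizeof(path), "%s/lock_%d", dir, instance);
        int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd >= 0) {
            dprintf(fd, "%d\n", (int) pid); /* record our pid */
            close(fd);
            return instance;
        }
        if (errno != EEXIST)
            return -1;  /* real error, not just a taken slot */
    }
}
```

The O_EXCL creation of the per-instance file is what makes "pick the lowest free number" safe, but only while the global lock is held.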

So what could be the problem? The locking code that produces the atomic section is here: https://github.com/lanoxx/tilda/blob/master/src/tilda-lock-files.c#L61. One problem that I see is that we do not check the return value of flock, so if the call fails we go ahead and pretend that we hold the lock :roll_eyes:.

There might be other problems (such as #480).

If we keep flock, then one thing we should definitely do is check its return values. One reason flock can return early is that the process is interrupted by a signal (maybe SIGHUP or SIGUSR{1,2}, by any chance?). Another reason could be that for some reason the file descriptor is not valid.
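A minimal defensive wrapper for this could look like the following sketch (not tilda's actual code): retry on EINTR and surface every other failure to the caller instead of silently pretending the lock is held.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <assert.h>

/* Retry flock() when a signal interrupts it (EINTR) and pass every
 * other failure (e.g. EBADF for an invalid descriptor) back to the
 * caller so it can bail out instead of assuming the lock is held. */
static int flock_retry(int fd, int operation) {
    int ret;
    do {
        ret = flock(fd, operation);
    } while (ret == -1 && errno == EINTR);
    return ret;  /* 0 on success, -1 with errno set on failure */
}
```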

However, I played a bit with it, and using the following snippet I can reproduce the issue on some occasions:

```shell
(for i in $(seq 0 10); do src/tilda --version & src/tilda --version & src/tilda --version & src/tilda --version; echo "DONE"; echo ""; done; ) | less
```

Now I think the issue is that we are deleting the global lock file after leaving the critical section. This means that a newly starting tilda instance needs to recreate it and will then lock on a file that has a different inode than the original lock file. If there are three tilda processes, then I believe the following could happen:

  1. P1 creates the lock file, locks it, enters the critical section.
  2. P2 locks on the same file but blocks on the lock.
  3. P1 finishes and deletes the lock file. The lock file is now in a zombie state: it is deleted, but P2 still blocks on its inode and will enter the critical section after P1 leaves it.
  4. P3 starts. Since the file was deleted, P3 creates it again and receives a different inode. P2 is still in the critical section, but P3 can enter it too, because it locks a file with a different inode.

A possible fix might be to not delete the lock_0_0 file (see https://github.com/lanoxx/tilda/blob/master/src/tilda-lock-files.c#L70).
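If the file must remain deletable, another well-known pattern would be to verify, after acquiring the lock, that the path still names the inode we locked, and to retry otherwise. A sketch of that general technique (the helper name is made up):

```c
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>
#include <assert.h>

/* Lock `path`, then verify the path still refers to the inode we
 * locked. If another process unlinked and recreated the file in
 * between, our lock protects a stale inode, so drop it and retry. */
static int lock_file_safely(const char *path) {
    for (;;) {
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX) != 0) {
            close(fd);
            return -1;
        }
        struct stat fd_st, path_st;
        if (fstat(fd, &fd_st) == 0 && stat(path, &path_st) == 0 &&
            fd_st.st_ino == path_st.st_ino &&
            fd_st.st_dev == path_st.st_dev)
            return fd;  /* lock and path agree: we hold the real lock */
        close(fd);      /* stale inode (P3's case above): start over */
    }
}
```

This closes the window in step 4: P3 would lock the recreated file, notice nothing, but P2 (or any process that locked the unlinked inode) would detect the mismatch on its next retry rather than run the critical section concurrently, provided every participant uses the same check.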

I think the original idea was to make this more "clean" and remove the file after the lock is released, to avoid left-over lock files when no tilda process is running. Another alternative could be to try POSIX semaphores, since they do not need lock files.
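The semaphore alternative might look roughly like this (the semaphore name below is an assumption, not something tilda defines):

```c
#include <fcntl.h>
#include <semaphore.h>
#include <assert.h>

/* A named POSIX semaphore with an initial count of 1 acts as a
 * cross-process mutex. It lives in the kernel, not in ~/.cache, so
 * there is no lock-file inode that can change under us. */
static sem_t *open_startup_lock(void) {
    return sem_open("/tilda-startup-lock", O_CREAT, 0600, 1);
}
/* Usage: sem_wait() enters the critical section, sem_post() leaves it. */
```

Note the trade-off: a named semaphore persists until sem_unlink(3) is called, so the "left-over state when no tilda is running" concern moves from the filesystem into the kernel rather than disappearing.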

lanoxx commented 5 months ago

I pushed this commit today: 5ecc56105bb099134d4edf76b8987bfd2b523e37 which should improve the locking situation.

dimich-dmb commented 4 months ago

I observe a kind of race condition even with a single instance. I have tilda run by my window manager (fluxbox) only once at startup. Most of the time it works fine, but rarely (I haven't gathered statistics; subjectively 1 in 100 runs) it picks up config_1 (and creates it if absent) instead of config_0. I'm not sure it's related to this issue, but it looks like it. Maybe remove the automatic instance detection and allow the user to specify the instance id explicitly?

lanoxx commented 4 months ago

Are you running tilda 2.0.0? Or the 1.x version?

dimich-dmb commented 4 months ago

Tilda 2.0.0