mschout / perl-log-dispatch-filerotate

Log to files that archive/rotate themselves
2 stars 8 forks source link

Related to CPAN Bug #77075 #7

Closed targzeta closed 7 years ago

targzeta commented 7 years ago

Hi Micheal, I think I found a bug that can be related to the bug #77075 on CPAN.

If we have many little independent processes, when the first proccess exits the .LCK file is deleted by the DESTROY subroutine. Now, if a new process start, it acquire a lock on a new .LCK file, but what if we has a process locked on the old .LCK file? Both process are now in the critical area.

We can solve this issue simply avoiding to unlink the .LCK file.

What do you think about? Regards, Emanuele

mschout commented 7 years ago

Thanks Emanuele

That sounds entirely plausible and would probably be a good idea.

Another solution might be to do reimplement something like BSD's flopen() when opening the .LCK file. This handles a whole bunch of race conditions.

See https://github.com/freebsd/freebsd/blob/master/lib/libutil/flopen.c for example.

There are a lot of things that can go wrong that could lead to situations described in #77075 and this could certainly be one of them.

I'm sure there are other cases causing it as well. For instance, there is little or no checking done on the result of flock() in some places in this module. E.g.: in lock() there is no checking at all on the result of flock() (!). On a busy system, flock() might also be interrupted by syscall (errno = EINTR). lock() in this module will return as if the lock was acquired. These calls probably need to be rewritten to something like:

while (!flock($fh, LOCK_EX) { 
    unless ($! == EINTR) {
        # warn or something because flock failed
    }
}
targzeta commented 7 years ago

Micheal, the following snippet of code show what I meant:

use strict;
use warnings;

use Log::Dispatch::FileRotate;

my $file = Log::Dispatch::FileRotate
  ->new(
        filename  => 'test.log',
        min_level => 'info',
        DEBUG => 1,
       );

my $child1_pid = fork();
if ($child1_pid == 0) {
  $file->log(level => 'info', message => "first_child\n");
} else {
  sleep(1);
  my $child2_pid = fork();
  if ($child2_pid == 0) {
    $file->log(level => 'info', message => "second_child\n");
  } else {
    waitpid($child1_pid, 0);
    $file->log(level => 'info', message => "parent\n");
  }
}

warn "$$ exiting";

I have also adding a sleep(5) before $self->logit at line 384 of FileRotate.pm.

This is the output:

Lock file is ./.test.log.LCK time_to_rotate(mode,rotate) => (,0) 10320 got lock on Lock File Log::Dispatch::FileRotate::LFH Wed Jul 26 23:35:13 2017 10320 s=0, i=869574, f=869574, n=test.log Wed Jul 26 23:35:13 2017 10320 in time mode: ; time to rotate: 0; rotate by size: 0; check_both: 0; have to rotate: 0 time_to_rotate(mode,rotate) => (,0) 10321 waiting on lock Wed Jul 26 23:35:18 2017 10320 Locked Wed Jul 26 23:35:18 2017 10320 unLocked 10320 exiting at test.pl line 26. 10321 got lock after wait Wed Jul 26 23:35:18 2017 10321 s=12, i=869574, f=869574, n=test.log Wed Jul 26 23:35:18 2017 10321 in time mode: ; time to rotate: 0; rotate by size: 0; check_both: 0; have to rotate: 0 time_to_rotate(mode,rotate) => (,0) 10319 got lock on Lock File Log::Dispatch::FileRotate::LFH Wed Jul 26 23:35:18 2017 10319 s=12, i=869574, f=869574, n=test.log Wed Jul 26 23:35:18 2017 10319 in time mode: ; time to rotate: 0; rotate by size: 0; check_both: 0; have to rotate: 0 Wed Jul 26 23:35:23 2017 10321 Locked Wed Jul 26 23:35:23 2017 10321 unLocked 10321 exiting at test.pl line 26. Wed Jul 26 23:35:23 2017 10319 Locked Wed Jul 26 23:35:23 2017 10319 unLocked 10319 exiting at test.pl line 26.

As you can see, the first child (10320) grab the lock and the second child (10321) have to wait. After 5 second (because my sleep) the first child release the lock and delete the .LCK file. Now the second child can acquire the lock "after wait" but the parent (10319), at the same time, grab the lock on the new .LCK file.

Removing the unlink() on the DESTROY subroutine solves the problem:

... 12147 got lock on Lock File *Log::Dispatch::FileRotate::LFH ... 12148 got lock after wait ... 12146 got lock after wait

Can I send you a pull request to address this bug?

Best regards, Emanuele

mschout commented 7 years ago

Yes, I see. No need to send a PR for that since its such a trivial change. I'll apply this fix today thanks.

mschout commented 7 years ago

I've fixed this in v1.28, and included an adapted version of the above test case as an "author" test