bachandi closed this issue 2 years ago
I have seen this error only when the machine is running multiple instances of Monitorix, or when Monitorix takes too long to save data into the .rrd files, perhaps because these files reside on network storage.
I am pretty sure there is only one instance running and the RRD location is on a local SSD. I disabled parallelizing for the moment and will watch whether the error reappears.
My idea was a missing wait for the processes to finish, but this line should take care of all the processes: https://github.com/mikaku/Monitorix/blob/269b7ead91b338d28a0b74971f34db4eef7bd84a/monitorix.cgi#L646
Not sure what else could be the cause for this.
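For reference, a generic reaping loop of the kind that linked line implements looks roughly like this (just a sketch, not the actual Monitorix code):
# Block until every forked child has exited; waitpid returns -1
# once no children remain, which ends the loop.
while ((my $pid = waitpid(-1, 0)) > 0) {
	# child $pid finished
}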
I don't think parallelizing has problems, and I haven't seen such an error before. People who reported this message had one of the two possible causes I stated in my previous post.
Yes, and it is still happening without parallelizing, so this is not the issue. I listed the monitorix processes with ps aux and there are no other instances running. Will have to investigate further.
Are all your .rrd files in the local filesystem?
Are all your .rrd files in the local filesystem?
yes
OK.
Also, can you check if all timestamps in .rrd files are consistent?
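For example, something along these lines (a sketch using RRDs::info; last_update is the standard RRDtool field, and the directory is the one from your logs):
# Print the last_update timestamp of every .rrd file so the
# values can be compared at a glance.
use RRDs;

for my $rrd (glob "/var/lib/monitorix/*.rrd") {
	my $info = RRDs::info($rrd);
	printf "%s: last_update=%s\n", $rrd, scalar localtime($info->{last_update});
}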
I investigated with rrdinfo and they looked OK, but I doubt that is what you had in mind?
I noticed an uninitialized value error in process.pm but it is probably unrelated:
Thu Dec 9 18:24:06 2021 - ERROR: while updating /var/lib/monitorix/process.rrd: could not lock RRD
Use of uninitialized value in multiplication (*) at /usr/lib/monitorix/process.pm line 1120.
Use of uninitialized value in multiplication (*) at /usr/lib/monitorix/process.pm line 1121.
Fri Dec 10 11:20:07 2021 - ERROR: while updating /var/lib/monitorix/process.rrd: could not lock RRD
Any news on this?
It still happens on multiple systems from time to time, but I could not narrow down why. I suspect we can close this for the moment, until there is new information on it.
OK, let's close it and will reopen it if something new comes. Thanks.
Could this be related to the XXX_update and XXX_cgi routines being called by different processes? To my understanding, the update routines are called by the monitorix process, and the cgi ones by the process spawned for the browser request. If these two calls overlap by chance, the update routine could have problems locking the rrd file if it is being read at the same moment by the cgi routine?
The XXX_cgi routines only make read-only calls, so I don't think that this creates any locking problem.
I think it would have manifested before, since there is always a chance that monitorix.cgi reads data from the .rrd files at the same moment monitorix is saving data. I'm thinking of my server, which hosts the Monitorix demo and is used pretty often by people from everywhere, and I've never seen these locking errors.
I still believe these locking errors are caused by reading and writing at the same time. I tested it with a small shell script that writes to and reads from an nvme.rrd lookalike, and I ran into these locking errors as soon as reads and writes happened concurrently.
A quick bash script (test_rrd_locking.sh) which emulates reading and writing the way monitorix does, so we can test the rrd locking:
#!/usr/bin/bash
RRDFILE=$1         # path to the .rrd file to test
SLEEPTIME_IN_S=$2  # pause between iterations, in seconds
ACTION=$3          # one of: create, read, write
READ_COMMAND() {
rrdtool graphv test01z.1day.svg --title=Title --start=-1day --imgformat=SVG --vertical-label=label --width=800 --height=300 --full-size-mode --zoom=1 --slope-mode DEF:hd0_smv0=${RRDFILE}:nvme0_hd0_smv0:AVERAGE DEF:hd1_smv0=${RRDFILE}:nvme0_hd1_smv0:AVERAGE DEF:hd2_smv0=${RRDFILE}:nvme0_hd2_smv0:AVERAGE DEF:hd3_smv0=${RRDFILE}:nvme0_hd3_smv0:AVERAGE DEF:hd4_smv0=${RRDFILE}:nvme0_hd4_smv0:AVERAGE DEF:hd5_smv0=${RRDFILE}:nvme0_hd5_smv0:AVERAGE DEF:hd6_smv0=${RRDFILE}:nvme0_hd6_smv0:AVERAGE DEF:hd7_smv0=${RRDFILE}:nvme0_hd7_smv0:AVERAGE CDEF:allvalues=hd0_smv0,UN,0,1,IF,hd1_smv0,UN,0,1,IF,hd2_smv0,UN,0,1,IF,hd3_smv0,UN,0,1,IF,hd4_smv0,UN,0,1,IF,hd5_smv0,UN,0,1,IF,hd6_smv0,UN,0,1,IF,hd7_smv0,UN,0,1,IF,+,+,+,+,+,+,+,0,GT,1,UNKN,IF CDEF:trans_hd0_smv0=hd0_smv0 CDEF:trans_hd1_smv0=hd1_smv0 CDEF:trans_hd2_smv0=hd2_smv0 CDEF:trans_hd3_smv0=hd3_smv0 CDEF:trans_hd4_smv0=hd4_smv0 CDEF:trans_hd5_smv0=hd5_smv0 CDEF:trans_hd6_smv0=hd6_smv0 CDEF:trans_hd7_smv0=hd7_smv0 CDEF:wrongdata=allvalues,UN,INF,UNKN,IF LINE2:trans_hd0_smv0#FFA500:/dev/nvme0 AREA:wrongdata#FFFFFF:
}
WRITE_COMMAND() {
rrdtool update ${RRDFILE} N:38:100:0:16738106:0:20:32282126:0.0569779319774506:0:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN
}
CREATE_COMMAND() {
rrdtool create ${RRDFILE} --step=60 DS:nvme0_hd0_smv0:GAUGE:120:0:U DS:nvme0_hd0_smv1:GAUGE:120:0:U DS:nvme0_hd0_smv2:GAUGE:120:0:U DS:nvme0_hd0_smv3:GAUGE:120:0:U DS:nvme0_hd0_smv4:GAUGE:120:0:U DS:nvme0_hd0_smv5:GAUGE:120:0:U DS:nvme0_hd0_smv6:GAUGE:120:0:U DS:nvme0_hd0_smv7:GAUGE:120:0:U DS:nvme0_hd0_smv8:GAUGE:120:0:U DS:nvme0_hd1_smv0:GAUGE:120:0:U DS:nvme0_hd1_smv1:GAUGE:120:0:U DS:nvme0_hd1_smv2:GAUGE:120:0:U DS:nvme0_hd1_smv3:GAUGE:120:0:U DS:nvme0_hd1_smv4:GAUGE:120:0:U DS:nvme0_hd1_smv5:GAUGE:120:0:U DS:nvme0_hd1_smv6:GAUGE:120:0:U DS:nvme0_hd1_smv7:GAUGE:120:0:U DS:nvme0_hd1_smv8:GAUGE:120:0:U DS:nvme0_hd2_smv0:GAUGE:120:0:U DS:nvme0_hd2_smv1:GAUGE:120:0:U DS:nvme0_hd2_smv2:GAUGE:120:0:U DS:nvme0_hd2_smv3:GAUGE:120:0:U DS:nvme0_hd2_smv4:GAUGE:120:0:U DS:nvme0_hd2_smv5:GAUGE:120:0:U DS:nvme0_hd2_smv6:GAUGE:120:0:U DS:nvme0_hd2_smv7:GAUGE:120:0:U DS:nvme0_hd2_smv8:GAUGE:120:0:U DS:nvme0_hd3_smv0:GAUGE:120:0:U DS:nvme0_hd3_smv1:GAUGE:120:0:U DS:nvme0_hd3_smv2:GAUGE:120:0:U DS:nvme0_hd3_smv3:GAUGE:120:0:U DS:nvme0_hd3_smv4:GAUGE:120:0:U DS:nvme0_hd3_smv5:GAUGE:120:0:U DS:nvme0_hd3_smv6:GAUGE:120:0:U DS:nvme0_hd3_smv7:GAUGE:120:0:U DS:nvme0_hd3_smv8:GAUGE:120:0:U DS:nvme0_hd4_smv0:GAUGE:120:0:U DS:nvme0_hd4_smv1:GAUGE:120:0:U DS:nvme0_hd4_smv2:GAUGE:120:0:U DS:nvme0_hd4_smv3:GAUGE:120:0:U DS:nvme0_hd4_smv4:GAUGE:120:0:U DS:nvme0_hd4_smv5:GAUGE:120:0:U DS:nvme0_hd4_smv6:GAUGE:120:0:U DS:nvme0_hd4_smv7:GAUGE:120:0:U DS:nvme0_hd4_smv8:GAUGE:120:0:U DS:nvme0_hd5_smv0:GAUGE:120:0:U DS:nvme0_hd5_smv1:GAUGE:120:0:U DS:nvme0_hd5_smv2:GAUGE:120:0:U DS:nvme0_hd5_smv3:GAUGE:120:0:U DS:nvme0_hd5_smv4:GAUGE:120:0:U DS:nvme0_hd5_smv5:GAUGE:120:0:U DS:nvme0_hd5_smv6:GAUGE:120:0:U DS:nvme0_hd5_smv7:GAUGE:120:0:U DS:nvme0_hd5_smv8:GAUGE:120:0:U DS:nvme0_hd6_smv0:GAUGE:120:0:U DS:nvme0_hd6_smv1:GAUGE:120:0:U DS:nvme0_hd6_smv2:GAUGE:120:0:U DS:nvme0_hd6_smv3:GAUGE:120:0:U DS:nvme0_hd6_smv4:GAUGE:120:0:U DS:nvme0_hd6_smv5:GAUGE:120:0:U DS:nvme0_hd6_smv6:GAUGE:120:0:U DS:nvme0_hd6_smv7:GAUGE:120:0:U DS:nvme0_hd6_smv8:GAUGE:120:0:U DS:nvme0_hd7_smv0:GAUGE:120:0:U DS:nvme0_hd7_smv1:GAUGE:120:0:U DS:nvme0_hd7_smv2:GAUGE:120:0:U DS:nvme0_hd7_smv3:GAUGE:120:0:U DS:nvme0_hd7_smv4:GAUGE:120:0:U DS:nvme0_hd7_smv5:GAUGE:120:0:U DS:nvme0_hd7_smv6:GAUGE:120:0:U DS:nvme0_hd7_smv7:GAUGE:120:0:U DS:nvme0_hd7_smv8:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:30:336 RRA:AVERAGE:0.5:60:744 RRA:AVERAGE:0.5:1440:365 RRA:MIN:0.5:1:1440 RRA:MIN:0.5:30:336 RRA:MIN:0.5:60:744 RRA:MIN:0.5:1440:365 RRA:MAX:0.5:1:1440 RRA:MAX:0.5:30:336 RRA:MAX:0.5:60:744 RRA:MAX:0.5:1440:365 RRA:LAST:0.5:1:1440 RRA:LAST:0.5:30:336 RRA:LAST:0.5:60:744 RRA:LAST:0.5:1440:365
}
# Repeat the given command ($1) until it fails, sleeping between runs.
ACTION_LOOP() {
    echo "performing ${ACTION} action to ${RRDFILE} with sleeptime of ${SLEEPTIME_IN_S}s"
    INDEX=0
    echo "testing ${ACTION} ${INDEX}"
    ((++INDEX))
    ($1)
    while [ $? -eq 0 ]; do
        sleep ${SLEEPTIME_IN_S}
        echo "testing ${ACTION} ${INDEX}"
        ((++INDEX))
        ($1)
    done
    # The loop only exits when rrdtool returns non-zero, e.g. on a locking error.
    echo "aborted ${ACTION} action at $INDEX"
}
case "${ACTION}" in
create)
echo "creating ${RRDFILE}"
CREATE_COMMAND
echo "Done"
;;
read)
ACTION_LOOP READ_COMMAND
;;
write)
ACTION_LOOP WRITE_COMMAND
;;
*)
echo "usage: $0 <filename> <sleeptime> {create|read|write}"
exit 1
;;
esac
exit 0
Reproducing the locking error (run the write and read loops concurrently, in separate terminals):
$ ./test_rrd_locking.sh test.rrd 0 create
$ ./test_rrd_locking.sh test.rrd 0.051 write
$ ./test_rrd_locking.sh test.rrd 0.05 read
After roughly 10-200 iterations I end up getting the locking error on either the read or the write side. Running only reads or only writes worked flawlessly for millions of iterations, but when doing both at the same time the locking errors pop up again.
@mikaku maybe this way you are able to reproduce the rrd locking error?
Yes, I've been running the script and I was able to reproduce the locking error. I think, though, that the way this script stresses the RRD read/write mechanism is far from what I've seen in normal Monitorix installations.
There is a mail thread where Tobias Oetiker, the author of RRDtool, comments that the file locking mechanism in RRDtool handles concurrent reads and writes.
Nevertheless, if for an unknown reason your system is accessing .rrd files at such a high rate, maybe you could use rrdcached. I've never used it before, but I think that Monitorix would need some tweaking to include support for it. This might well be worth a try.
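From the rrdtool documentation, routing an update through rrdcached would look roughly like this (just a sketch; the socket path and the file/values are made up for illustration):
# Sketch: let rrdcached serialize the writes instead of touching the
# file directly. Socket path and .rrd file below are assumptions.
use RRDs;

my $daemon = "unix:/var/run/rrdcached.sock";
my $rrd    = "/var/lib/monitorix/system.rrd";
RRDs::update($rrd, "--daemon", $daemon, "N:0.15:0.10:0.05");
my $err = RRDs::error;
print "ERROR: while updating $rrd: $err\n" if $err;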
Yes, I've been running the script and I was able to reproduce the locking error. I think, though, that the way this script stresses the RRD read/write mechanism is far from what I've seen in normal Monitorix installations.
Yes, Monitorix will not usually encounter these frequencies; it was just a test to check whether reading from and writing to the rrd files the way Monitorix currently does can cause this issue.
My assumption at the moment is that it occurs by chance when the browser-initiated cgi call and the monitorix-initiated update call overlap.
I have the monitorix page open continuously, which increases the chance of overlapping calls to the rrd. I experience these locking errors from Monitorix roughly every other day, and most of the time when this happens some plots get dips which destroy the axis scaling until the dip moves out of range again. One then has to check whether it is an actual indication of a hardware/software problem or just an artefact of the locking error.
There is a mail thread where Tobias Oetiker, the author of RRDtool, comments that the file locking mechanism in RRDtool handles concurrent reads and writes.
I have read it, but I am not sure whether he is referring to rrdcached, to rrd files in general, or whether the user has to handle the locks themselves and rrdtool just provides the check that the file is locked.
Nevertheless, if for an unknown reason your system is accessing .rrd files at such a high rate, maybe you could use rrdcached. I've never used it before, but I think that Monitorix would need some tweaking to include support for it.
I am not concerned about this happening at high rates, but I see it also happening at "normal" rates (just having the page open in a browser), and it is just a matter of time until anyone encounters this error, as it happens by chance because of the unprotected read/write access to the rrd files.
In principle, if the error occurs during reading it is not that big of a problem, as one can just refresh the page and get a new plot which is now probably correct. But if the error occurs while writing the rrd, it might introduce wrongly missing data that will stay there forever.
rrdcached would be one fix, but I don't have experience with it up to now either; I would have to look into it.
Another fix would be to check, when reading or writing, whether the file is locked; if it is locked, wait a bit and try again.
Something like this:
use Time::HiRes qw(usleep);  # needed for the sub-second sleep below

my $number_of_rrd_retries = 3;
for (my $i_retry = 0; $i_retry < $number_of_rrd_retries; $i_retry++) {
	RRDs::update($rrd, $rrdata);
	logger("$myself: $rrdata") if $debug;
	my $err = RRDs::error;
	if ($err) {
		if (index($err, "could not lock RRD") != -1) {
			if ($i_retry + 1 < $number_of_rrd_retries) {
				logger("WARNING: will retry locking because while updating $rrd: $err");
				usleep(1000);
			} else {
				logger("ERROR: reached maximum retries for locking $rrd: $err");
				return;
			}
		} else {
			logger("ERROR: while updating $rrd: $err");
			return;
		}
	} else {
		return;
	}
}
Maybe there is a better way to check for a locking error besides parsing for could not lock RRD.
I have the monitorix page open continuously, which increases the chance of overlapping calls to the rrd. I experience these locking errors from Monitorix roughly every other day, and most of the time when this happens some plots get dips which destroy the axis scaling until the dip moves out of range again. One then has to check whether it is an actual indication of a hardware/software problem or just an artefact of the locking error.
Yes, and if someone reduces the value of refresh_rate, which is 150 (seconds) by default, then there will be even more chances of hitting that locking issue.
Maybe there is a better way to check for a locking error besides parsing for could not lock RRD.
I was just thinking of flock. We could introduce a flock($fh, LOCK_EX) (exclusive lock) right before calling the update functions of all graphs, here:
https://github.com/mikaku/Monitorix/blob/ff9b04d8e4bcbf1c9d58dd99b1f9e2e33b004b08/monitorix#L768
and a flock($fh, LOCK_UN) (unlock) here:
https://github.com/mikaku/Monitorix/blob/ff9b04d8e4bcbf1c9d58dd99b1f9e2e33b004b08/monitorix#L788
Also, we will have to include the same two flocks in monitorix.cgi, right before calling the cgi functions of all graphs:
here: https://github.com/mikaku/Monitorix/blob/ff9b04d8e4bcbf1c9d58dd99b1f9e2e33b004b08/monitorix.cgi#L580
and here: https://github.com/mikaku/Monitorix/blob/ff9b04d8e4bcbf1c9d58dd99b1f9e2e33b004b08/monitorix.cgi#L655
The advantage of flock(2) over a sleep time is that a process trying to acquire the exclusive lock will block until it's unlocked by the other process.
Forgot to say that the file that will be used for flock can be a temporary file in /tmp/.
Also, the emailreports module might need the same flock, as it accesses the .rrd files when sending the reports, although this has a very low access rate.
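A minimal sketch of the idea, assuming a hypothetical lock file name (the real change would wrap the two per-graph loops linked above):
# Sketch of the proposed global lock; the file name is an assumption.
use Fcntl qw(:flock);

my $lock_file = "/tmp/monitorix.lock";
open(my $lock_fh, ">", $lock_file) or die "can't open $lock_file: $!";
flock($lock_fh, LOCK_EX) or die "can't lock $lock_file: $!";  # blocks until free

# ... call the XXX_update (or XXX_cgi) functions of all graphs here ...

flock($lock_fh, LOCK_UN);
close($lock_fh);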
I was just thinking of flock. We could introduce a flock($fh, LOCK_EX) (exclusive lock) right before calling the update functions of all graphs, here:
Yes, this would be better than sleeping.
It seems rrdtool has some kind of flock, but somehow it does not lock in the monitorix use case: https://github.com/oetiker/rrdtool-1.x/blob/6b05d94486a478b91724c6babfbcc565e362fe46/src/rrd_open.c#L796
We could also use the rrd file itself for flock. I tested this in a slightly modified version of the above test script by adding flock to the read and write commands.
Replace the above read and write commands with:
READ_COMMAND() {
flock -s -F ${RRDFILE} -c "rrdtool graphv test01z.1day.svg --title=Title --start=-1day --imgformat=SVG --vertical-label=label --width=800 --height=300 --full-size-mode --zoom=1 --slope-mode DEF:hd0_smv0=${RRDFILE}:nvme0_hd0_smv0:AVERAGE DEF:hd1_smv0=${RRDFILE}:nvme0_hd1_smv0:AVERAGE DEF:hd2_smv0=${RRDFILE}:nvme0_hd2_smv0:AVERAGE DEF:hd3_smv0=${RRDFILE}:nvme0_hd3_smv0:AVERAGE DEF:hd4_smv0=${RRDFILE}:nvme0_hd4_smv0:AVERAGE DEF:hd5_smv0=${RRDFILE}:nvme0_hd5_smv0:AVERAGE DEF:hd6_smv0=${RRDFILE}:nvme0_hd6_smv0:AVERAGE DEF:hd7_smv0=${RRDFILE}:nvme0_hd7_smv0:AVERAGE CDEF:allvalues=hd0_smv0,UN,0,1,IF,hd1_smv0,UN,0,1,IF,hd2_smv0,UN,0,1,IF,hd3_smv0,UN,0,1,IF,hd4_smv0,UN,0,1,IF,hd5_smv0,UN,0,1,IF,hd6_smv0,UN,0,1,IF,hd7_smv0,UN,0,1,IF,+,+,+,+,+,+,+,0,GT,1,UNKN,IF CDEF:trans_hd0_smv0=hd0_smv0 CDEF:trans_hd1_smv0=hd1_smv0 CDEF:trans_hd2_smv0=hd2_smv0 CDEF:trans_hd3_smv0=hd3_smv0 CDEF:trans_hd4_smv0=hd4_smv0 CDEF:trans_hd5_smv0=hd5_smv0 CDEF:trans_hd6_smv0=hd6_smv0 CDEF:trans_hd7_smv0=hd7_smv0 CDEF:wrongdata=allvalues,UN,INF,UNKN,IF LINE2:trans_hd0_smv0#FFA500:/dev/nvme0 AREA:wrongdata#FFFFFF:"
}
WRITE_COMMAND() {
flock -F ${RRDFILE} -c "rrdtool update ${RRDFILE} N:38:100:0:16738106:0:20:32282126:0.0569779319774506:0:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN:NaN"
}
This way the test script also works with high rates.
We could add these routines to Monitorix.pm:
use Fcntl qw(:flock);  # for LOCK_SH, LOCK_EX, LOCK_UN

sub flocked_rrd_update {
	my ($myself, $debug, $rrd, $rrdata) = @_;
	open(my $rrd_fh, ">>", $rrd) or die "Can't open $rrd: $!";
	flock($rrd_fh, LOCK_EX) or die "Cannot lock rrd: $!\n";
	RRDs::update($rrd, $rrdata);
	logger("$myself: $rrdata") if $debug;
	my $err = RRDs::error;
	logger("ERROR: while updating $rrd: $err") if $err;
	flock($rrd_fh, LOCK_UN) or die "Cannot unlock rrd: $!\n";
	close($rrd_fh);
}

sub flocked_rrd_fetch {
	my ($rrd, @fetch_args) = @_;	# slurp the argument list (it arrives flattened)
	open(my $rrd_fh, ">>", $rrd) or die "Can't open $rrd: $!";
	flock($rrd_fh, LOCK_SH) or die "Cannot lock rrd: $!\n";
	my (undef, undef, undef, $data) = RRDs::fetch($rrd, @fetch_args);
	my $err = RRDs::error;
	flock($rrd_fh, LOCK_UN) or die "Cannot unlock rrd: $!\n";
	close($rrd_fh);
	return ($data, $err);
}

sub flocked_rrd_call {
	my ($rrd, $rrd_call, @call_args) = @_;
	open(my $rrd_fh, ">>", $rrd) or die "Can't open $rrd: $!";
	flock($rrd_fh, LOCK_SH) or die "Cannot lock rrd: $!\n";
	my $rrd_return_data = $rrd_call->(@call_args);
	my $err = RRDs::error;
	flock($rrd_fh, LOCK_UN) or die "Cannot unlock rrd: $!\n";
	close($rrd_fh);
	return ($rrd_return_data, $err);
}
And replace the RRDs update calls:
RRDs::update($rrd, $rrdata);
logger("$myself: $rrdata") if $debug;
my $err = RRDs::error;
logger("ERROR: while updating $rrd: $err") if $err;
to:
flocked_rrd_update($myself, $debug, $rrd, $rrdata);
The fetch call:
my (undef, undef, undef, $data) = RRDs::fetch("$rrd",
	"--resolution=$tf->{res}",
	"--start=-$tf->{nwhen}$tf->{twhen}",
	"AVERAGE");
$err = RRDs::error;
would be replaced by:
my ($data, $err) = flocked_rrd_fetch($rrd, ("--resolution=$tf->{res}", "--start=-$tf->{nwhen}$tf->{twhen}", "AVERAGE"));
and the graphv call would be replaced by:
my $err;
($pic, $err) = flocked_rrd_call($rrd, $rrd{$version}, ("$IMG_DIR" . $IMG[$e * $plots_per_list_item + $n_plot],
	"--title=$plot_title ($tf->{nwhen}$tf->{twhen})",
	"--start=-$tf->{nwhen}$tf->{twhen}",
	"--imgformat=$imgfmt_uc",
	"--vertical-label=" . $y_axis_title,
	"--width=$width",
	"--height=$height",
	@extra,
	@riglim,
	$zoom,
	@{$cgi->{version12}},
	$large_plot ? () : @{$cgi->{version12_small}},
	@{$colors->{graph_colors}},
	@def_sensor_average,
	$cdef_sensor_allvalues,
	@CDEF,
	@tmp)
);
But using the rrd file itself for flock requires that the cgi routine has write access to the rrd file to be able to lock it. Not sure if this is desired.
Maybe one global lock file in /tmp/ like you proposed, with write access for both callers, is better?
But using the rrd file itself for flock requires that the cgi routine has write access to the rrd file to be able to lock it. Not sure if this is desired.
That's not possible. The .rrd files are owned by the root user, and the cgi program runs under a regular user (usually nobody) for security reasons. The owner of the cgi program might change if the user switches from the built-in Monitorix HTTP server to an Apache web server. It would be complicated to also manage these user changes in the .rrd files.
Maybe one global lock file in /tmp/ like you proposed, with write access for both callers, is better?
Both solutions have advantages and disadvantages. If we choose the global lock path, the changes in the code are minimal and the .rrd files and cgi continue as now. In the other case, if we choose your suggestion, we get better lock granularity, but this is incompatible with the owner of the .rrd files and the user of the cgi.
Since this is something that almost no user is experiencing, I'd choose the global lock path, and it should rely on a new option which would be disabled by default. Something like enable_rrd_lock?
Since this is something that almost no user is experiencing, I'd choose the global lock path, and it should rely on a new option which would be disabled by default. Something like enable_rrd_lock?
Agreed, that would be the best.
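A rough sketch of how the gated lock could look (the option name is from this discussion; the config accessor, lock path, and permissions are assumptions):
# Sketch only: everything except the option name is assumed.
use Fcntl qw(:flock);

my $config    = { enable_rrd_lock => "y" };  # placeholder for the real config
my $lock_file = "/tmp/monitorix.lock";       # hypothetical global lock file

if (lc($config->{enable_rrd_lock} || "n") eq "y") {
	open(my $lock_fh, ">>", $lock_file) or die "can't open $lock_file: $!";
	chmod(0666, $lock_file);   # both root (daemon) and nobody (cgi) must open it
	flock($lock_fh, LOCK_EX);  # or LOCK_SH on the cgi/read side
	# ... update or read the .rrd files ...
	flock($lock_fh, LOCK_UN);
	close($lock_fh);
}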
In rare instances (roughly once every few days) I noticed an error in the monitorix log files about not being able to lock the RRD. The exact file is quite random up to now.
ERROR: while updating /var/lib/monitorix/XXX.rrd: could not lock RRD
I have only one instance of monitorix running per machine, but have enable_parallelizing = y. Looking at the source, it only parallelizes per plot collection, so this should also not be the issue, or is there another process also accessing the RRDs? Any idea what could cause this error?