odygrd / quill

Asynchronous Low Latency C++ Logging Library
MIT License

deleted log file race condition with flushing #395

Closed sandyharvie closed 5 months ago

sandyharvie commented 7 months ago

A potential fix for this would be to have BackendWorker::_force_flush() check for invalidated loggers and recompute the set of "active" handlers before performing the actual flushing. I'd be happy to submit an MR for this, if you think that sounds reasonable.

Originally posted by @sandyharvie in https://github.com/odygrd/quill/issues/267#issuecomment-1928368172

odygrd commented 5 months ago

Hey, sorry for the late response.

How are L' and F' related? The 4th bullet point is a bit confusing. Is it L''s handler or L's handler? If it is the former, it should be F', but F is mentioned.

We can make the logic in the MR that reopens the deleted file optional, but I would like to understand the problem first.

Would you be able to submit a small code sample that could potentially reproduce this ?

sandyharvie commented 5 months ago

I've edited the original comment to be clearer. A user should, in my opinion, be able to safely perform the following sequence of events:

  1. Create a temporary directory, then create a file in that temporary directory
  2. Create a logger that will log to that file for some period of time
  3. Flush, to take care of any pending log statements for that file
  4. Delete the temporary directory
  5. Delete the logger
  6. Create a new logger that logs to a different file
  7. Log using this new logger, then flush

This is currently not guaranteed to work, though. The flush event created in step 7 can end up getting processed before the first logger is invalidated, which means the first logger's handler will still be considered "active" and therefore have its flush() function invoked as part of the processing of this flush event. This invocation of flush() will see that the file no longer exists, then fail trying to create it again.

The fundamental issue here, as I see it, is that the set of "active" handlers is not recomputed when processing a flush event. I understand that this is an expensive operation (as documented here), but it seems like it'd be worth the cost given that flushing is already a slow operation.

odygrd commented 5 months ago

Thanks, I see what you mean.

Handlers can be shared between loggers, so you can have an invalidated logger and a valid logger using the same handler.

Even if we check inside _force_flush() for non-invalidated loggers, filter only the active handlers, and flush those, a logger can still be invalidated right after the filtering, just before we call flush() on its handler.

I think if you change the sequence in step 5 it should work fine in the current version.

  1. First flush and invalidate the logger.
  2. Once the logger is removed, the associated file fd and handler will be closed, so you can safely remove the file

I will try to reproduce it first in the following way

  a. logger_a logs, then flush
  b. Manually delete file_a.log
  c. Log, then flush from logger_b, which will now also try to flush the deleted file_a.log from logger_a

If the above reproduces it, then maybe we can check whether the file exists before calling flush() on the handler. We can also check for invalidated loggers in _force_flush().

There will still be a race condition, e.g. the file is removed after the existence check, just before flushing, but both of those changes should mitigate it and make it rarer.
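The suggested guard could be sketched roughly like this; Handler, force_flush, and the filename field are simplified stand-ins for this sketch, not quill's actual types:

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a quill file handler; the real flushing happens
// in BackendWorker::_force_flush() over the set of active handlers.
struct Handler
{
  std::string filename;
  bool flushed = false;
  void flush() { flushed = true; } // placeholder for the real flush
};

// Sketch of the proposed guard: skip handlers whose backing file no longer
// exists instead of letting flush() fail while trying to recreate it.
void force_flush(std::vector<Handler>& active_handlers)
{
  for (Handler& h : active_handlers)
  {
    if (!std::filesystem::exists(h.filename))
    {
      continue; // file or its directory was deleted; nothing to flush
    }
    // NOTE: the file can still be removed between exists() and flush();
    // the check only narrows the race window, it does not eliminate it.
    h.flush();
  }
}
```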

What are your thoughts?

odygrd commented 5 months ago

I have tried to reproduce something like this

#include "quill/Quill.h"

#include <chrono>
#include <iostream>
#include <thread>

/**
 * Logging to two files using two loggers, then deleting the first file
 * while its handler is still registered
 */

int main()
{
  // Start the backend logging thread
  quill::start();

  // Create a handler and logger for file a
  std::shared_ptr<quill::Handler> file_handler_a =
    quill::file_handler("example_trivial_a.log",
                        []()
                        {
                          quill::FileHandlerConfig cfg;
                          cfg.set_open_mode('w');
                          cfg.set_append_to_filename(quill::FilenameAppend::None);
                          return cfg;
                        }());
  quill::Logger* logger_a = quill::create_logger("logger_a", std::move(file_handler_a));

  std::shared_ptr<quill::Handler> file_handler_b =
    quill::file_handler("example_trivial_b.log",
                        []()
                        {
                          quill::FileHandlerConfig cfg;
                          cfg.set_open_mode('w');
                          cfg.set_append_to_filename(quill::FilenameAppend::None);
                          return cfg;
                        }());
  quill::Logger* logger_b = quill::create_logger("logger_b", std::move(file_handler_b));

  for (size_t i = 0; i < 1000; ++i)
  {
    QUILL_LOG_INFO(logger_a, "log something {}", i);
  }

  quill::flush();

  std::cout << "Delete example_trivial_a.log file now " << std::endl;
  std::this_thread::sleep_for(std::chrono::seconds{10});

  for (size_t i = 0; i < 10; ++i)
  {
    QUILL_LOG_INFO(logger_b, "log something {}", i);
  }
  quill::flush();
}

I see the exception; it is because the directory doesn't exist anymore. Otherwise flush will just recreate an empty file.

sandyharvie commented 5 months ago

Yeah, you need to delete the file and the directory in which it was created.

sandyharvie commented 5 months ago

> I think if you change the sequence in step 5 it should work fine in the current version.
>
>   1. First flush and invalidate the logger.
>   2. Once the logger is removed, the associated file fd and handler will be closed, so you can safely remove the file

AFAIK, the current API does not allow me to check whether the logger has been removed. I can only trigger an asynchronous removal and expect that it will eventually be removed, so I am not sure how to perform step 2 here reliably.

> There will still be a race condition, e.g. the file is removed after the existence check, just before flushing, but both of those changes should mitigate it and make it rarer.

The race condition you describe can only happen if the removal of the file's directory is performed after or concurrent to the submission of the second flush event. If it is performed before the submission of the second flush event, as is the case in my example and your code above, then the file's deletion will necessarily be detected while processing the flush event.

> What are your thoughts?

Making logger removal synchronous would be a clean solution, but would likely be a more arduous addition than simply checking for invalidated loggers in BackendWorker::_force_flush(). I also think making recreation of the file optional would be a good addition to the API. I am happy to submit a PR for this, so just let me know what you'd prefer.
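For illustration, synchronous removal could be layered on top of an asynchronous one with a promise/future handshake; Backend, remove_logger_async, and the removed flag are invented for this sketch and are not quill's API:

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical model of the backend worker: logger removal is processed
// asynchronously on another thread, as in quill's backend.
struct Backend
{
  bool removed = false;
  std::thread worker;

  // Asynchronous removal: the promise is fulfilled once the logger has been
  // invalidated and its handler/fd closed.
  void remove_logger_async(std::promise<void> done)
  {
    worker = std::thread(
      [this, p = std::move(done)]() mutable
      {
        removed = true; // stands in for: invalidate logger, close handler
        p.set_value();  // signal completion to the caller
      });
  }

  ~Backend()
  {
    if (worker.joinable()) { worker.join(); }
  }
};

// Synchronous wrapper: once this returns, the handler is closed and it is
// safe to delete the log file and its directory.
void remove_logger_sync(Backend& backend)
{
  std::promise<void> done;
  std::future<void> completed = done.get_future();
  backend.remove_logger_async(std::move(done));
  completed.wait(); // block until the backend has finished the removal
}
```

The promise/future pair gives the caller the "has it actually been removed?" signal that the current asynchronous-only API lacks.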

odygrd commented 5 months ago

Thanks for reporting this. I found a few more things to improve while looking at this. It should not happen again after this PR.