odygrd / quill

Asynchronous Low Latency C++ Logging Library
MIT License

deleted log file race condition with flushing #395

Closed sandyharvie closed 5 months ago

sandyharvie commented 7 months ago

A potential fix for this would be to have BackendWorker::_force_flush() check for invalidated loggers and recompute the set of "active" handlers before performing the actual flushing. I'd be happy to submit an MR for this, if you think that sounds reasonable.

Originally posted by @sandyharvie in https://github.com/odygrd/quill/issues/267#issuecomment-1928368172

odygrd commented 5 months ago

Hey, sorry for the late response.

How are L' and F' related? The 4th bullet point is a bit confusing. Is it L''s handler or L's handler? If it is the former, it should be F', but F is mentioned.

We can make the logic in the MR that reopens the deleted file optional, but I would like to understand the problem first.

Would you be able to submit a small code sample that could potentially reproduce this ?

sandyharvie commented 5 months ago

I've edited the original comment to be clearer. A user should, in my opinion, be able to safely perform the following sequence of events:

  1. Create a temporary directory, then create a file in that temporary directory
  2. Create a logger that will log to that file for some period of time
  3. Flush, to take care of any pending log statements for that file
  4. Delete the temporary directory
  5. Delete the logger
  6. Create a new logger that logs to a different file
  7. Log using this new logger, then flush

This is currently not guaranteed to work, though. The flush event created in step 7 can end up getting processed before the first logger is invalidated, which means the first logger's handler will still be considered "active" and therefore have its flush() function invoked as part of the processing of this flush event. This invocation of flush() will see that the file no longer exists, then fail trying to create it again.

The fundamental issue here, as I see it, is that the set of "active" handlers is not recomputed when processing a flush event. I understand that this is an expensive operation (as documented here), but it seems like it'd be worth the cost given that flushing is already a slow operation.

odygrd commented 5 months ago

Thanks, I see what you mean.

Handlers can be shared between loggers, so you can have an invalidated logger and a valid logger using the same handler.

Even if we check inside _force_flush() for non-invalidated loggers, filter only the active handlers, and flush those, a logger can still be invalidated right after the filtering, just before we call flush() on its handler.

I think if you change the sequence in step 5 it should work fine in the current version.

  1. First flush and invalidate the logger.
  2. Once the logger is removed, the associated file fd and handler will be closed, so you can safely remove the file

I will try to reproduce it first in the following way

  a. logger_a logs, then flush
  b. Manually delete file_a.log
  c. Log, then flush from logger_b, which will now also try to flush the deleted file_a.log from logger_a

If the above reproduces it, then maybe we can check whether the file exists before calling flush() on the handler. We can also check for invalidated loggers in _force_flush().

There will still be a race condition, e.g. the file is removed after the existence check, just before flushing, but both of those changes should mitigate it and make it rarer.
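The suggested guard could be sketched roughly like this; Handler, force_flush, and the filename field are simplified stand-ins for this sketch, not quill's actual types:

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a quill file handler; the real flushing happens
// in BackendWorker::_force_flush() over the set of active handlers.
struct Handler
{
  std::string filename;
  bool flushed = false;
  void flush() { flushed = true; } // placeholder for the real flush
};

// Sketch of the proposed guard: skip handlers whose backing file no longer
// exists instead of letting flush() fail while trying to recreate it.
void force_flush(std::vector<Handler>& active_handlers)
{
  for (Handler& h : active_handlers)
  {
    if (!std::filesystem::exists(h.filename))
    {
      continue; // file or its directory was deleted; nothing to flush
    }
    // NOTE: the file can still be removed between exists() and flush();
    // the check only narrows the race window, it does not eliminate it.
    h.flush();
  }
}
```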

What are your thoughts?

odygrd commented 5 months ago

I have tried to reproduce something like this

#include "quill/Quill.h"

#include <chrono>
#include <iostream>
#include <thread>

/**
 * Logging to two files using two loggers, then deleting the first file
 * while its handler is still registered
 */

int main()
{
  // Start the backend logging thread
  quill::start();

  // Create a handler and logger for file a
  std::shared_ptr<quill::Handler> file_handler_a =
    quill::file_handler("example_trivial_a.log",
                        []()
                        {
                          quill::FileHandlerConfig cfg;
                          cfg.set_open_mode('w');
                          cfg.set_append_to_filename(quill::FilenameAppend::None);
                          return cfg;
                        }());
  quill::Logger* logger_a = quill::create_logger("logger_a", std::move(file_handler_a));

  std::shared_ptr<quill::Handler> file_handler_b =
    quill::file_handler("example_trivial_b.log",
                        []()
                        {
                          quill::FileHandlerConfig cfg;
                          cfg.set_open_mode('w');
                          cfg.set_append_to_filename(quill::FilenameAppend::None);
                          return cfg;
                        }());
  quill::Logger* logger_b = quill::create_logger("logger_b", std::move(file_handler_b));

  for (size_t i = 0; i < 1000; ++i)
  {
    QUILL_LOG_INFO(logger_a, "log something {}", i);
  }

  quill::flush();

  std::cout << "Delete example_trivial_a.log file now " << std::endl;
  std::this_thread::sleep_for(std::chrono::seconds{10});

  for (size_t i = 0; i < 10; ++i)
  {
    QUILL_LOG_INFO(logger_b, "log something {}", i);
  }
  quill::flush();
}

I see the exception; it is because the directory doesn't exist anymore. Otherwise flush will just recreate an empty file.

sandyharvie commented 5 months ago

Yeah, you need to delete the file and the directory in which it was created.

sandyharvie commented 5 months ago

> I think if you change the sequence in step 5 it should work fine in the current version.
>
>   1. First flush and invalidate the logger.
>   2. Once the logger is removed, the associated file fd and handler will be closed, so you can safely remove the file

AFAIK, the current API does not allow me to check whether the logger has been removed. I can only trigger an asynchronous removal and expect that it will eventually be removed, so I am not sure how to perform step 2 here reliably.

> There will still be a race condition, e.g. the file is removed after the existence check, just before flushing, but both of those changes should mitigate it and make it rarer.

The race condition you describe can only happen if the removal of the file's directory is performed after or concurrent to the submission of the second flush event. If it is performed before the submission of the second flush event, as is the case in my example and your code above, then the file's deletion will necessarily be detected while processing the flush event.

> What are your thoughts?

Making logger removal synchronous would be a clean solution, but would likely be a more arduous addition than simply checking for invalidated loggers in BackendWorker::_force_flush(). I also think making recreation of the file optional would be a good addition to the API. I am happy to submit a PR for this, so just let me know what you'd prefer.
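For illustration, synchronous removal could be layered on top of an asynchronous one with a promise/future handshake; Backend, remove_logger_async, and the removed flag are invented for this sketch and are not quill's API:

```cpp
#include <cassert>
#include <future>
#include <thread>

// Hypothetical model of the backend worker: logger removal is processed
// asynchronously on another thread, as in quill's backend.
struct Backend
{
  bool removed = false;
  std::thread worker;

  // Asynchronous removal: the promise is fulfilled once the logger has been
  // invalidated and its handler/fd closed.
  void remove_logger_async(std::promise<void> done)
  {
    worker = std::thread(
      [this, p = std::move(done)]() mutable
      {
        removed = true; // stands in for: invalidate logger, close handler
        p.set_value();  // signal completion to the caller
      });
  }

  ~Backend()
  {
    if (worker.joinable()) { worker.join(); }
  }
};

// Synchronous wrapper: once this returns, the handler is closed and it is
// safe to delete the log file and its directory.
void remove_logger_sync(Backend& backend)
{
  std::promise<void> done;
  std::future<void> completed = done.get_future();
  backend.remove_logger_async(std::move(done));
  completed.wait(); // block until the backend has finished the removal
}
```

The promise/future pair gives the caller the "has it actually been removed?" signal that the current asynchronous-only API lacks.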

odygrd commented 5 months ago

Thanks for reporting this. I found a few more things to improve while looking at this. It should not happen again after this PR.