scalyr / scalyr-agent-2

The source code for Scalyr Agent 2, the daemon process Scalyr customers run on their servers to collect metrics and logs.
Apache License 2.0
71 stars 56 forks source link

Fix - Race condition when trying to add a log file that's been scheduled for removal #1302

Closed alesnovak-s1 closed 1 month ago

alesnovak-s1 commented 1 month ago

Based on the comments the lock was initially meant to lock status fields used in generate_status. Now it ensures synchronized access to fields needed for asynchonous workflow of log files. This is already causing a timeout when retrieving status from the agent (https://sentinelone.atlassian.net/browse/DTIN-4769) and will be handled in another PR.

The Bug:

When a log file (path, worker_id) is scheduled for removal, the removal is done in two separate steps:

The race condition can be reached when a path is scheduled for removal and the matchers are still finishing. When add_log_config is called it considers log path to be active and skips adding it. The log file is later removed.

In log we can see sequence like this one:

Log ( log_path='xy', worker_id='default' ) for monitor 'scalyr_agent.builtin_monitors.kubernetes_monitor' is pending removal

Tried to add new log file 'xy' for monitor 'scalyr_agent.builtin_monitors.kubernetes_monitor', but it is already being monitored by 'scalyr_agent.builtin_monitors.kubernetes_monitor'

Removing log file ( log_path='xy', worker_id='default' ) for 'scheduled-deletion'
github-actions[bot] commented 1 month ago

Test Results

     20 files  ±  0       20 suites  ±0   31m 28s :stopwatch: -11s 1 497 tests +  8  1 477 :heavy_check_mark: +  8    20 :zzz: ±0  0 :x: ±0  7 156 runs  +40  6 929 :heavy_check_mark: +40  227 :zzz: ±0  0 :x: ±0 

Results for commit 423aea8d. ± Comparison against base commit c189190a.

:recycle: This comment has been updated with latest results.