sonic-net / sonic-platform-daemons

Platform module daemons for SONiC
Other
25 stars 159 forks source link

[stormond] Added new dynamic field 'last_sync_time' to STATE_DB #535

Open assrinivasan opened 3 months ago

assrinivasan commented 3 months ago

Description

Currently the Storage monitoring Daemon does not provide any information on when it last synced attributes of each storage device onto the database. This PR adds new dynamic field called last_sync_time that shows when STORAGE_INFO fields for each disk was last synced to STATE_DB. The value is in Unix timestamp format.

HLD Change PR: https://github.com/sonic-net/SONiC/pull/1784

Motivation and Context

The motivation is that this field would provide additional context in case of repeated daemon crashes scenario ending up with the daemon status = FATAL Exited too quickly (process log may have details).

How Has This Been Tested?

Manually tested my changes on a DUT:

  1. Let the daemon run for 10 minutes and then stopped it to sim a FATAL scenario:
  2. Let some time pass:
root@sonic-device:/usr/local/bin# date +%s
1724487680
  1. Checked STATE_DB for when it was last updated and verified match with logs:
2024 Aug 24 07:55:50.392206 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Starting Storage Monitoring Daemon
2024 Aug 24 07:55:50.437050 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Device Model: ATP AF240GSTJA-AW1, Serial: 23090240-000257
2024 Aug 24 07:55:50.437789 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Polling Interval set to 60 seconds
2024 Aug 24 07:55:50.437789 str3-7060x6-64pe-2 INFO pmon#stormond[906]: FSIO JSON file Interval set to 360 seconds
2024 Aug 24 07:55:50.481670 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Firmware: 42A4SB6G, health: 100.0%, Temp: 27.0C, FS IO Reads: 95432, FS IO Writes: 18860
2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Latest FSIO Reads: 95432, Latest FSIO Writes: 18860
2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Disk IO Reads: 3,657,918 [1.87 TB], Disk IO Writes: 1,155,355 [591 GB], Reserved Blocks: 100.0
**2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Last successful sync time to STATE_DB: 1724486150.4809144**
2024 Aug 24 07:56:00.045910 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:56:00,044 INFO success: stormond entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024 Aug 24 07:56:50.483157 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Polling Interval set to 60 seconds
2024 Aug 24 07:56:50.483157 str3-7060x6-64pe-2 INFO pmon#stormond[906]: FSIO JSON file Interval set to 360 seconds
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Firmware: 42A4SB6G, health: 100.0%, Temp: 27.0C, FS IO Reads: 95432, FS IO Writes: 18971
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Latest FSIO Reads: 95432, Latest FSIO Writes: 18971
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Disk IO Reads: 3,657,918 [1.87 TB], Disk IO Writes: 1,155,357 [591 GB], Reserved Blocks: 100.0
**2024 Aug 24 07:56:50.531956 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Last successful sync time to STATE_DB: 1724486210.5309312**
2024 Aug 24 07:57:41.743749 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Caught signal 'SIGTERM'
2024 Aug 24 07:57:41.743841 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Syncing total and latest procfs reads and writes from STATE_DB to JSON file
2024 Aug 24 07:57:41.743841 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:57:41,743 INFO waiting for stormond to stop
2024 Aug 24 07:57:41.745303 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Exiting with SIGTERM
2024 Aug 24 07:57:41.745408 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Shutting down Storage Monitoring Daemon
2024 Aug 24 07:57:41.832795 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:57:41,832 WARN stopped: stormond (exit status 143)
root@sonic-device:/usr/local/bin# redis-cli -n 6 HGETALL "STORAGE_INFO|nvme0n1"
 1) "device_model"
 2) "ATP AF240GSTJA-AW1"
 3) "serial"
 4) "23090240-000257"
 5) "firmware"
 6) "42A4SB6G"
 7) "health"
 8) "100.0"
 9) "temperature"
10) "27.0"
11) "latest_fsio_reads"
12) "95432"
13) "latest_fsio_writes"
14) "18971"
15) "disk_io_reads"
16) "3,657,918 [1.87 TB]"
17) "disk_io_writes"
18) "1,155,357 [591 GB]"
19) "reserved_blocks"
20) "100.0"
21) "total_fsio_reads"
22) "95432"
23) "total_fsio_writes"
24) "18971"
25) "last_sync_time"
26) "1724486210.5309312"

Thus we can see that the storage attributes for the nvme0n1 disk have not been updated in (1724487680 - 1724486210.5309312) = 1469 seconds ~= 25 minutes.

assrinivasan commented 14 hours ago

/azpw run coverage.Azure.sonic-platform-daemons.build_test

mssonicbld commented 14 hours ago

/AzurePipelines run coverage.Azure.sonic-platform-daemons.build_test

azure-pipelines[bot] commented 14 hours ago
No pipelines are associated with this pull request.
assrinivasan commented 14 hours ago

@prgeor @saiarcot895 failure appears to be a false flag? Code Coverage for scripts/stormond is at 83% as of the latest commit:

assrinivasan@dev-vm:sonic-buildimage/src/sonic-platform-daemons$ git log -1 --oneline d5c5969 (HEAD -> stormon-add-last-sync-time, origin/stormon-add-last-sync-time) Increased log level so that they are seen in syslogs

image

Am I missing something?