Closed masterwishx closed 1 week ago
Hi,
Please post your app.log file from when the DBCLNP plugin runs. I want to check whether it executes without issues. You can also try to execute the DB cleanup plugin (DBCLNP) manually.
Also, please post stats on which tables have the most records, e.g.:
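A quick way to collect those per-table record counts is a small Python sketch like the one below. The `/app/db/app.db` path and the `table_counts` helper name are assumptions, not part of NetAlertX itself; adjust the path to wherever your install keeps the database.

```python
import sqlite3

def table_counts(db_path):
    """Return (table, row_count) pairs for all user tables, largest first."""
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    # sqlite_master lists all objects; filter to user tables only.
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type='table' AND name NOT LIKE 'sqlite_%'")]
    counts = [(t, cur.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0])
              for t in tables]
    con.close()
    return sorted(counts, key=lambda c: c[1], reverse=True)

if __name__ == "__main__":
    # Assumed container path; change to match your volume mapping.
    for table, n in table_counts("/app/db/app.db"):
        print(f"{table}: {n}")
```

Running this against a copy of the DB is safest, since it opens the live file read-mostly but still takes locks.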
I ran the DB cleanup ...
I found an error in the log during cleanup:
09:01:29 [DBCLNP] In script
09:01:29 [DBCLNP] Upkeep Database:
09:01:29 [DBCLNP] Online_History: Delete all but keep latest 150 entries
09:01:29 [DBCLNP] Events: Delete all older than 90 days (DAYS_TO_KEEP_EVENTS setting)
09:01:29 [DBCLNP] Plugins_History: Trim Plugins_History entries to less than 10000 per Plugin (PLUGINS_KEEP_HIST setting)
09:01:44 [DBCLNP] Plugins_History: Trim Notifications entries to less than 100
09:01:44 [DBCLNP] Trim AppEvents to less than 5000
09:01:44 [DBCLNP] Pholus_Scan: Delete all older than 7 days (PHOLUS_DAYS_DATA setting)
09:01:44 [DBCLNP] Pholus_Scan: Delete all duplicates
09:01:44 [DBCLNP] Plugins_Objects: Delete all duplicates
09:01:59 [Plugins] ⚠ ERROR - TIMEOUT - the plugin DBCLNP forcefully terminated as timeout reached. Increase TIMEOUT setting and scan interval.
09:01:59 [Plugins] No output received from the plugin DBCLNP - enable LOG_LEVEL=debug and check logs
@jokob-sk so I enabled debug, and it seems there was not enough time to clear the DB, so I increased the timeout from 30s to 300s. The DB size is now 50 MB, but app.db-wal is still 1.6 GB. Should I set it back to 30s?
It seems there is still high usage in wait mode, maybe because app.db-wal is still 1.6 GB:
Hi @masterwishx ,
Glad to see there's progress. According to ChatGPT, some conditions need to be met for the .wal file to shrink. As I read this, it should happen automatically after some time. You can try restarting the container and not accessing the UI for a bit so no unnecessary connections are opened.
If the size doesn't go down within a few hours/a day, I'd consider rebuilding the container with the CSV + setting file approach (🧭 Backup strategies -> Scenario 2: Corrupted database), but please keep backups:
https://github.com/jokob-sk/NetAlertX/blob/main/docs/BACKUPS.md
I hope this will then reduce CPU utilization significantly.
Checkpointing:
The WAL file shrinks when a checkpoint occurs. A checkpoint is the process of transferring data from the WAL file back to the main database file. This process can be triggered manually (using the PRAGMA wal_checkpoint command) or automatically when the WAL file reaches a certain size.
Completion of Checkpointing:
For the WAL file to shrink, the checkpoint process must successfully transfer all the committed transactions to the main database file. If there are still active readers or writers on the database, the checkpoint cannot fully complete, and the WAL file will not shrink.
Truncation:
After a checkpoint, if all transactions have been transferred, the WAL file may be truncated to zero or a smaller size. This depends on the mode of checkpointing and the settings of the SQLite database.
Closing the Database Connection:
In some cases, when the last connection to the database is closed, SQLite will attempt to shrink the WAL file as part of the cleanup process.
Reaching a Threshold:
SQLite may also automatically initiate a checkpoint when the WAL file reaches a certain size (default is 1000 pages or about 4MB). If this threshold is crossed and a checkpoint is successfully performed, the WAL file can shrink.
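The manual checkpoint mentioned above (`PRAGMA wal_checkpoint`) can be sketched in Python. The `truncate_wal` helper name is hypothetical; the returned triple is SQLite's documented checkpoint result.

```python
import sqlite3

def truncate_wal(db_path):
    """Run a TRUNCATE checkpoint: flush the WAL into the main DB file
    and shrink the -wal file to zero bytes if no readers block it."""
    con = sqlite3.connect(db_path)
    # Returns (busy, wal_frames, checkpointed_frames);
    # busy == 0 means the checkpoint completed fully.
    busy, wal_frames, ckpt_frames = con.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    con.close()
    return busy, wal_frames, ckpt_frames
```

If `busy` comes back non-zero, another connection was reading or writing and the WAL could not be fully truncated; retrying while the app is idle should succeed.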
Thanks, will try it. You have a typo in https://github.com/jokob-sk/NetAlertX/blob/main/docs/BACKUPS.md#data-and-cackup-storage: "cackup" instead of "backup".
haha, thanks, fixed
When I try to restore the last backup, I get this in the log:
I removed the DB files manually and copied them from the backup, but it seems there is still high CPU usage:
Are these the right permissions?
Hi @masterwishx ,
You can check the permissions by following this guide: https://github.com/jokob-sk/NetAlertX/blob/main/docs/FILE_PERMISSIONS.md
You can check in the log what is happening during those CPU spikes. It might be as simple as decreasing the frequency of scans or disabling certain plugins.
Not sure if it's related, and I can't find the exact time from the Grafana/Telegraf Docker panels, but with debug enabled I found a lot of:
Well, I made a new container in Unraid, added all the config manually plus the device CSV, and it seems OK now. Not sure what the problem was, but there was also a lot of I/O in GB...
I think I can close this for now; maybe you can comment on something for importing config?
Thanks for the update. You can set `LOG_LEVEL` to `NONE` or `MINIMAL` and these logs should disappear.
Couldn't this increase CPU usage?
Hmmm, not sure; I haven't monitored the app that closely. It might be worth a try to minimize the log writes by adjusting the setting to a less verbose value, such as `MINIMAL`.
Hi again, I have the same issue:
It seems there are 7970 events in the DB, but the setting is 5000 events:
Hi @masterwishx ,
It seems there are 7970 events in the DB, but the setting is 5000 events:
There is a delay in the cleanup script, so this is expected. By default it runs every 30 minutes.
Also, what monitoring tool do you use? I might set it up myself :)
Are these usual values?
Scanners:
app.log is now 18 MB
23:33:50 [Database] Opening DB
23:33:50 [AVAHISCAN] Unknown devices count: 3
23:33:50 [AVAHISCAN] Attempt 1 - Ensuring D-Bus and Avahi daemon are running...
23:33:50 [AVAHISCAN] Avahi Daemon is already running.
23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.140']
23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
23:33:50 [AVAHISCAN] Scan: FAIL - check logs
23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.176']
23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
23:33:50 [AVAHISCAN] Scan: FAIL - check logs
23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.164']
23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
23:33:50 [AVAHISCAN] Scan: FAIL - check logs
23:33:50 [AVAHISCAN] Script finished
5. On the spike at `23:31:05`, CPU = 95%:
23:30:53 [DBCLNP] Trim AppEvents to less than 5000
23:30:53 [DBCLNP] Pholus_Scan: Delete all older than 30 days (PHOLUS_DAYS_DATA setting)
23:30:53 [DBCLNP] Pholus_Scan: Delete all duplicates
23:30:53 [DBCLNP] Plugins_Objects: Delete all duplicates
23:31:14 [DBCLNP] Shrink Database
23:31:14 [DBCLNP] Cleanup complete
23:31:14 [Plugins] No output received from the plugin "DBCLNP"
23:31:14 [Process Scan] Exclude ignored devices
23:31:14 [Process Scan] Processing scan results
23:31:14 [Process Scan] Print Stats
`23:43:24` CPU = 25%
23:43:27 [Update Device Name] Names Not Found : 3
23:43:27 [Notification] Check if something to report
23:43:27 [Notification] Included sections: ['new_devices', 'down_devices', 'events']
23:43:27 [Notification] No changes to report
23:43:27 [MAIN] Process: Wait
23:44:30 [Plugin utils] ---------------------------------------------
23:44:30 [Plugin utils] display_name: AVAHISCAN (Name discovery)
23:44:30 [Plugins] Executing: python3 /app/front/plugins/avahi_scan/avahi_scan.py
23:44:30 [AVAHISCAN] In script
`23:31:05` CPU = 95%
23:30:53 [DBCLNP] Trim AppEvents to less than 5000
23:30:53 [DBCLNP] Pholus_Scan: Delete all older than 30 days (PHOLUS_DAYS_DATA setting)
23:30:53 [DBCLNP] Pholus_Scan: Delete all duplicates
23:30:53 [DBCLNP] Plugins_Objects: Delete all duplicates
23:31:14 [DBCLNP] Shrink Database
23:31:14 [DBCLNP] Cleanup complete
23:31:14 [Plugins] No output received from the plugin "DBCLNP"
23:31:14 [Process Scan] Exclude ignored devices
23:31:14 [Process Scan] Processing scan results
23:31:14 [Process Scan] Print Stats
23:31:14 [Scan Stats] Devices Detected.......: 8
23:31:14 [Scan Stats] New Devices............: 0
23:31:14 [Scan Stats] Down Alerts............: 0
`00:40:25` CPU = 88%
00:40:25 [Process Scan] Exclude ignored devices
00:40:25 [Process Scan] Processing scan results
00:40:25 [Process Scan] Print Stats
00:40:25 [Scan Stats] Devices Detected.......: 8
00:40:25 [Scan Stats] New Devices............: 0
00:40:25 [Scan Stats] Down Alerts............: 0
00:40:25 [Scan Stats] New Down Alerts........: 0
00:40:25 [Scan Stats] New Connections........: 19
00:40:25 [Scan Stats] Disconnections.........: 0
00:40:25 [Scan Stats] IP Changes.............: 0
00:40:25 [Scan Stats] Scan Method Statistics:
00:40:25     ARPSCAN: 6
00:40:25     INTRNT: 1
00:40:25     local_MAC: 1
00:40:25 [Process Scan] Stats end
00:40:25 [Process Scan] Sessions Events (connect / disconnect)
00:40:25 [Process Scan] Creating new devices
00:40:25 [Process Scan] Updating Devices Info
00:40:25 [Process Scan] Voiding false (ghost) disconnections
00:40:25 [Process Scan] Pairing session events (connection / disconnection)
00:40:25 [Process Scan] Creating sessions snapshot
00:40:25 [Process Scan] Inserting scan results into Online_History
00:40:25 [Process Scan] Skipping repeated notifications
00:40:25 [Skip Repeated Notifications] Skip Repeated
00:40:25 [Plugin utils] ---------------------------------------------
00:40:25 [Plugin utils] display_name: AVAHISCAN (Name discovery)
00:40:25 [Plugins] Executing: python3 /app/front/plugins/avahi_scan/avahi_scan.py
00:40:25 [AVAHISCAN] In script
00:40:25 [Database] Opening DB
00:40:25 [AVAHISCAN] Unknown devices count: 3
00:40:25 [AVAHISCAN] Attempt 1 - Ensuring D-Bus and Avahi daemon are running...
00:40:25 [AVAHISCAN] Avahi Daemon is already running.
00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.140']
00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
00:40:25 [AVAHISCAN] Scan: FAIL - check logs
00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.176']
00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
00:40:25 [AVAHISCAN] Scan: FAIL - check logs
00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.164']
00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
00:40:25 [AVAHISCAN] Scan: FAIL - check logs
00:40:25 [AVAHISCAN] Script finished
00:40:25 [Plugins] No output received from the plugin "AVAHISCAN"
00:40:25 [Plugin utils] ---------------------------------------------
00:40:25 [Plugin utils] display_name: NSLOOKUP (Name discovery)
00:40:25 [Plugins] Executing: python3 /app/front/plugins/nslookup_scan/nslookup.py
Anyway, in settings I changed `DAYS_TO_KEEP_EVENTS` from 90 to 30 and now it seems better, but AppEvents is still high (9252):
![Screenshot 2024-11-08 003516](https://github.com/user-attachments/assets/84130b40-0ec2-457d-b24d-bf885aac8498)
![image](https://github.com/user-attachments/assets/966daf32-e3a4-40d3-bc3d-505ccf6d68d6)
I will check again tomorrow and let you know...
For monitoring I'm using `Dozzle` (container logs and CPU/mem).
For time-series panels I use Grafana + Telegraf (Docker input plugin).
Thanks for the details!
It seems like the spikes are caused by the maintenance plugins.
You could increase the time between DBCLNP runs, but I wouldn't leave too much of a gap between the maintenance task runs. This would at least reduce the number of spikes.
AppEvents grow really quickly because there is a lot happening in the background, so the number doesn't surprise me. I'll consider removing that monitoring, or enabling users to turn it off, if it has too much of an impact on application performance. AppEvents are displayed under Integrations -> Workflows.
Thanks, I will check it again and let you know.
I also set it on new devices
for all 3 scanners.
No activity for almost a week, closing for now
Not sure what's going on with the DB, but it fully froze (RIP) the server... :(
@jokob-sk It seems that when the DB grows because of too many events etc., and the default 30s timeout is not enough for the DB cleanup, it keeps growing every time, indefinitely. So I cleaned the DB by increasing the timeout to 300s, then set it to 60s. But I think it still takes more CPU and memory than needed. I have about 60 containers running, and NetAlertX is still mostly in first place for CPU/mem.
Hi @masterwishx ,
Thanks for the update. I increased the default timeout to 300s; I hope this will prevent similar issues in the future. In general, however, this app will consume more resources during a scan than other apps. I am of course happy to improve the performance if you have any suggestions on how, but I am not sure what else to do. I am also happy for you or others to submit a PR.
I just implemented a GraphQL server in the dev branch to try to improve the performance a bit. However, that might only help with the front end.
Let me know if you have any suggestions. j
I am of course happy to increase the performance if you have any suggestions on how
Thanks. I'm not sure what the problem is, because I see CPU/mem spikes during wait, maybe because the SQLite WAL file is still about 1xx MB, but I will try to check again, maybe in htop, to see what's happening.
. I am also happy for you or others to submit a PR.
Sure, if I find something useful to add.
I am of course happy to increase the performance if you have any suggestions
For now my DB file is 34 MB, but the WAL file is still 244 MB. I also decreased some events etc. So since you increased the DB cleanup timeout, this (the DB growing in size) should not happen anymore.
But you can also control the WAL file size in SQLite: https://www.sqlite.org/wal.html (paragraph 6).
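The controls from that part of the SQLite docs can be applied per connection. A sketch follows; the `cap_wal_growth` helper name is made up, and note that both PRAGMAs are connection-level settings, so they have to be re-applied each time the application opens the database.

```python
import sqlite3

def cap_wal_growth(con, pages=1000, size_limit=4 * 1024 * 1024):
    """Limit WAL growth on an open connection.

    wal_autocheckpoint: checkpoint once the WAL exceeds `pages` pages.
    journal_size_limit: after a checkpoint, truncate the WAL file back
    down to at most `size_limit` bytes instead of leaving it large.
    """
    con.execute(f"PRAGMA wal_autocheckpoint={pages}")
    con.execute(f"PRAGMA journal_size_limit={size_limit}")
    # Read the values back to confirm they took effect.
    return (con.execute("PRAGMA wal_autocheckpoint").fetchone()[0],
            con.execute("PRAGMA journal_size_limit").fetchone()[0])
```

Usage would be something like `cap_wal_growth(sqlite3.connect(db_path))` right after the app opens its connection.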
Hi @masterwishx ,
Thanks for that! A question: the below indicates that there will be issues reading the DB if this happens while another process tries to access it.
readers might block while the checkpoint is running.
NAX has multiple processes accessing the DB. E.g.:
You can search for "SELECT" in the code base and you will find lots of front end files (php) that have access to the DB - it's a legacy code base and that's how the original author handled things...
My worry is that if I turn this on some of these might fail unexpectedly and debugging this would be pretty difficult.
I already mitigated concurrent reads of the DB by creating a middle-layer API, but it's not covering the whole front end. The application was originally not using any middle layer or caching, so everything was always read from the DB and written to it right away. This is of course not scalable so I slowly refactored a lot of the pages, but there are still plenty that rely on direct DB access.
Recently I started creating a GraphQL server middle layer, but rewriting all pages to use it takes time (it took me about a week to rewrite the Devices listing page, and I still think there are some direct DB calls left).
Long story short: happy to improve on this, but I think this will be an ongoing effort, and I worry that introducing regular checkpoints might cause other issues, which would be much harder to deal with at this point.
Anyway, happy to hear your thoughts and suggestions. j
Thanks for your amazing work, I really like your project; I know it was started by another author :) ... If GraphQL can decrease the CPU used by the DB, that would be cool.
For now I deleted the WAL file and SQLite made a new one, so now the DB is 18 MB and the WAL is 87 MB. I also checked with htop:
You can search for "SELECT" in the code base and you will find lots of front end files (php) that have access to the DB - it's a legacy code base and that's how the original author handled things...
Yeah, I looked, but only very quickly :)
I also have a lot of:
22:53:02 [MAIN] Process: Wait
22:54:03 [Update Device Name] Trying to resolve devices without name. Unknown devices count: 3
22:54:03 [Update Device Name] Pholus entries from prev scans: 7
22:54:04 [Update Device Name] Names Found (DiG/mDNS/NSLOOKUP/NBTSCAN/Pholus): 0 (0/0/0/0/0)
22:54:04 [Update Device Name] Names Not Found : 3
22:54:04 [Notification] Check if something to report
22:54:04 [Notification] Included sections: ['new_devices', 'down_devices', 'events']
22:54:04 [Notification] No changes to report
22:54:04 [MAIN] Process: Wait
Not sure how to stop it trying to resolve unknown names?
Hi @masterwishx ,
I tried adding a TRUNCATE statement to the DBCLNP plugin, let's see if that helps. I also added the WAL size to the maintenance section so it's easier to check:
If you change the name to anything other than the default strings, it will stop trying to resolve them, so e.g. "Not known" will stop NAX from trying to resolve the device.
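The WAL-size check mentioned above could be sketched like this. The `db_sizes` helper name and the `/app/db/app.db` path are assumptions for illustration, not the actual maintenance-section code.

```python
import os

def db_sizes(db_path):
    """Return main DB and -wal file sizes in MB (0.0 if a file is missing)."""
    def mb(p):
        return round(os.path.getsize(p) / (1024 * 1024), 1) if os.path.exists(p) else 0.0
    return {"db": mb(db_path), "wal": mb(db_path + "-wal")}

if __name__ == "__main__":
    # Assumed container path; adjust to your volume mapping.
    print(db_sizes("/app/db/app.db"))
```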
so e.g. "Not known" will stop NAX from trying to resolve them.
OK, thanks, I will change it. I was thinking that maybe I had missed some setting for auto-changing the name or something similar in this case...
I tried adding a TRUNCATE statement to the DBCLNP plugin, let's see if that helps. I also added the WAL size to the maintenance section so it's easier to check:
Cool, thanks
Is there an existing issue for this?
Current Behavior
Using the latest version in Docker on Unraid, the app has had high CPU and memory usage lately; the DB file is 1.7 GB:
Expected Behavior
Low usage.
Steps To Reproduce
...
app.conf
docker-compose.yml
What branch are you running?
Production
app.log
...
Debug enabled