jokob-sk / NetAlertX

🖧🔍 WIFI / LAN intruder detector. Scans for devices connected to your network and alerts you if new and unknown devices are found.
GNU General Public License v3.0
3.15k stars 186 forks

High CPU and MEM usage always #769

Closed masterwishx closed 1 week ago

masterwishx commented 3 months ago

Is there an existing issue for this?

Current Behavior

Using the latest version in Docker on Unraid, the app has had a lot of CPU and memory usage lately; the db file is 1.7 GB:

image

image

Expected Behavior

Low CPU and memory usage.

Steps To Reproduce

...

app.conf

# General
#---------------------------
LOADED_PLUGINS=['ARPSCAN','CSVBCKP','DBCLNP','DDNS','INTRNT','MAINT','NEWDEV','NSLOOKUP','NTFPRCS','PHOLUS','SETPWD','VNDRPDT','WEBHOOK','WORKFLOWS']
LOADED_PLUGINS__metadata="{}"
SCAN_SUBNETS=['192.168.0.0/24 --interface=bond0']
SCAN_SUBNETS__metadata="{}"
LOG_LEVEL='verbose'
LOG_LEVEL__metadata="{}"
TIMEZONE='Asia/Jerusalem'
TIMEZONE__metadata="{}"
PLUGINS_KEEP_HIST=10000
PLUGINS_KEEP_HIST__metadata="{}"
REPORT_DASHBOARD_URL='http://192.168.0.199:20211'
REPORT_DASHBOARD_URL__metadata="{}"
UI_LANG='English'
UI_LANG__metadata="{}"
UI_PRESENCE=['online','offline','archived']
UI_PRESENCE__metadata="{}"
UI_DEV_SECTIONS=[]
UI_DEV_SECTIONS__metadata="{}"
UI_MY_DEVICES=['online','offline','archived','new','down']
UI_MY_DEVICES__metadata="{}"
UI_NOT_RANDOM_MAC=[]
UI_NOT_RANDOM_MAC__metadata="{}"
UI_ICONS= [..redacted ...]
UI_ICONS__metadata="{}"
UI_REFRESH=0
UI_REFRESH__metadata="{}"
DAYS_TO_KEEP_EVENTS=90
DAYS_TO_KEEP_EVENTS__metadata="{}"
HRS_TO_KEEP_NEWDEV=0
HRS_TO_KEEP_NEWDEV__metadata="{}"
API_CUSTOM_SQL='SELECT * FROM Devices WHERE dev_PresentLastScan = 0'
API_CUSTOM_SQL__metadata="{}"
NETWORK_DEVICE_TYPES=['AP','Gateway','Firewall','Hypervisor','Powerline','Switch','WLAN','PLC','Router','USB LAN Adapter','USB WIFI Adapter','Internet']
NETWORK_DEVICE_TYPES__metadata="{}"
etc ...

docker-compose.yml

...

What branch are you running?

Production

app.log

...

Debug enabled

jokob-sk commented 3 months ago

Hi,

Please post your app.log file of when the DBCLNP plugin is run. I want to check if it's executed without issues. You can also try to execute the DB cleanup plugin manually:

image

Also please post the stats of which tables have the most records, e.g.:

image
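For reference, the per-table row counts can be pulled straight from the SQLite file; a minimal sketch (the `app.db` path is an assumption, adjust it to wherever your container maps the database):

```python
import sqlite3


def table_row_counts(db_path: str) -> dict:
    """Return {table_name: row_count} for every table in a SQLite database."""
    con = sqlite3.connect(db_path)
    try:
        tables = [r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        # COUNT(*) per table shows which tables dominate a bloated db file
        return {t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
                for t in tables}
    finally:
        con.close()


if __name__ == "__main__":
    # Path is an assumption - adjust to your mapped volume
    for table, count in table_row_counts("app.db").items():
        print(f"{table:24} {count}")
```

The tables with the most rows are usually the ones the cleanup settings (e.g. `DAYS_TO_KEEP_EVENTS`, `PLUGINS_KEEP_HIST`) should be trimming.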

masterwishx commented 3 months ago

image

masterwishx commented 3 months ago

image

masterwishx commented 3 months ago

DBCLNP

I ran the DB cleanup ...

masterwishx commented 3 months ago

I found an error in the log when running the cleanup:

09:01:29 [DBCLNP] In script
09:01:29 [DBCLNP] Upkeep Database:
09:01:29 [DBCLNP] Online_History: Delete all but keep latest 150 entries
09:01:29 [DBCLNP] Events: Delete all older than 90 days (DAYS_TO_KEEP_EVENTS setting)
09:01:29 [DBCLNP] Plugins_History: Trim Plugins_History entries to less than 10000 per Plugin (PLUGINS_KEEP_HIST setting)
09:01:44 [DBCLNP] Plugins_History: Trim Notifications entries to less than 100
09:01:44 [DBCLNP] Trim AppEvents to less than 5000
09:01:44 [DBCLNP] Pholus_Scan: Delete all older than 7 days (PHOLUS_DAYS_DATA setting)
09:01:44 [DBCLNP] Pholus_Scan: Delete all duplicates
09:01:44 [DBCLNP] Plugins_Objects: Delete all duplicates
09:01:59 [Plugins] ⚠ ERROR - TIMEOUT - the plugin DBCLNP forcefully terminated as timeout reached. Increase TIMEOUT setting and scan interval.
09:01:59 [Plugins] No output received from the plugin DBCLNP - enable LOG_LEVEL=debug and check logs
masterwishx commented 3 months ago

@jokob-sk So I enabled debug and it seems there was not enough time to clean the db, so I increased the timeout from 30s to 300s. The db size is now 50 MB, but app.db-wal is still 1.6 GB. Should I set it back to 30s?

masterwishx commented 3 months ago

It still seems to show high usage in wait mode, maybe because app.db-wal is still 1.6 GB:

image

jokob-sk commented 3 months ago

Hi @masterwishx ,

Glad to see there's progress. According to ChatGPT, some conditions need to be met for the .wal file to shrink. As I read it, this should happen automatically after some time. You can try restarting the container and not accessing the UI for a bit, so no unnecessary connections stay open.

If the size doesn't go down within a few hours/a day, I'd consider rebuilding the container with the CSV + setting file approach (🧭 Backup strategies -> Scenario 2: Corrupted database), but please keep backups:

https://github.com/jokob-sk/NetAlertX/blob/main/docs/BACKUPS.md

I hope this will then reduce CPU utilization significantly.

Checkpointing:

    The WAL file shrinks when a checkpoint occurs. A checkpoint is the process of transferring data from the WAL file back to the main database file. This process can be triggered manually (using the PRAGMA wal_checkpoint command) or automatically when the WAL file reaches a certain size.

Completion of Checkpointing:

    For the WAL file to shrink, the checkpoint process must successfully transfer all the committed transactions to the main database file. If there are still active readers or writers on the database, the checkpoint cannot fully complete, and the WAL file will not shrink.

Truncation:

    After a checkpoint, if all transactions have been transferred, the WAL file may be truncated to zero or a smaller size. This depends on the mode of checkpointing and the settings of the SQLite database.

Closing the Database Connection:

    In some cases, when the last connection to the database is closed, SQLite will attempt to shrink the WAL file as part of the cleanup process.

Reaching a Threshold:

    SQLite may also automatically initiate a checkpoint when the WAL file reaches a certain size (default is 1000 pages or about 4MB). If this threshold is crossed and a checkpoint is successfully performed, the WAL file can shrink.
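The manual checkpoint mentioned under "Checkpointing" can be sketched like this (the db path is an assumption, and the checkpoint only fully completes when no other reader or writer holds the database, so ideally stop the app first):

```python
import sqlite3


def truncate_wal(db_path: str):
    """Run a TRUNCATE checkpoint; returns (busy, wal_frames, checkpointed_frames).

    TRUNCATE mode copies all committed WAL frames into the main db file and
    resets the -wal file to zero bytes. busy == 1 means another reader or
    writer blocked the checkpoint from fully completing.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute("PRAGMA wal_checkpoint(TRUNCATE);").fetchone()
    finally:
        con.close()


if __name__ == "__main__":
    busy, wal_frames, ckpt_frames = truncate_wal("app.db")  # assumed path
    print(f"busy={busy} wal_frames={wal_frames} checkpointed={ckpt_frames}")
```

If `busy` comes back as 1, some connection (e.g. an open UI session) is still holding the database and the -wal file will not shrink yet.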
masterwishx commented 3 months ago

> If the size doesn't go down within a few hours/a day, I'd consider rebuilding the container with the CSV + setting file approach (🧭 Backup strategies -> Scenario 2: Corrupted database), but please keep backups.

Thanks, will try it. You have a typo in https://github.com/jokob-sk/NetAlertX/blob/main/docs/BACKUPS.md#data-and-cackup-storage

masterwishx commented 3 months ago

"cackup" instead of "backup"

jokob-sk commented 3 months ago

> Thanks will try it. You have typo in https://github.com/jokob-sk/NetAlertX/blob/main/docs/BACKUPS.md#data-and-cackup-storage cackup instead backup

haha, thanks, fixed

masterwishx commented 3 months ago

> haha, thanks, fixed

When I try to restore the last backup, I get this in the log:

image

masterwishx commented 3 months ago

image

masterwishx commented 3 months ago

I removed the db files manually and copied them from a backup, but there still seems to be high CPU usage:

image

image

Are these the right permissions?

image

jokob-sk commented 3 months ago

Hi @masterwishx ,

You can check the permissions by following this guide: https://github.com/jokob-sk/NetAlertX/blob/main/docs/FILE_PERMISSIONS.md
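To go along with the permissions guide, a quick hedged sketch for inspecting ownership and mode of the db files from inside the container (the `/app/db/...` paths are assumptions based on the container layout; adjust to your mapped volume):

```python
import grp
import os
import pwd
import stat


def describe_file(path: str):
    """Return (octal mode, owner name, group name) for a file."""
    st = os.stat(path)
    return (oct(stat.S_IMODE(st.st_mode)),
            pwd.getpwuid(st.st_uid).pw_name,
            grp.getgrgid(st.st_gid).gr_name)


if __name__ == "__main__":
    # Assumed paths - compare the output against the FILE_PERMISSIONS guide
    for path in ("/app/db/app.db", "/app/db/app.db-wal", "/app/db/app.db-shm"):
        if os.path.exists(path):
            mode, owner, group = describe_file(path)
            print(f"{path}: mode={mode} owner={owner} group={group}")
        else:
            print(f"{path}: missing")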

You can check in the log what is happening during those CPU spikes. It might be as simple as decreasing the frequency of scans or disabling certain plugins.

masterwishx commented 3 months ago

> You can check in the log what is happening during those CPU spikes.

Not sure if it's related (I can't find the exact time in Grafana / docker Telegraf), but with debug enabled I found a lot of:

image

masterwishx commented 3 months ago

Well, I made a new container in Unraid, added all the config manually plus the devices CSV, and it seems OK now. I'm not sure what the problem was, but there was also a lot of I/O (in GBs)... I think I can close this for now; maybe you can comment on something for importing the config?

jokob-sk commented 3 months ago

Thanks for the update. You can set the LOG_LEVEL to NONE or MINIMAL and these logs should disappear.

masterwishx commented 3 months ago

> You can set the LOG_LEVEL to NONE or MINIMAL and these logs should disappear.

Couldn't this increase CPU usage?

jokob-sk commented 3 months ago

Hmmm, not sure, haven't tried/monitored the app that closely. It might be worth a try to minimize the log writes by adjusting the setting to a less verbose value, such as MINIMAL

masterwishx commented 3 weeks ago

> It might be worth a try to minimize the log writes by adjusting the setting to a less verbose value, such as MINIMAL

Hi, again I have the same issue:

image

masterwishx commented 3 weeks ago

image

It seems there are 7970 events in the db, but the setting is 5000 events:

image

masterwishx commented 3 weeks ago

image

jokob-sk commented 3 weeks ago

Hi @masterwishx ,

> it seems 7970 events in db but in setting 5000 events:

There is a delay in the cleanup script, so this is expected. By default it runs every 30 minutes.

  1. What are the usual values?
  2. What scanners do you have enabled?
  3. What is the size of your log file(s)?
  4. Are there any exceptions in the app.log?
  5. What is happening during the CPU spikes?

Also, what monitoring tool do you use, I might set it up myself :)

masterwishx commented 3 weeks ago
  1. usual values ?

  2. Scanners: image

  3. app.log is now 18 MB

  4. 
    23:33:50 [Database] Opening DB
    23:33:50 [AVAHISCAN] Unknown devices count: 3
    23:33:50 [AVAHISCAN] Attempt 1 - Ensuring D-Bus and Avahi daemon are running...
    23:33:50 [AVAHISCAN] Avahi Daemon is already running.
    23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.140']
    23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running

    23:33:50 [AVAHISCAN] Scan: FAIL - check logs
    23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.176']
    23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
    23:33:50 [AVAHISCAN] Scan: FAIL - check logs
    23:33:50 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.164']
    23:33:50 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
    23:33:50 [AVAHISCAN] Scan: FAIL - check logs
    23:33:50 [AVAHISCAN] Script finished


5. On the spike at `23:31:05`, CPU = 95%:

    23:30:53 [DBCLNP] Trim AppEvents to less than 5000
    23:30:53 [DBCLNP] Pholus_Scan: Delete all older than 30 days (PHOLUS_DAYS_DATA setting)
    23:30:53 [DBCLNP] Pholus_Scan: Delete all duplicates
    23:30:53 [DBCLNP] Plugins_Objects: Delete all duplicates
    23:31:14 [DBCLNP] Shrink Database
    23:31:14 [DBCLNP] Cleanup complete
    23:31:14 [Plugins] No output received from the plugin "DBCLNP"
    23:31:14 [Process Scan] Exclude ignored devices
    23:31:14 [Process Scan] Processing scan results
    23:31:14 [Process Scan] Print Stats
    23:31:14 [Scan Stats] Devices Detected.......: 8
    23:31:14 [Scan Stats] New Devices............: 0
    23:31:14 [Scan Stats] Down Alerts............: 0

At `23:43:24`, CPU = 25%:

    23:43:27 [Update Device Name] Names Not Found : 3
    23:43:27 [Notification] Check if something to report
    23:43:27 [Notification] Included sections: ['new_devices', 'down_devices', 'events']
    23:43:27 [Notification] No changes to report
    23:43:27 [MAIN] Process: Wait
    23:44:30 [Plugin utils] ---------------------------------------------
    23:44:30 [Plugin utils] display_name: AVAHISCAN (Name discovery)
    23:44:30 [Plugins] Executing: python3 /app/front/plugins/avahi_scan/avahi_scan.py
    23:44:30 [AVAHISCAN] In script


At `00:40:25`, CPU = 88%:

    00:40:25 [Process Scan] Exclude ignored devices
    00:40:25 [Process Scan] Processing scan results
    00:40:25 [Process Scan] Print Stats
    00:40:25 [Scan Stats] Devices Detected.......: 8
    00:40:25 [Scan Stats] New Devices............: 0
    00:40:25 [Scan Stats] Down Alerts............: 0
    00:40:25 [Scan Stats] New Down Alerts........: 0
    00:40:25 [Scan Stats] New Connections........: 19
    00:40:25 [Scan Stats] Disconnections.........: 0
    00:40:25 [Scan Stats] IP Changes.............: 0
    00:40:25 [Scan Stats] Scan Method Statistics:
    00:40:25 ARPSCAN: 6
    00:40:25 INTRNT: 1
    00:40:25 local_MAC: 1
    00:40:25 [Process Scan] Stats end
    00:40:25 [Process Scan] Sessions Events (connect / disconnect)
    00:40:25 [Process Scan] Creating new devices
    00:40:25 [Process Scan] Updating Devices Info
    00:40:25 [Process Scan] Voiding false (ghost) disconnections
    00:40:25 [Process Scan] Pairing session events (connection / disconnection)
    00:40:25 [Process Scan] Creating sessions snapshot
    00:40:25 [Process Scan] Inserting scan results into Online_History
    00:40:25 [Process Scan] Skipping repeated notifications
    00:40:25 [Skip Repeated Notifications] Skip Repeated
    00:40:25 [Plugin utils] ---------------------------------------------
    00:40:25 [Plugin utils] display_name: AVAHISCAN (Name discovery)
    00:40:25 [Plugins] Executing: python3 /app/front/plugins/avahi_scan/avahi_scan.py
    00:40:25 [AVAHISCAN] In script
    00:40:25 [Database] Opening DB
    00:40:25 [AVAHISCAN] Unknown devices count: 3
    00:40:25 [AVAHISCAN] Attempt 1 - Ensuring D-Bus and Avahi daemon are running...
    00:40:25 [AVAHISCAN] Avahi Daemon is already running.
    00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.140']
    00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
    00:40:25 [AVAHISCAN] Scan: FAIL - check logs
    00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.176']
    00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
    00:40:25 [AVAHISCAN] Scan: FAIL - check logs
    00:40:25 [AVAHISCAN] DEBUG CMD :['avahi-resolve', '-a', '192.168.0.164']
    00:40:25 [AVAHISCAN] ⚠ ERROR - Failed to create client object: Daemon not running
    00:40:25 [AVAHISCAN] Scan: FAIL - check logs
    00:40:25 [AVAHISCAN] Script finished
    00:40:25 [Plugins] No output received from the plugin "AVAHISCAN"
    00:40:25 [Plugin utils] ---------------------------------------------
    00:40:25 [Plugin utils] display_name: NSLOOKUP (Name discovery)
    00:40:25 [Plugins] Executing: python3 /app/front/plugins/nslookup_scan/nslookup.py



Anyway, in settings I changed `DAYS_TO_KEEP_EVENTS` from 90 to 30 and now it seems better, but there are still a lot of AppEvents (9252):
![Screenshot 2024-11-08 003516](https://github.com/user-attachments/assets/84130b40-0ec2-457d-b24d-bf885aac8498)

![image](https://github.com/user-attachments/assets/966daf32-e3a4-40d3-bc3d-505ccf6d68d6)

I will check again tomorrow and let you know ....

For monitoring I'm using `Dozzle` (container logs and CPU/mem).
For the time series panels it's Grafana + Telegraf (Docker input plugin).
jokob-sk commented 2 weeks ago

Thanks for the details!

It seems like the spikes are caused by the maintenance plugins.

You could increase the time between DBCLNP, but I wouldn't have too much of a gap between the maintenance task runs. This would reduce the number of spikes at least.

AppEvents grow really quickly because there is a lot happening in the background, so the number doesn't surprise me. I'll consider removing that monitoring, or enabling users to turn it off, if it has too much of an impact on the application performance. AppEvents are displayed under Integrations -> Workflows.

masterwishx commented 2 weeks ago

> You could increase the time between DBCLNP, but I wouldn't have too much of a gap between the maintenance task runs.

Thanks, I will check it again and let you know. I also set the "on new device" option for all 3 scanners.

jokob-sk commented 1 week ago

No activity for almost a week, closing for now

masterwishx commented 1 week ago

> No activity for almost a week, closing for now

Not sure what's going on with the DB, but it completely froze (RIP) the server... :(

image

masterwishx commented 1 week ago

@jokob-sk It seems that when the db grows (because of many events, etc.) and the default 30s timeout is not enough for the db cleanup, it just keeps growing indefinitely. So I cleaned the db by increasing the timeout to 300s, then set it to 60s. But I think it still takes more CPU and memory than needed. I have about 60 containers running, but NetAlertX is still mostly in first place for CPU/mem.

jokob-sk commented 1 week ago

Hi @masterwishx ,

Thanks for the update. I increased the default timeout to 300s, I hope this will prevent similar issues in the future. In general however this app will consume more resources during a scan than other apps. I am of course happy to increase the performance if you have any suggestions on how, but I am not sure what else to do. I am also happy for you or others to submit a PR.

I just implemented a GraphQL server in the dev branch to try to improve performance a bit. However, that might only help with the front end.

Let me know if you have any suggestions. j

masterwishx commented 1 week ago

> I am of course happy to increase the performance if you have any suggestions on how

Thanks, I'm not sure what the problem is, because I see CPU/mem spikes during the wait phase, maybe because the SQLite WAL file is still about 1xx MB. I will try to check again, maybe in htop, to see what's happening.

masterwishx commented 1 week ago

> I am also happy for you or others to submit a PR.

Sure, if I find something useful to add.

masterwishx commented 1 week ago

> I am of course happy to increase the performance if you have any suggestions

For now my db file is 34 MB but the WAL file is still 244 MB. I also decreased some events, etc. So now that you have increased the db clean timeout, this (the db growing in size) should not happen anymore.

But you can also control the WAL file size in SQLite: https://www.sqlite.org/wal.html (section 6)
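For reference, the two knobs from section 6 of that page can be sketched as follows. Note that both pragmas apply per connection, so the app itself would have to set them when it opens the database; the values here are illustrative, not NetAlertX defaults:

```python
import sqlite3


def cap_wal_growth(db_path: str,
                   autocheckpoint_pages: int = 1000,
                   size_limit_bytes: int = 16 * 1024 * 1024) -> None:
    """Apply per-connection pragmas that limit how large the -wal file stays."""
    con = sqlite3.connect(db_path)
    try:
        # Checkpoint automatically once the WAL holds this many pages
        # (1000 pages, roughly 4 MB, is also the SQLite default).
        con.execute(f"PRAGMA wal_autocheckpoint={autocheckpoint_pages};")
        # After a successful checkpoint, truncate the -wal file back down
        # to at most this many bytes instead of leaving it large.
        con.execute(f"PRAGMA journal_size_limit={size_limit_bytes};")
    finally:
        con.close()
```

As discussed below, turning this on for an app with many concurrent readers is a trade-off: readers might block while a checkpoint runs.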

jokob-sk commented 1 week ago

Hi @masterwishx ,

Thanks for that! A question: the below indicates that there will be issues reading the DB if this happens while another process tries to access the DB.

> readers might block while the checkpoint is running.

NAX has multiple processes accessing the DB. E.g.:

You can search for "SELECT" in the code base and you will find lots of front end files (php) that have access to the DB - it's a legacy code base and that's how the original author handled things...

My worry is that if I turn this on some of these might fail unexpectedly and debugging this would be pretty difficult.

I already mitigated concurrent reads of the DB by creating a middle-layer API, but it's not covering the whole front end. The application was originally not using any middle layer or caching, so everything was always read from the DB and written to it right away. This is of course not scalable so I slowly refactored a lot of the pages, but there are still plenty that rely on direct DB access.

Recently I started creating a GraphQL server middle layer, but rewriting all pages to use it takes time (it took me about a week to rewrite the Devices listing page, and I still think there are some direct DB calls left).

Long story short, I'm happy to improve on this, but I think this will be an ongoing effort, and I worry that introducing regular checkpoints might cause other issues, which would be much harder to deal with at this point.

Anyway, happy to hear your thoughts and suggestions. j

masterwishx commented 1 week ago

> Anyway, happy to hear your thoughts and suggestions. j

Thanks for your amazing work, I really like your project; I know it was started by another author :) ... If GraphQL can decrease the CPU used by the DB, that will be cool.

For now I deleted the WAL file and SQLite made a new one, so now the db = 18 MB and wal = 87 MB. I also checked with htop:

Screenshot 2024-11-19 113213 Screenshot 2024-11-19 113059

> You can search for "SELECT" in the code base and you will find lots of front end files (php) that have access to the DB - it's a legacy code base and that's how the original author handled things...

Yeah, I looked, but very quickly :)

Also, I have a lot of:

22:53:02 [MAIN] Process: Wait
22:54:03 [Update Device Name] Trying to resolve devices without name. Unknown devices count: 3
22:54:03 [Update Device Name] Pholus entries from prev scans: 7
22:54:04 [Update Device Name] Names Found (DiG/mDNS/NSLOOKUP/NBTSCAN/Pholus): 0 (0/0/0/0/0)
22:54:04 [Update Device Name] Names Not Found         : 3
22:54:04 [Notification] Check if something to report
22:54:04 [Notification] Included sections: ['new_devices', 'down_devices', 'events']
22:54:04 [Notification] No changes to report
22:54:04 [MAIN] Process: Wait

Not sure how to stop it trying to resolve unknown names?

jokob-sk commented 1 week ago

Hi @masterwishx ,

I tried adding a TRUNCATE statement to the dbclnp plugin - let's see if that helps. I also added the wal size to the maintenance section so it's easier to check:

image

If you change the name to anything other than the default strings, it will stop trying to resolve them; e.g. renaming a device to "Not known" will stop NAX from trying to resolve it.

masterwishx commented 1 week ago

> so e.g. "Not known" will stop NAX trying to resolve them.

OK, thanks, I will change it. I was thinking that maybe I had missed some setting for auto-changing the name, or something similar in this case...

masterwishx commented 1 week ago

> I tried adding a TRUNCATE statement to the dbclnp plugin - let's see if that helps. I also added the wal size in the maintenance section so it's easier to check:

Cool, thanks