sipcapture / homer7-docker

HOMER 7 Docker Images

postgres not honoring rotation env variable configuration #131

Closed n8twj closed 1 year ago

n8twj commented 1 year ago

I am running the hom7-prom-all example docker-compose from GitHub. None of the database rotation environment variables are working.

I have tried, among other settings:

  - "HEPLIFYSERVER_DBUSAGESCHEME=percentage"
  - "HEPLIFYSERVER_DBPERCENTAGEUSAGE=75%"

The database does not honor either setting.

When I exec into heplify-server the help screen acts like it sees the values, but doesn't honor them

-dbusagescheme Change value of DBUsageScheme. (default percentage)

-dbpercentageusage Change value of DBPercentageUsage. (default 75%)
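For reference, these settings follow the `HEPLIFYSERVER_` + upper-cased config key naming pattern. A compose fragment with the group of rotation/protection variables that would plausibly interact here (values are examples; whether `DBUSAGEPROTECTION` must be enabled for the percentage scheme to take effect is an assumption worth verifying against the startup config dump):

```yaml
heplify-server:
  environment:
    - "HEPLIFYSERVER_DBROTATE=true"
    - "HEPLIFYSERVER_DBUSAGEPROTECTION=true"
    - "HEPLIFYSERVER_DBUSAGESCHEME=percentage"
    - "HEPLIFYSERVER_DBPERCENTAGEUSAGE=75%"
    - "HEPLIFYSERVER_DBMAXSIZE=20GB"
```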

We are ONLY sending INVITE transactions to HEP via the kamailio 5.6 siptrace module:


route[HOMER] {
    sip_trace_mode("t");
    sip_trace();
    setflag(22);
}
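For context, a route like the one above is typically paired with siptrace modparams along these lines (a sketch; the address is a placeholder and parameter choices should be checked against the kamailio 5.6 siptrace docs):

```
# HEP capture target (placeholder address)
modparam("siptrace", "duplicate_uri", "sip:HOMER_IP:9060")
modparam("siptrace", "hep_mode_on", 1)        # ship traces as HEP
modparam("siptrace", "hep_version", 3)        # HEPv3, matching heplify-server
modparam("siptrace", "trace_to_database", 0)  # no local DB writes, HEP only
modparam("siptrace", "trace_on", 1)
modparam("siptrace", "trace_flag", 22)        # matches setflag(22) in the route
```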
pyysa commented 1 year ago

Have you checked the first line written in the docker logs when docker starts the heplify-server container? Does it include your settings correctly?

n8twj commented 1 year ago

When I exec into heplify-server the help screen acts like it sees the values, but doesn't honor them

-dbusagescheme Change value of DBUsageScheme. (default percentage)

-dbpercentageusage Change value of DBPercentageUsage. (default 75%)

n8twj commented 1 year ago

DBDropDaysRegister:1, DBDropDaysDefault:1, DBDropOnStart:true, DBUsageProtection:true, DBUsageScheme:"percentage", DBPercentageUsage:"50%", DBMaxSize:"20GB",

I even changed 75% to 50% in docker-compose.yml, and the change shows up on start.

DBMaxSize is also not getting honored. DBDropOnStart is also not getting honored.

n8twj commented 1 year ago

[Screenshot: Screen Shot 2022-10-03 at 9 35 23 AM]

This is a one-hour chart, with one hour of calling starting at 09:00 with only 27 calls per second (not even 1/4 of our traffic).

No REGISTER, NOTIFY, SUBSCRIBE, or OPTIONS messages are getting sent to HEP.

n8twj commented 1 year ago

2022/10/03 13:58:37.449761 server.go:297: INFO stats since last 5 minutes. PPS: 1601, HEP: 480303, Filtered: 0, Error: 0
2022/10/03 14:08:37.449725 server.go:297: INFO stats since last 5 minutes. PPS: 1563, HEP: 468991, Filtered: 0, Error: 0

lmangani commented 1 year ago

Are you sure it's not logs or something else filling up the disk? The math doesn't add up.

n8twj commented 1 year ago

This is a dedicated instance only running the hom7-prom-all docker-compose example.

$ sudo du -sh *
4.0K    alertmanager
4.0K    bootstrap
8.0K    docker-compose.yml
160K    grafana
4.0K    init-user-db.sh
8.0K    loki
7.9G    postgres-data
8.0K    prometheus
4.0K    README.md

In an hour postgres-data will be double that size
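To see where the space is actually going inside postgres (rather than at the filesystem level), a standard per-table size query run via `psql` inside the db container can help narrow it down:

```sql
-- Largest tables (including indexes and TOAST) in the current database
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```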

n8twj commented 1 year ago

$ sudo du -sh *
4.0K    alertmanager
4.0K    bootstrap
8.0K    docker-compose.yml
160K    grafana
4.0K    init-user-db.sh
8.0K    loki
8.6G    postgres-data
8.0K    prometheus
4.0K    README.md
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ date
Mon Oct 3 15:29:52 UTC 2022

(Not even 60 seconds later)

30 CPS

n8twj commented 1 year ago

[ec2-user@ip-172-31-42-201 hom7-prom-all]$ sudo du -sh *
4.0K    alertmanager
4.0K    bootstrap
8.0K    docker-compose.yml
160K    grafana
4.0K    init-user-db.sh
8.0K    loki
9.1G    postgres-data
8.0K    prometheus
4.0K    README.md
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ date
Mon Oct 3 15:31:46 UTC 2022
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ sudo du -sh *
4.0K    alertmanager
4.0K    bootstrap
8.0K    docker-compose.yml
160K    grafana
4.0K    init-user-db.sh
8.0K    loki
11G     postgres-data
8.0K    prometheus
4.0K    README.md
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ date
Mon Oct 3 15:36:24 UTC 2022
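For what it's worth, the growth rate above can be cross-checked against the reported packet rate with some quick arithmetic (numbers taken from this thread; the per-packet figure is an estimate, not a measurement):

```python
# Back-of-envelope: does ~1600 packets/s explain ~1.9 GB of growth in ~4.5 min?
pps = 1600                    # from "stats since last 5 minutes" (~480k HEP / 300s)
growth_bytes = 1.9 * 1024**3  # postgres-data grew 9.1G -> 11G between du runs
elapsed_s = 4 * 60 + 38       # 15:31:46 -> 15:36:24, roughly 278 seconds

bytes_per_packet = growth_bytes / (pps * elapsed_s)
print(f"~{bytes_per_packet:.0f} bytes stored per HEP packet")
```

That lands around 4–5 KB stored per packet, which is not obviously absurd for a full SIP message plus Postgres row and index overhead, so the raw volume may be broadly consistent with the traffic; the real question remains why rotation never frees any of it.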

lmangani commented 1 year ago

If in doubt, try adding this to the postgres container in your compose and see if it nukes any size down.

 logging:
      options:
        max-size: "10m"
n8twj commented 1 year ago

So logging by default knows how to access postgres? ...without feeding it any credentials?

n8twj commented 1 year ago

There is already a max log size on every service:

 logging:
      options:
        max-size: "50m"
n8twj commented 1 year ago

I asked someone else with significantly more Postgres experience than myself to take a look. He said under no circumstances should Postgres run as "root" and use the "root" database user. The moment we switched away from db user "root", the hard drive space instantly stopped being burnt up.

- "HEPLIFYSERVER_DBUSER=db"

POSTGRES_USER: db

test: ["CMD-SHELL", "psql -h 'localhost' -U 'db' -c '\\l'"]

n8twj commented 1 year ago

While the postgres issue is important to change... another issue was discovered: somehow the depends_on db requirement wasn't being honored.

Once we restart heplify-server after the rest of the stack is up, this issue returns (and homer-webapp works as expected).

lmangani commented 1 year ago

The postgres container used in this recipe is the official one from postgres. The code you're referring to is just a healthcheck, and does not imply postgres is being run as root. I guess this means with your changes, postgres was never being started, and this was the root cause of the other issues?

n8twj commented 1 year ago

But in the official configuration, they do not use the "root" db user. There is even a big red warning if you attempt to do so in the postgres CLI.

n8twj commented 1 year ago

We are now back to burning up 25 GiB of hard drive space in about 2 hours with a maximum of 18 CPS.

n8twj commented 1 year ago
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ docker-compose up -d
Creating network "hom7-prom-all_default" with the default driver
Creating grafana      ... done
Creating prometheus ... done
Creating nodeexporter ... done
Creating alertmanager ... done
Creating db           ... done
Creating heplify-server ... done
Creating homer-webapp   ... done
[ec2-user@ip-172-31-42-201 hom7-prom-all]$ docker logs heplify-server
Could not find toml config file, use flag defaults. <nil>
2022/10/05 16:36:26.662120 server.go:93: INFO start heplify-server 1.57 with config.HeplifyServer{HEPAddr:"0.0.0.0:9060", HEPTCPAddr:"0.0.0.0:9061", HEPTLSAddr:"", HEPWSAddr:"", ESAddr:"", ESDiscovery:true, HEPv2Enable:true, ESUser:"", ESPass:"", LokiURL:"", LokiBulk:400, LokiTimer:4, LokiBuffer:100000, LokiHEPFilter:[]int{1, 5, 100}, ForceHEPPayload:[]int(nil), PromAddr:"0.0.0.0:9096", PromTargetIP:"", PromTargetName:"", DBShema:"homer7", DBDriver:"postgres", DBAddr:"db:5432", DBSSLMode:"disable", DBUser:"db", DBPass:"<private>", DBDataTable:"homer_data", DBConfTable:"homer_config", DBBulk:400, DBTimer:4, DBBuffer:400000, DBWorker:8, DBRotate:true, DBPartLog:"2h", DBPartIsup:"6h", DBPartSip:"2h", DBPartQos:"6h", DBDropDays:1, DBDropDaysCall:1, DBDropDaysRegister:1, DBDropDaysDefault:1, DBDropOnStart:true, DBUsageProtection:false, DBUsageScheme:"percentage", DBPercentageUsage:"50%", DBMaxSize:"20GB", DBProcDropLimit:2, Dedup:false, DiscardMethod:[]string(nil), CensorMethod:[]string(nil), AlegIDs:[]string(nil), ForceALegID:false, CustomHeader:[]string(nil), IgnoreCaseCH:false, SIPHeader:[]string{"ruri_user", "ruri_domain", "from_user", "from_tag", "to_user", "callid", "cseq", "method", "user_agent"}, LogDbg:"", LogLvl:"info", LogStd:true, LogSys:false, Config:"./heplify-server.toml", ConfigHTTPAddr:"", ConfigHTTPPW:"", Version:false, ScriptEnable:false, ScriptEngine:"lua", ScriptFolder:"", ScriptHEPFilter:[]int{1, 5, 100}, TLSCertFolder:".", TLSMinVersion:"1.2"}
2022/10/05 16:36:26.662595 prometheus.go:38: INFO expose metrics without or unbalanced targets
2022/10/05 16:36:26.667676 rotator.go:539: INFO start creating tables (2022-10-05 16:36:26.667640402 +0000 UTC m=+0.074928689)
2022/10/05 16:36:26.670003 rotator.go:544: ERR dial tcp 172.18.0.6:5432: connect: connection refused
2022/10/05 16:36:26.674425 rotator.go:547: ERR dial tcp 172.18.0.6:5432: connect: connection refused
2022/10/05 16:36:26.675225 rotator.go:550: ERR dial tcp 172.18.0.6:5432: connect: connection refused
2022/10/05 16:36:26.675356 rotator.go:552: INFO end creating tables (2022-10-05 16:36:26.675335751 +0000 UTC m=+0.082623925)
2022/10/05 16:36:26.677122 rotator.go:555: ERR dial tcp 172.18.0.6:5432: connect: connection refused
2022/10/05 16:36:26.677713 server.go:152: ERR dial tcp 172.18.0.6:5432: connect: connection refused
n8twj commented 1 year ago
$ docker logs db
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
2022-10-05 16:36:26.288 UTC [31] WARNING:  no usable system locales were found
performing post-bootstrap initialization ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/data -l logfile start

waiting for server to start....2022-10-05 16:36:26.999 UTC [42] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-05 16:36:27.017 UTC [43] LOG:  database system was shut down at 2022-10-05 16:36:26 UTC
2022-10-05 16:36:27.020 UTC [42] LOG:  database system is ready to accept connections
 done
server started
CREATE DATABASE

/usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/init-user-db.sh
CREATE DATABASE

waiting for server to shut down....2022-10-05 16:36:27.594 UTC [42] LOG:  received fast shutdown request
2022-10-05 16:36:27.595 UTC [42] LOG:  aborting any active transactions
2022-10-05 16:36:27.596 UTC [42] LOG:  background worker "logical replication launcher" (PID 49) exited with exit code 1
2022-10-05 16:36:27.597 UTC [44] LOG:  shutting down
2022-10-05 16:36:27.607 UTC [42] LOG:  database system is shut down
 done
server stopped

PostgreSQL init process complete; ready for start up.

2022-10-05 16:36:27.706 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-10-05 16:36:27.706 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-10-05 16:36:27.709 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-05 16:36:27.724 UTC [58] LOG:  database system was shut down at 2022-10-05 16:36:27 UTC
2022-10-05 16:36:27.727 UTC [1] LOG:  database system is ready to accept connections
2022-10-05 16:36:29.218 UTC [78] ERROR:  database "homer_config" already exists
2022-10-05 16:36:29.218 UTC [78] STATEMENT:  CREATE DATABASE homer_config OWNER db
lmangani commented 1 year ago

The heplify-server can't connect to the postgres container. This is common when things go wrong inside a container and/or when healthchecks are not working as intended, or both. Perhaps you shouldn't use docker and instead install the real components on your system for full control?

n8twj commented 1 year ago

I then immediately do:

$ docker restart heplify-server

Could not find toml config file, use flag defaults. <nil>
2022/10/05 16:39:44.048884 server.go:93: INFO start heplify-server 1.57 with config.HeplifyServer{HEPAddr:"0.0.0.0:9060", HEPTCPAddr:"0.0.0.0:9061", HEPTLSAddr:"", HEPWSAddr:"", ESAddr:"", ESDiscovery:true, HEPv2Enable:true, ESUser:"", ESPass:"", LokiURL:"", LokiBulk:400, LokiTimer:4, LokiBuffer:100000, LokiHEPFilter:[]int{1, 5, 100}, ForceHEPPayload:[]int(nil), PromAddr:"0.0.0.0:9096", PromTargetIP:"", PromTargetName:"", DBShema:"homer7", DBDriver:"postgres", DBAddr:"db:5432", DBSSLMode:"disable", DBUser:"db", DBPass:"<private>", DBDataTable:"homer_data", DBConfTable:"homer_config", DBBulk:400, DBTimer:4, DBBuffer:400000, DBWorker:8, DBRotate:true, DBPartLog:"2h", DBPartIsup:"6h", DBPartSip:"2h", DBPartQos:"6h", DBDropDays:1, DBDropDaysCall:1, DBDropDaysRegister:1, DBDropDaysDefault:1, DBDropOnStart:true, DBUsageProtection:false, DBUsageScheme:"percentage", DBPercentageUsage:"50%", DBMaxSize:"20GB", DBProcDropLimit:2, Dedup:false, DiscardMethod:[]string(nil), CensorMethod:[]string(nil), AlegIDs:[]string(nil), ForceALegID:false, CustomHeader:[]string(nil), IgnoreCaseCH:false, SIPHeader:[]string{"ruri_user", "ruri_domain", "from_user", "from_tag", "to_user", "callid", "cseq", "method", "user_agent"}, LogDbg:"", LogLvl:"info", LogStd:true, LogSys:false, Config:"./heplify-server.toml", ConfigHTTPAddr:"", ConfigHTTPPW:"", Version:false, ScriptEnable:false, ScriptEngine:"lua", ScriptFolder:"", ScriptHEPFilter:[]int{1, 5, 100}, TLSCertFolder:".", TLSMinVersion:"1.2"}
2022/10/05 16:39:44.050488 prometheus.go:38: INFO expose metrics without or unbalanced targets
2022/10/05 16:39:44.050575 rotator.go:539: INFO start creating tables (2022-10-05 16:39:44.050561264 +0000 UTC m=+0.013321060)
2022/10/05 16:39:50.271276 rotator.go:552: INFO end creating tables (2022-10-05 16:39:50.271254045 +0000 UTC m=+6.234013863)
2022/10/05 16:39:50.505989 postgres.go:61: INFO postgres connection established

This makes homer-webapp happy again [i.e. we can search for traces and view them]... but also brings back the issue of the database not getting rotated.

lmangani commented 1 year ago

That's what the healthcheck is for - it holds other containers until the DB is ready. Have you removed it by any chance?

n8twj commented 1 year ago

I presume you mean this:

  healthcheck:
      test: ["CMD-SHELL", "psql -h 'localhost' -U 'db' -c '\\l'"]
      interval: 1s
      timeout: 3s
      retries: 30

Which seems to state it is healthy:

2a45c8bbb6d5   postgres:11-alpine          "docker-entrypoint.s…"   8 minutes ago   Up 8 minutes (healthy)   5432/tcp                                                                                                                                              db
n8twj commented 1 year ago

Very simply, issuing docker restart heplify-server AFTER docker-compose up -d fixes the problem... thus the depends_on db check is not working as intended.
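One thing worth checking: the short-form `depends_on: [db]` only gates on container *start*, not on health. The long-form syntax ties startup to the healthcheck (supported in compose file format 2.1+ and the newer Compose Spec, but ignored by plain v3 files; a sketch against the existing healthcheck):

```yaml
heplify-server:
  depends_on:
    db:
      condition: service_healthy
```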

n8twj commented 1 year ago

...which brings us back to the lecture at hand... homer-webapp works again, and hard drive space isn't being released by the heplify-server app.

lmangani commented 1 year ago

Our only suggestion would be to replace the Docker postgres container with an actual Postgres on the host, and to investigate the space usage on the instance hands-on. Once we know what the issue is, we can most definitely fix it.

n8twj commented 1 year ago

Google tells me I am not the only person with this issue

https://groups.google.com/g/homer-discuss/c/bDsiFM3MpBQ

n8twj commented 1 year ago

Below are all the logs from heplify-server. Notice that "INFO start creating tables" is logged, so we are in the correct bit of code... however, this code never gets called:

https://github.com/sipcapture/heplify-server/blob/3294c40c4e56bcfecfe621b5f5c4d8fcfbeb92b9/rotator/rotator.go#L499

Could not find toml config file, use flag defaults. <nil>
2022/10/17 21:23:24.914792 server.go:93: INFO start heplify-server 1.57 with config.HeplifyServer{HEPAddr:"0.0.0.0:9060", HEPTCPAddr:"0.0.0.0:9061", HEPTLSAddr:"", HEPWSAddr:"", ESAddr:"", ESDiscovery:true, HEPv2Enable:true, ESUser:"", ESPass:"", LokiURL:"", LokiBulk:400, LokiTimer:4, LokiBuffer:100000, LokiHEPFilter:[]int{1, 5, 100}, ForceHEPPayload:[]int(nil), PromAddr:"0.0.0.0:9096", PromTargetIP:"", PromTargetName:"", DBShema:"homer7", DBDriver:"postgres", DBAddr:"db:5432", DBSSLMode:"disable", DBUser:"db", DBPass:"<private>", DBDataTable:"homer_data", DBConfTable:"homer_config", DBBulk:400, DBTimer:4, DBBuffer:400000, DBWorker:10, DBRotate:true, DBPartLog:"2h", DBPartIsup:"6h", DBPartSip:"2h", DBPartQos:"6h", DBDropDays:1, DBDropDaysCall:1, DBDropDaysRegister:1, DBDropDaysDefault:1, DBDropOnStart:true, DBUsageProtection:false, DBUsageScheme:"percentage", DBPercentageUsage:"50%", DBMaxSize:"20GB", DBProcDropLimit:2, Dedup:false, DiscardMethod:[]string(nil), CensorMethod:[]string(nil), AlegIDs:[]string(nil), ForceALegID:false, CustomHeader:[]string(nil), IgnoreCaseCH:false, SIPHeader:[]string{"ruri_user", "ruri_domain", "from_user", "from_tag", "to_user", "callid", "cseq", "method", "user_agent"}, LogDbg:"", LogLvl:"info", LogStd:true, LogSys:false, Config:"./heplify-server.toml", ConfigHTTPAddr:"", ConfigHTTPPW:"", Version:false, ScriptEnable:false, ScriptEngine:"lua", ScriptFolder:"", ScriptHEPFilter:[]int{1, 5, 100}, TLSCertFolder:".", TLSMinVersion:"1.2"}
2022/10/17 21:23:24.915263 prometheus.go:38: INFO expose metrics without or unbalanced targets
2022/10/17 21:23:24.916112 rotator.go:539: INFO start creating tables (2022-10-17 21:23:24.915489474 +0000 UTC m=+0.017967068)
2022/10/17 21:23:34.717995 rotator.go:552: INFO end creating tables (2022-10-17 21:23:34.717966654 +0000 UTC m=+9.820444224)
2022/10/17 21:23:35.113477 postgres.go:61: INFO postgres connection established
2022/10/17 21:28:24.915553 server.go:297: INFO stats since last 5 minutes. PPS: 439, HEP: 131735, Filtered: 0, Error: 0
2022/10/17 21:33:24.915660 server.go:297: INFO stats since last 5 minutes. PPS: 1278, HEP: 383609, Filtered: 0, Error: 0
2022/10/17 21:38:24.915470 server.go:297: INFO stats since last 5 minutes. PPS: 1034, HEP: 310394, Filtered: 0, Error: 0
2022/10/17 21:43:24.915481 server.go:297: INFO stats since last 5 minutes. PPS: 501, HEP: 150485, Filtered: 0, Error: 0
n8twj commented 1 year ago

not even at 45 04 *

n8twj commented 1 year ago

Plus, why would such a job only run once a day?

n8twj commented 1 year ago
2022/10/18 03:46:24.490337 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 1, Filtered: 0, Error: 2
2022/10/18 03:51:24.490257 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 03:56:24.490278 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:01:24.490243 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:06:24.491181 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:11:24.490237 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:16:24.490274 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:21:24.490183 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:26:24.490197 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:31:24.490215 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:36:24.493830 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:41:24.490255 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:46:24.490275 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:51:24.490165 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0
2022/10/18 04:56:24.490283 server.go:297: INFO stats since last 5 minutes. PPS: 0, HEP: 0, Filtered: 0, Error: 0