signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

Memory leak when slave server unavailable #231

Open preffect opened 6 years ago

preffect commented 6 years ago

When running replication-manager as a service, there appears to be a memory leak when the slave server is not available. I can easily reproduce this, seeing consistent growth of ~6MB / hour, and have seen it get as high as 1GB total.

Note: I am aware there is a memory leak in http-auth, but I have both that and the http-server turned off.

Is this a known issue? Is there anything else I can add to this to help track it down?

MariaDB version: 10.1.22
Replication manager version: 2.0.0-11-gc3654c71
OS: CentOS Linux release 7.3.1611 (Core)

TOPOLOGY CONFIG

--------

db-servers-hosts = "127.0.0.1:3306,remote.server.com:3306"
db-servers-credential = "user:pass"
replication-credential = "slave_user:pass"
db-servers-connect-timeout = 1
db-servers-prefered-master = "127.0.0.1:3306"

HTTP

-------

http-server = false
http-bind-address = "0.0.0.0"
http-port = "10001"
http-root = "/usr/share/replication-manager/dashboard"
http-auth = false
http-session-lifetime = 3600
http-bootstrap-button = false

tanji commented 6 years ago

We are not aware of such issues, we'll try to reproduce it ASAP. Thanks!

svaroqui commented 6 years ago

Hi,

Can you try to reproduce with the latest 2.0 build? If you can still reproduce with the latest build, then you can send me two extracts of
http://127.0.0.1:10001/debug/pprof/heap taken at different times (waiting for the memory to grow before the second extract).
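
A minimal sketch of how the two requested snapshots could be captured, assuming the debug server answers on 127.0.0.1:10001 as mentioned above; the 30-minute wait is an arbitrary placeholder:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// snapshot fetches a heap profile from the pprof endpoint and saves it to disk.
func snapshot(name string) error {
	resp, err := http.Get("http://127.0.0.1:10001/debug/pprof/heap")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	out, err := os.Create(name)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	if err := snapshot("heap-before.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	time.Sleep(30 * time.Minute) // wait for the memory to grow
	if err := snapshot("heap-after.pb.gz"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

The two files can then be compared with go tool pprof -base heap-before.pb.gz heap-after.pb.gz to see which allocation sites grew between snapshots.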

tx /svar

svaroqui commented 6 years ago

Also, the issue with Auth is removed in 2.1 by using https://server:10005/, which is a properly secured JWT login.

preffect commented 6 years ago

Hi svar,

I've upgraded to 2.0.0-21-g6c87cf3f and the memory leak still exists. I've attached 3 heap logs, each roughly an hour apart.

I've also attached a running counter of memory used by the replication-manager-osc monitor. As you can see, the monitor's memory usage increases in chunks of roughly 1MB every 10 minutes.
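
For reference, a hedged sketch of how such a memory counter could be produced on Linux by polling VmRSS from /proc; the PID argument and the 10-minute interval are assumptions for illustration, not details taken from the attached log:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// rssOf reads the resident set size line from /proc/<pid>/status.
func rssOf(pid string) string {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	pid := os.Args[1] // PID of the monitored process, e.g. replication-manager-osc
	for {
		fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), rssOf(pid))
		time.Sleep(10 * time.Minute)
	}
}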

heap_2018.05.11_11:52.gz
heap_2018.05.11_10:35.gz
heap_2018.05.11_09:19.gz
replication-manager-memory.log

svaroqui commented 6 years ago

Re,

I'm not able to reproduce yet, even trying hard on CentOS. Can you send us the content of the file clusterstate.json under /var/lib/replication/cluster-name/?

tx /svar

preffect commented 6 years ago

Doesn't look like there's anything terribly interesting...

{
    "servers": "127.0.0.1:3306,remote.server.com:3306",
    "crashes": null,
    "sla": {
        "firsttime": 1523297297,
        "uptime": 0,
        "uptimeFailable": 0,
        "uptimeSemisync": 0
    }
}

On a positive note, I've been running this on 40 other server pairs for a month now with no issues. It definitely seems to be an issue only with servers that can't connect to their failover. Actually, the two servers I'm seeing the memory leak on were accidentally set up with replication-manager even though they don't have failover servers. Maybe the leak is something that only happens during the initial setup?
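
Purely as a speculative illustration (this is not confirmed to be replication-manager's actual code), the symptom reported here is consistent with a common Go pattern where a fresh database/sql handle is created on every monitor tick and never closed when the host is unreachable; each sql.DB starts a background opener goroutine, so abandoned handles accumulate. A hedged sketch of that pattern, with a placeholder DSN:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// leakyProbe creates a new pool handle per call and never closes it.
// When the slave is unreachable, these handles (and their background
// goroutines) pile up on every tick.
func leakyProbe(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	return db.Ping() // on failure, db is abandoned without Close()
}

// fixedProbe releases the handle whether or not the slave answered.
func fixedProbe(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	return db.Ping()
}

func main() {
	dsn := "user:pass@tcp(remote.server.com:3306)/" // placeholder DSN
	for range time.Tick(2 * time.Second) {
		if err := leakyProbe(dsn); err != nil {
			log.Println("probe failed:", err)
		}
	}
}

If something like leakyProbe were running against an unreachable slave, the heap profiles above would show the growth concentrated in database/sql allocation sites.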

svaroqui commented 6 years ago

I'm testing a patch on 2.1. When it's ready, would you like to try 2.1 to see if the issue is reproducible there as well?

preffect commented 6 years ago

Sure. Will it be released soon? I'll watch for it.

svaroqui commented 6 years ago

Got some news! I possibly found a cause of this issue! If you get time to confirm that using a non-existent IP address instead of a hostname fixes the leak, this would validate my finding!
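
To make the suggested experiment concrete, a hedged sketch of the two failure modes being compared: a hostname forces a DNS lookup on every connection attempt, while an unassigned IP skips DNS entirely and fails at the TCP connect step. Both addresses are placeholders (192.0.2.1 is a reserved TEST-NET address that should never be assigned):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	for _, addr := range []string{"remote.server.com:3306", "192.0.2.1:3306"} {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err != nil {
			fmt.Println(addr, "->", err) // DNS error vs. connect timeout
			continue
		}
		conn.Close()
	}
}

If the leak disappears with the raw IP, the per-attempt DNS resolution path would be the prime suspect.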

I'll work on a fix and let you know when it's available!

preffect commented 6 years ago

I tried setting the replication slave to an IP that wasn't assigned to anything, but I still see the memory leak.

I was able to confirm that the leak also occurs on a server pair that was fully set up. I shut down one of my slave servers for a little over half an hour, and memory used went from 31468K to 34636K in three ~1MB steps.

After turning the slave server back on, the leak stopped, but it did not recover the lost memory. It seems the only way to recover it is to restart replication-manager.
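
One hedged note on that last observation: Go programs typically do not return freed heap to the OS right away, so a flat RSS after the leak stops does not by itself prove the memory is still referenced. Hypothetical instrumentation like the following (not an existing replication-manager feature) could distinguish live heap from memory that is free but still retained by the runtime:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapInuse is live heap; HeapIdle minus HeapReleased is memory
		// the runtime holds but is not currently using.
		fmt.Printf("in-use=%dKB idle=%dKB released=%dKB\n",
			m.HeapInuse/1024, m.HeapIdle/1024, m.HeapReleased/1024)
		time.Sleep(time.Minute)
	}
}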