miekg / dns

DNS library in Go
https://miek.nl/2014/august/16/go-dns-package
BSD 3-Clause "New" or "Revised" License
8k stars 1.13k forks source link

Memory Leak in github.com/miekg/dns.parseZone #786

Closed ghost closed 5 years ago

ghost commented 6 years ago

I'm using coredns 1.2.2 with the built-in file plugin (https://github.com/coredns/coredns/tree/master/plugin/file) to serve a few RFC 1035 master zone files. coredns uses github.com/miekg/dns. I'm also using the more resent version of Go (1.11.1).

In this configuration, coredns memory usage grows steadily and crashes after about 18 hours with an OOM error with a server that contains 2GB of memory. Here's a snippet from the coredns log file showing the crash:

2018/10/10 13:56:17 [INFO] CoreDNS-1.2.2 2018/10/10 13:56:17 [INFO] linux/amd64, go1.11.1, 8a9c6174 ... 1360 fatal error: runtime: out of memory 1361 1362 runtime stack: 1363 runtime.throw(0x192f945, 0x16) 1364 /home/ubuntu/go1.11/src/runtime/panic.go:608 +0x72 1365 runtime.sysMap(0xc064000000, 0x4000000, 0x29ef978) 1366 /home/ubuntu/go1.11/src/runtime/mem_linux.go:156 +0xc7 1367 runtime.(mheap).sysAlloc(0x29d59a0, 0x4000000, 0x29d59b8, 0x7fd3ecb43a28) 1368 /home/ubuntu/go1.11/src/runtime/malloc.go:619 +0x1c7 1369 runtime.(mheap).grow(0x29d59a0, 0x1, 0x0) 1370 /home/ubuntu/go1.11/src/runtime/mheap.go:920 +0x42 1371 runtime.(mheap).allocSpanLocked(0x29d59a0, 0x1, 0x29ef988, 0x0) 1372 /home/ubuntu/go1.11/src/runtime/mheap.go:848 +0x337 1373 runtime.(mheap).alloc_m(0x29d59a0, 0x1, 0x430008, 0xc000044500) 1374 /home/ubuntu/go1.11/src/runtime/mheap.go:692 +0x119 1375 runtime.(mheap).alloc.func1() 1376 /home/ubuntu/go1.11/src/runtime/mheap.go:759 +0x4c 1377 runtime.(mheap).alloc(0x29d59a0, 0x1, 0xc000010008, 0x432339) 1378 /home/ubuntu/go1.11/src/runtime/mheap.go:758 +0x8a 1379 runtime.(mcentral).grow(0x29d6ed8, 0x0) 1380 /home/ubuntu/go1.11/src/runtime/mcentral.go:232 +0x94 1381 runtime.(mcentral).cacheSpan(0x29d6ed8, 0xc063aef0d8) 1382 /home/ubuntu/go1.11/src/runtime/mcentral.go:106 +0x2f8 1383 runtime.(mcache).refill(0x7fd3f5229000, 0xc0000a7b08) 1384 /home/ubuntu/go1.11/src/runtime/mcache.go:122 +0x95 1385 runtime.(mcache).nextFree.func1() 1386 /home/ubuntu/go1.11/src/runtime/malloc.go:749 +0x32 1387 runtime.systemstack(0xc00008cd00) 1388 /home/ubuntu/go1.11/src/runtime/asm_amd64.s:351 +0x66 1389 runtime.mstart() 1390 /home/ubuntu/go1.11/src/runtime/proc.go:1229 1392 goroutine 4058 [running]: 1393 runtime.systemstack_switch() 1394 /home/ubuntu/go1.11/src/runtime/asm_amd64.s:311 fp=0xc0620b7508 sp=0xc0620b7500 pc=0x45a340 1395 runtime.(*mcache).nextFree(0x7fd3f5229000, 0x8, 0xc06370d3c0, 0x0, 0xc000146a80) 1396 /home/ubuntu/go1.11/src/runtime/malloc.go:748 +0xb6 fp=0xc0620b7560 sp=0xc0620b7508 pc=0x40b7b6 1397 runtime.mallocgc(0x30, 0x176dfc0, 0x1, 0xc06370d3c0) 1398 /home/ubuntu/go1.11/src/runtime/malloc.go:903 +0x793 fp=0xc0620b7600 sp=0xc0620b7560 pc=0x40c103 1399 runtime.newobject(0x176dfc0, 0x446d56) 1400 /home/ubuntu/go1.11/src/runtime/malloc.go:1032 +0x38 fp=0xc0620b7630 sp=0xc0620b7600 pc=0x40c4e8 1401 github.com/miekg/dns.setCNAME(0xc063ff3770, 0x23, 0x3c00010005, 0x0, 0xc063aef080, 0xc0000a1d86, 0x12, 0xc000092b40, 0x22, 0x23, ...) 1402 /home/ubuntu/go/src/github.com/miekg/dns/scan_rr.go:515 +0x3f fp=0xc0620b7730 sp=0xc0620b7630 pc=0x71d01f 1403 github.com/miekg/dns.setRR(0xc063ff3770, 0x23, 0x3c00010005, 0x0, 0xc063aef080, 0xc0000a1d86, 0x12, 0xc000092b40, 0x22, 0x1b68b80, ...) 1404 /home/ubuntu/go/src/github.com/miekg/dns/scan_rr.go:29 +0xdd fp=0xc0620b78c0 sp=0xc0620b7730 pc=0x71731d 1405 github.com/miekg/dns.parseZone(0x1b44740, 0xc00000cea0, 0xc0000a1d86, 0x12, 0xc000092b40, 0x22, 0xc014998348, 0xc063256180, 0x0) 1406 /home/ubuntu/go/src/github.com/miekg/dns/scan.go:469 +0x1fdc fp=0xc0620b7f98 sp=0xc0620b78c0 pc=0x71292c 1407 runtime.goexit() 1408 /home/ubuntu/go1.11/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc0620b7fa0 sp=0xc0620b7f98 pc=0x45c2a1 1409 created by github.com/miekg/dns.parseZoneHelper 1410 /home/ubuntu/go/src/github.com/miekg/dns/scan.go:162 +0xc3

I enabled profiling in the coredns config file and have been watching the heap. Here are two examples showing this issue. Notice that github.com/miekg/dns.parseZone grows continually by about 100MB each hour:

File: coredns Type: inuse_space Time: Oct 11, 2018 at 3:43pm (EDT) Entering interactive mode (type "help" for commands, "o" for options) (pprof) top5 Showing nodes accounting for 385.64MB, 95.53% of 403.68MB total Dropped 11 nodes (cum <= 2.02MB) Showing top 5 nodes out of 23 flat flat% sum% cum cum% 106MB 26.26% 26.26% 303.02MB 75.06% github.com/miekg/dns.parseZone 98.50MB 24.40% 50.66% 98.50MB 24.40% github.com/miekg/dns.appendOrigin (inline) 98MB 24.28% 74.94% 98MB 24.28% github.com/miekg/dns.setCNAME 68MB 16.85% 91.78% 68MB 16.85% github.com/miekg/dns.zlexer 15.12MB 3.75% 95.53% 15.12MB 3.75% github.com/miekg/dns.parseZoneHelper

File: coredns Type: inuse_space Time: Oct 11, 2018 at 4:58pm (EDT) Entering interactive mode (type "help" for commands, "o" for options) (pprof) top5 Showing nodes accounting for 512.96MB, 96.51% of 531.51MB total Dropped 11 nodes (cum <= 2.66MB) Showing top 5 nodes out of 23 flat flat% sum% cum cum% 138.01MB 25.96% 25.96% 138.01MB 25.96% github.com/miekg/dns.appendOrigin (inline) 137.51MB 25.87% 51.84% 404.02MB 76.01% github.com/miekg/dns.parseZone 127.51MB 23.99% 75.83% 127.51MB 23.99% github.com/miekg/dns.setCNAME 90.50MB 17.03% 92.85% 90.50MB 17.03% github.com/miekg/dns.zlexer

Here is the coredns config file I am using: coredns.conf.txt

The zone files being served have about 100,000 records.

Thanks,

Brad

tmthrgd commented 6 years ago

Can you use list to see which lines are allocating? It would also be handy to know how many running goroutines you have, because they could be leaking.

ghost commented 6 years ago

It crashed again. I just restarted it. Let it grow awhile and you'll see more meaningful data. This is IPv6 only:

go tool pprof http://rpz-master.aws.cloud.iso.vt.edu:6060/debug/pprof/heap

tmthrgd commented 6 years ago

Profiles would be very handy, I'm getting this error though:

$ go tool pprof http://rpz-master.aws.cloud.iso.vt.edu:6060/debug/pprof/heap
Fetching profile over HTTP from http://rpz-master.aws.cloud.iso.vt.edu:6060/debug/pprof/heap
http://rpz-master.aws.cloud.iso.vt.edu:6060/debug/pprof/heap: Get http://rpz-master.aws.cloud.iso.vt.edu:6060/debug/pprof/heap: dial tcp [2600:1f16:9bf:9b02:e15d:256a:8a1a:5c54]:6060: connect: network is unreachable
failed to fetch any source profiles
tmthrgd commented 6 years ago

Oh it seems to be IPv6 only and I'm still stuck in the past with IPv4. Let's see if I can get around that.

tmthrgd commented 6 years ago

What happens if you apply a patch like this to scan.go in miekg/dns:

 // The text "; this is comment" is returned in Token.Comment. Comments inside the
 // RR are discarded. Comments on a line by themselves are discarded too.
 func ParseZone(r io.Reader, origin, file string) chan *Token {
-   return parseZoneHelper(r, origin, file, nil, 10000)
+   return parseZoneHelper(r, origin, file, nil, 200000)
 }

 func parseZoneHelper(r io.Reader, origin, file string, defttl *ttlState, chansize int) chan *Token {
tmthrgd commented 6 years ago

Do you see any "Parsing zone" messages printed to the log or stderr?

Edit: no you won't because they're being hidden.

tmthrgd commented 6 years ago

Is the zone file static? If so, you can temporarily avoid this problem by setting reload to 0 or by adding no_reload if you're not building from master.

There is a memory leak and it's pretty bad, and not all that easily fixed, but at least I do understand what's happing now. I'll detail tomorrow when I get a chance.

ghost commented 6 years ago

Thanks Tom. Yes, I already noticed that I could set reload to 0 to fix this, but that's the feature this application actually needs. When I used BIND for this task, I manually reloaded the zone. The zone data could change at anytime, upon which I need coreDNS to re-read the zone files and send notifies.

tmthrgd commented 6 years ago

@w8rbt In that case, I’d suggest trying to build coredns with the patch above. As long as that last number is larger than the number of lines in the zone file, your leak should be (temporarily) eliminated.

ghost commented 6 years ago

OK, thanks again @tmthrgd, I've rebuilt with the change. Will see how it does over the weekend. Appreciate your help.

miekg commented 6 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

Do you see any "Parsing zone" messages printed to the log or stderr?

I ran into this myself: https://miek.nl/2017/november/17/switching-to-vodafone-uncovered-a-bug/

Is the zone valid, might be a similar thing (although I fix that one)

/Miek

-- Miek Gieben

ghost commented 6 years ago

Thanks @miekg, I guess that depends on what you mean by valid. It's a block list zone used by our recursive DNS resolvers. They run Response Policy Zone (RPZ) to redirect clients to a 'safe' host when the clients request something in the block list zone (malware). It's authoritative for the zone, but it's also a bunch of lies. So in some ways it is valid and in other ways it is not, but I'm not sure that matters in this case. Having enough channels for the records seemed to have fixed it. Would be nice if that argument could be variable and change on reload as needed.

miekg commented 6 years ago

Valid means the text representation is valid. Little semantic checks are made in the parser

On Fri, 12 Oct 2018, 17:23 Richard B Tilley (Brad), < notifications@github.com> wrote:

Thanks @miekg https://github.com/miekg, I guess that depends on what you mean by valid. It's a block list zone used by our recursive DNS resolvers. They run Response Policy Zone (RPZ) to redirect clients to a 'safe' host when the clients request something in the block list zone (malware). It's authoritative for the zone, but it's also a bunch of lies. So in some ways it is invalid and in other ways it is not, but I'm not sure that matters in this case. Having enough channels for the records seemed to have fixed it. Would be nice if that argument could be variable and change on reload as needed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/miekg/dns/issues/786#issuecomment-429382272, or mute the thread https://github.com/notifications/unsubscribe-auth/AAVkWzTGQTYnLEWpgds0A_PTYfZ3aBCgks5ukMH6gaJpZM4XYXca .

ghost commented 6 years ago

OK, in that case, then yes, the records are indeed valid. There are 100,000 or so and they look like this:

bad.site.xyz IN CNAME redirect-host.iso.vt.edu. *.bad.site.xyz IN CNAME redirect-host.iso.vt.edu. ...

tmthrgd commented 6 years ago

Could you also test with coredns/coredns#2194? That should solve the issue regardless of zonefile size.

miekg commented 6 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

Could you also test with coredns/coredns#2194? That should solve the issue regardless of zonefile size.

It's likely(-er) that the bug is in coredns, than in this lib. As the CoreDNS code is way younger than the code here.

/Miek

-- Miek Gieben

miekg commented 6 years ago

I think making the channel buffered was a mistake. I recall I did that to make things faster :/ (I was young and foolish at the time)

tmthrgd commented 6 years ago

The buffered chan only served to hide this bug for so long. It only occurs with more than 10,000 records, and changing the size of the chan only pushes the problem further down the road while wasting more memory. It’s quite an expensive API overall with it’s two goroutines.

miekg commented 6 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

The buffered chan only served to hide this bug for so long. It only occurs with more than 10,000 records, and changing the size of the chan only pushes the problem further down the road while wasting more memory. It’s quite an expensive API overall with it’s two goroutines.

hmm. yeah. This def. needs some work.

miekg commented 6 years ago

If we're touching this we should look at ragel. I had a ragel parser in the go 0.6 days but dropped it because ragel at the time didn't support go 1.0. that has been fixed. Should also result in a faster parser.

On Sat, 13 Oct 2018, 13:34 Miek Gieben, miek@miek.nl wrote:

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

The buffered chan only served to hide this bug for so long. It only occurs with more than 10,000 records, and changing the size of the chan only pushes the problem further down the road while wasting more memory. It’s quite an expensive API overall with it’s two goroutines.

hmm. yeah. This def. needs some work.

ghost commented 6 years ago

Thanks @tmthrgd and @miekg. The quick fix patch is working well. No more memory leak. I will test the proper patch next week sometime. Hard for me to believe that I'm the first to hit this bug. Many authoritative zones have millions of records.

miekg commented 6 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

Thanks @tmthrgd and @miekg. The quick fix patch is working well. No more memory leak. I will test the proper patch next week sometime. Hard for me to believe that I'm the first to hit this bug. Many authoritative zones have millions of records.

Not many will be using coredns (yet) is my guess.

I'll merge the quick fix on the coredns' side.

miekg commented 6 years ago

Nice. I've actually checked in the original ragel code:

643ee5f5 - Add privatekey parser that does not use Ragel (7 years ago) Miek Gieben 8bfa46de - Farewall Ragel (7 years ago) Miek Gieben be27aa51 - Farewall Ragel (7 years ago) Miek Gieben 11476703 - Farewall Ragel (7 years ago) Miek Gieben 70ab04bd - Farewall Ragel (7 years ago) Miek Gieben 9d7dfc3d - Remove types.rl, we dont use Ragel anymore (7 years ago) Miek Gieben 981f1853 - First stab at parsing .priv key files with Ragel (7 years ago) Miek Gieben 023d3ce3 - Add generated parser so that people without Ragel can use it (7 years ago) Miek Gieben 31c9cf65 - Going to use Ragel (not lex/yacc) (7 years ago) Miek Gieben

Redoing this probably break backward compat which we do rarely, might be an opportunity to add some other things as well

tmthrgd commented 6 years ago

I want to propose the following API to replace (and deprecate) ParseZone:

type ZoneParser struct { ... }

func NewZoneParser(r io.Reader, origin, file string) *ZoneParser { ... }

func (zp *ZoneParser) Next() (RR, bool) { ... }

func (zp *ZoneParser) Err() error { ... }

func (zp *ZoneParser) Comment() string { ... }

The basic idea is that you call Next() in a for loop until it returns false, and then you check Err() for any parsing or IO errors. No goroutines, no leaks. It's pretty easy to convert the existing implementation as well.

I think the Comment() string method works better than having it returned from Next() as I presume most people are just interested in the RRs and it's still easy to get at the comment for anyone who needs them.

The existing ParseZone API can be reworked to use this internally, although you can't (really) avoid the need to drain the channel afterwards.

@miekg Thoughts?

miekg commented 6 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

I want to propose the following API to replace (and deprecate) ParseZone:

type ZoneParser struct { ... }

func NewZoneParser(r io.Reader, origin, file string) *ZoneParser { ... }

func (zp *ZoneParser) Next() (RR, bool) { ... }

func (zp *ZoneParser) Err() error { ... }

func (zp *ZoneParser) Comment() string { ... }

The basic idea is that you call Next() in a for loop until it returns false, and then you check Err() for any parsing or IO errors. No goroutines, no leaks. It's pretty easy to convert the existing implementation as well.

bit like text.Scanner which is also a nice interface.

I think the Comment() string method works better than having it returned from Next() as I presume most people are just interested in the RRs and it's still easy to get at the comment for anyone who needs them.

The existing ParseZone API can be reworked to use this internally, although you can't (really) avoid the need to drain the channel afterwards.

@miekg Thoughts?

I like it. Having two api for the time being is not bad, and we can think of a backward breaking release (now with versioning that also isn't too bad)

/Miek

-- Miek Gieben

miekg commented 5 years ago

[ Quoting notifications@github.com in "Re: [miekg/dns] Memory Leak in gith..." ]

Closed #786 via #794.

I'm about to kick of another miekg/dns release and use the new code here as well.

ghost commented 5 years ago

Thanks @tmthrgd! You've been awesome. Really appreciate the fix ;)

Brad