No major work is going to be done on the 1.4 branch, so your best bet is making your program work on tip (Go 1.5) and discussing from there.
@bradfitz Ok. I am definitely willing to do that, but I don't have any idea where errors such as:
2015/06/02 13:11:50 Size is too large. Cannot allocate an array that big
goroutine 127 [running]:
github.com/timtadh/fs2/errors.Errorf(0x6477b0, 0x34, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/errors/error.go:16 +0x8c
github.com/timtadh/fs2/bptree.(*Varchar).Alloc(0xc2087f00c0, 0xdeaddeaddeaddead, 0x0, 0x0, 0x0)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/varchar.go:189 +0x89
github.com/timtadh/fs2/bptree.(*BpTree).newVarcharKey(0xc2084a2640, 0x72000, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:196 +0x1a1
github.com/timtadh/fs2/bptree.(*BpTree).leafInsert(0xc2084a2640, 0x72000, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc2081ba040, 0x8, 0x8, 0x4b5994, 0xc2084a2640, ...)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:214 +0xb5
github.com/timtadh/fs2/bptree.(*BpTree).insert(0xc2084a2640, 0x72000, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc2081ba040, 0x8, 0x8, 0x0, 0x0, ...)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:102 +0x1a4
github.com/timtadh/fs2/bptree.(*BpTree).internalInsert(0xc2084a2640, 0x91000, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc2081ba040, 0x8, 0x8, 0x0, 0x0, ...)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:134 +0x163
github.com/timtadh/fs2/bptree.(*BpTree).insert(0xc2084a2640, 0x91000, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc2081ba040, 0x8, 0x8, 0x0, 0xc2085ecea8, ...)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:100 +0x131
github.com/timtadh/fs2/bptree.(*BpTree).Add(0xc2084a2640, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc2081ba040, 0x8, 0x8, 0x0, 0x0)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fs2/bptree/insert.go:54 +0x311
github.com/timtadh/fsm/store.(*Fs2BpTree).Add(0xc2084a2660, 0xc20842c740, 0xdeaddeaddeaddead, 0x4c, 0xc208350380)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fsm/store/fs.go:136 +0x106
github.com/timtadh/fsm/mine.(*Miner).collector(0xc208018410, 0x7f0ea41b7640, 0xc2084a2660, 0xc20822a480)
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fsm/mine/mine.go:331 +0xf2
created by github.com/timtadh/fsm/mine.(*Miner).makeCollectors
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fsm/mine/mine.go:343 +0x1fb
are coming from. The error here is that somehow the bptree.(*Varchar).Alloc() method is getting 0xdeaddeaddeaddead as the size to allocate. That seems to be a "magic" value. All of my tests pass on 1.5, so I am really not sure why this happens. I should note that I do interface with some native code via cgo, if that makes any difference. Perhaps this error is coming from issues with the new garbage collector?
That means that you have a dangling pointer. It's gotten easier to get those if you allocate memory in cgo and use a finalizer to free it. Or it could be cgo related in some other way.
$ go version
go version devel +70cf735 Tue Jun 2 13:55:40 2015 +0000 linux/amd64
@bradfitz / @ianlancetaylor Something funky is definitely going on. The pointer that was marked at 0xdead comes from a channel. Basically the code looks like this:
type labelGraph struct {
	label []byte
	sg    *goiso.SubGraph
}

type Collector struct {
	ch   chan *labelGraph
	tree store.SubGraphs
}

func (c *Collector) send(sg *goiso.SubGraph) {
	label := sg.Label()
	c.ch <- &labelGraph{label, sg}
}

func (c *Collector) collect() {
	for lg := range c.ch {
		c.tree.Add(lg.label, lg.sg)
	}
}

func (c *Collector) close() {
	close(c.ch)
}

func otherThread() {
	go c.collect()
	for sg := range candidates {
		c.send(sg)
	}
	c.close()
}
The error was coming from deep inside the implementation of tree.Add(). In send(sg) I added a check to make sure the values I was putting in were sane. They always were. I added the same check inside of Add(). There the check would fail. I pushed the check up one level to be inside of collect(). It also failed there.
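The check itself was nothing fancy; roughly something like this (a sketch with illustrative names, not the exact fsm code):

func checkLabelGraph(lg *labelGraph) {
	if lg == nil || lg.sg == nil {
		panic("nil labelGraph received from channel")
	}
	// A real label for this workload is small; a clobbered slice header
	// shows up as an absurd length (0xdeaddeaddeaddead prints as a huge
	// or negative number when interpreted as an int).
	if len(lg.label) < 0 || len(lg.label) > 1<<20 {
		panic("suspicious label length")
	}
}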
So basically, I was putting good values into the channel but getting dead values out. Super weird. If I turn off using the mmap'ed B+Tree and use a regular data structure, the errors go away.
I tried doing something which should have had no effect: I added buffering to the channels.
ch: make(chan *labelGraph, 1)
This made the error stop, which is great. However, now I get a new, even more mysterious error:
unexpected fault address 0xc208246c90
fatal error: fault
[signal 0xb code=0x2 addr=0xc208246c90 pc=0xc208246c90]
goroutine 200 [running]:
runtime.throw(0x602db0, 0x5)
/home/hendersont/srcs/go/src/runtime/panic.go:527 +0x96 fp=0xc208ef3dd8 sp=0xc208ef3dc0
runtime.sigpanic()
/home/hendersont/srcs/go/src/runtime/sigpanic_unix.go:27 +0x2ae fp=0xc208ef3e28 sp=0xc208ef3dd8
created by github.com/timtadh/fsm/mine.(*Miner).filterAndExtend
/home/hendersont/stuff/research/fsm/src/github.com/timtadh/fsm/mine/mine.go:235 +0x8e
This looks like it might be a segmentation fault? These do not happen every run; the other error was happening consistently. This error seems to only happen when I use anonymous memory maps. If they are backed by files it does not occur (well, it has not occurred yet).
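For context, the two kinds of mapping in question look roughly like this; a generic sketch using syscall.Mmap directly, not fs2's actual code:

package mmapdemo

import (
	"os"
	"syscall"
)

const size = 1 << 20 // 1 MiB for the example

// anonMap makes a private anonymous mapping: the pages are not backed by
// any file, so once dirtied they can only live in RAM or swap.
func anonMap() ([]byte, error) {
	return syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
}

// fileMap makes a shared file-backed mapping: dirty pages can be written
// back to the file and evicted, which is how the file-backed configuration
// behaves under memory pressure.
func fileMap(path string) ([]byte, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if err := f.Truncate(size); err != nil {
		return nil, err
	}
	return syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_SHARED)
}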
Whatever my program's problems are, the fragmentation problem in the runtime also exists in 1.5. For instance, after a successful run I get this with GODEBUG=gctrace=1:
gc #13181 @266.838s 18%: 0+0+0+10+0 ms clock, 0+0+0+0/10/0+1 ms cpu, 4->4->3 MB, 4 MB goal, 4 P
gc #13182 @266.869s 18%: 0+0+12+8+0 ms clock, 0+0+0+4/8/0+1 ms cpu, 4->6->4 MB, 4 MB goal, 4 P
gc #13183 @266.885s 18%: 0+0+0+13+0 ms clock, 0+0+0+0/13/0+2 ms cpu, 4->4->3 MB, 7 MB goal, 4 P
gc #13184 @266.905s 18%: 0+0+0+11+0 ms clock, 0+0+0+0/11/0+1 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc #13185 @266.929s 18%: 0+0+5+9+0 ms clock, 0+0+0+2/9/0+1 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13186 @266.949s 18%: 0+0+0+7+5 ms clock, 0+0+0+0/7/0+23 ms cpu, 4->4->2 MB, 5 MB goal, 4 P
gc #13187 @266.988s 18%: 0+0+0+32+0 ms clock, 0+0+0+11/12/0+2 ms cpu, 4->7->4 MB, 4 MB goal, 4 P
gc #13188 @267.012s 18%: 0+0+3+14+0 ms clock, 0+0+0+0/14/0+2 ms cpu, 4->5->3 MB, 8 MB goal, 4 P
gc #13189 @267.033s 18%: 0+0+0+13+0 ms clock, 0+0+0+0/13/0+1 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13190 @267.050s 18%: 0+0+0+10+0 ms clock, 0+0+0+0/10/0+2 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc #13191 @267.082s 18%: 0+0+3+16+0 ms clock, 3+0+0+8/16/0+0 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13192 @267.095s 18%: 0+0+0+7+0 ms clock, 0+0+0+0/7/0+2 ms cpu, 4->4->2 MB, 6 MB goal, 4 P
gc #13193 @267.127s 18%: 4+1+4+12+0 ms clock, 18+1+0+0/12/0+1 ms cpu, 4->6->3 MB, 5 MB goal, 4 P
gc #13194 @267.145s 18%: 0+0+0+10+0 ms clock, 0+0+0+0/10/0+0 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc #13195 @267.170s 18%: 0+0+1+11+0 ms clock, 2+0+0+0/11/0+2 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13196 @267.192s 18%: 0+0+1+10+0 ms clock, 0+0+0+0/10/0+2 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13197 @267.215s 18%: 0+0+0+14+0 ms clock, 0+0+0+0/14/0+1 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc #13198 @267.240s 18%: 0+0+3+10+0 ms clock, 3+0+0+0/10/0+2 ms cpu, 4->5->3 MB, 4 MB goal, 4 P
gc #13199 @267.260s 18%: 0+0+0+14+0 ms clock, 0+0+0+1/14/0+2 ms cpu, 4->4->3 MB, 4 MB goal, 4 P
gc #13200 @267.287s 18%: 0+0+6+9+0 ms clock, 0+0+0+0/9/0+1 ms cpu, 4->5->3 MB, 4 MB goal, 4 P
gc #13201 @267.318s 18%: 0+0+10+13+0 ms clock, 0+0+0+0/13/0+1 ms cpu, 4->5->4 MB, 5 MB goal, 4 P
gc #13202 @267.331s 18%: 0+0+0+9+0 ms clock, 0+0+0+0/9/0+2 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc #13203 @267.367s 18%: 2+0+7+13+0 ms clock, 11+0+0+0/13/0+0 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13204 @267.384s 18%: 0+0+0+9+0 ms clock, 0+0+0+0/9/0+1 ms cpu, 4->5->3 MB, 5 MB goal, 4 P
gc #13205 @267.416s 18%: 0+0+7+12+0 ms clock, 0+0+0+3/12/0+1 ms cpu, 4->6->3 MB, 5 MB goal, 4 P
But the resident size was continuing to grow at a healthy clip!
This image shows the size of the memory mapped files (as files rather than anonymous mappings) versus the size of the heap, which is growing (and continuing to grow, actually outpacing the growth rate of the memory maps).
It looks like:

1. There might be a bug with respect to the unbuffered channels. It is difficult for me to tell if it is your bug or mine. Let me know if you have any suggestions for sorting that out.
2. The fragmentation is not fixed by the new collector. The target heap size stays fairly constant at around 4-8 MB, but the heap grows at an unbounded rate.

I am interested in helping solve both problems. Let me know how I can help.
Can you give us a standalone test case that we can use to reproduce the problem?
The error you are describing is consistent with what I suggested earlier: something is freeing C memory when Go thinks it is still active. Where is sg allocated?
Setting milestone to 1.5 to make sure we figure out why the program crashes with 1.5, if possible.
The Go GC'ed heap uses segregated size classes, so if the sizes of the objects being freed are similar to those being allocated, fragmentation should not be a problem.
@RLH w.r.t.
so if the sizes of the objects being freed are similar to those being allocated, fragmentation should not be a problem
In frequent subgraph mining, the sizes of the subgraphs (and therefore the allocations) grow as the program proceeds. This may be something of a pathological case for a memory allocator. There are two things going on: what I think tends to happen is that the space where the subgraphs are allocated either

a. becomes too small during the next round, or
b. gets fragmented by other allocations.

These are both hypotheses. I do not have data to back up those claims. I will think about how to collect it.
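One way I could collect that data is to periodically log runtime.MemStats alongside the gctrace output; a minimal sketch (my own helper, not part of fsm):

package memdebug

import (
	"log"
	"runtime"
	"time"
)

// LogMemStats periodically prints how much heap is live versus how much the
// runtime has obtained from the OS. A flat HeapAlloc/HeapObjects next to a
// growing HeapSys would point at fragmentation or retention rather than a
// leak of live Go objects.
func LogMemStats(interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	var m runtime.MemStats
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			runtime.ReadMemStats(&m)
			log.Printf("live=%dMB objects=%d sys=%dMB idle=%dMB released=%dMB",
				m.HeapAlloc>>20, m.HeapObjects,
				m.HeapSys>>20, m.HeapIdle>>20, m.HeapReleased>>20)
		}
	}
}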
@ianlancetaylor I can provide the program and a data set to run it on immediately (as well as an appropriate configuration). Providing a minimal test case may be harder as I am not sure what aspect of the program is causing the segmentation fault/dangling pointer. I will try and make a small program today that demonstrates it. I will also give a shot at making a program which replicates the allocation behavior of my program.
@ianlancetaylor Also in response to your question
The error you are describing is consistent with what I suggested earlier: something is freeing C memory when Go thinks it is still active. Where is sg allocated?
Subgraphs are allocated in 2 functions. Both are normal Go allocations. This part does not deal with unsafe code.
Thanks for your help!
Saw this, wanted to be sure that the contents of the mmap'd B+Tree were visible/known to the GC:
So basically, I was putting good values into the channel but getting dead values out. Super weird. If I turn off using the mmap'ed B+Tree and use a regular data structure the errors go away

I'm still new to Go and the library, so I could be barking up the wrong tree. But if I'm not, that's a very likely cause of 0xdeaddeaddead: hide pointers from the GC, run the GC, it reclaims the storage, and when you return pointers from your hiding place, they reference junk.
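In case it's useful, here is the general shape of that failure mode as a tiny standalone sketch (nothing to do with fs2 specifically; unsafe and uintptr just make the hiding explicit):

package hidden

import "unsafe"

type node struct{ payload [64]byte }

// stash holds the address as a plain integer, so as far as the GC is
// concerned nothing references the node anymore.
var stash uintptr

func hide() {
	n := &node{}
	stash = uintptr(unsafe.Pointer(n)) // the only "reference" is now invisible to the GC
}

func use() *node {
	// If a collection ran in between, this may point at reclaimed memory,
	// which can later surface as junk of the 0xdead... variety.
	return (*node)(unsafe.Pointer(stash))
}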
@dr2chase The values being put into the channel are normal Go values. See my comment above to @ianlancetaylor on where the subgraphs are constructed.
For clarity, when things get put into the B+Tree they have to first be serialized into a byte slice. The byte slice is ultimately copied into the mmap'ed area. In the case that keys/values are fixed size instead of variable sized, they get copied in a separate code path.
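Sketched out, the idea is roughly this (illustrative only, not the actual fs2 code; Serialize is a stand-in name for the real encoding step):

// putSerialized is illustrative only: Serialize is a stand-in for the real
// encoding step, and the offset handling in fs2 is more involved.
func putSerialized(mapped []byte, off int, label []byte, sg *goiso.SubGraph) int {
	value := Serialize(sg)           // flatten the subgraph into plain bytes (stand-in)
	n := copy(mapped[off:], label)   // copy the key bytes into the mmap'ed region
	n += copy(mapped[off+n:], value) // copy the value bytes right after
	return n                         // only bytes ever enter the mapping, never Go pointers
}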
@ianlancetaylor @bradfitz Here is how to reproduce the bug.

Running fsm to reproduce the bug

Prerequisites: gcc 4.8.2 (later versions are also known to work). The C++ compiler is for compiling bliss in https://github.com/timtadh/goiso. Note, this program is part of my research for my PhD. It is not super friendly to run and only works on Linux. Sorry for the inconvenience this may cause you!
Steps:

$ git clone git@github.com:timtadh/fsm
$ cd fsm
$ git checkout 2510774dfe2a578e94bc8ba52dbc3e467af001c4
$ git submodule init
$ git submodule update
$ cat .activate
export GOPATH=$(pwd):$GOPATH
export PATH=$(pwd)/bin:$PATH
$ source .activate
$ go install github.com/timtadh/fsm
At this point you should have a working fsm program. Note that you may get some gcc warnings about non-static data initializers.

Test that fsm is working:

$ fsm -h

You should see a helpful message on how to use fsm.
Now you need to download the data: http://tyrocine.case.edu/static/wiki-sort.gz. If you use wget you may see a certificate error. This is due to the way CWRU signed the cert (this is a server I run to host student code). If you download via Firefox or Chrome the cert should correctly validate.
To run:
$ fsm -s 5 -m 8 --mem-profile=mem.pprof --mem-cache -o /tmp/wiki ./wiki-sort.gz
See the -h flag for definitions of each of the flags used. This program is a research project, so it is a bit rough around the edges!
The program should run for a while and then fail with a panic such as:
panic: Key was a bad value -2401018187971961171 [] 0xc208455fa0
0xc208965200
goroutine 110 [running]:
github.com/timtadh/fsm/mine.(*Miner).collector(0xc2084d8230, 0x7fb150290510, 0xc208449400, 0xc208f780c0)
/tmp/fsm/src/github.com/timtadh/fsm/mine/mine.go:329 +0x321
created by github.com/timtadh/fsm/mine.(*Miner).makeCollectors
/tmp/fsm/src/github.com/timtadh/fsm/mine/mine.go:354 +0x1fb
goroutine 1 [chan receive, 1 minutes]:
main.main()
/tmp/fsm/src/github.com/timtadh/fsm/main.go:458 +0x232a
goroutine 17 [syscall, 1 minutes, locked to thread]:
runtime.goexit()
If you re-run with go 1.4.2 you should no longer see the crash.
Indeed, it does fail for me. I may not know enough about details of the GC to make quick sense of this, but it could be an educational adventure.
@dr2chase The interesting thing (at least to me) is it doesn't happen on every dataset. However, the wiki-sort graph seems to trigger it every single time. The fragmentation (which is what I hope I can get some feedback on) happens every time, however.
@ianlancetaylor @dr2chase @bradfitz @RLH This commit https://github.com/timtadh/fsm/commit/0f8739538c7179d06f6e6c827860e779f9b0a368 stops the errors I was seeing in master from occurring. I have not seen any segmentation faults today, so perhaps that was fixed in a different commit.
note:
$ go version
go version devel +0f6da89 Wed Jun 3 20:43:51 2015 +0000 linux/amd64
As I note in the commit message, I should not have to buffer the channel. My guess is there is some sort of race condition where things going through an unbuffered channel are briefly "invisible" to the garbage collector. I think that exploring that hypothesis for the error I was seeing is the best road.
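One way I can think of to probe that hypothesis is a stress test along these lines (my own sketch, not something in fsm): force collections while pointers are in flight on an unbuffered channel and verify the payloads on the other side.

package chanstress

import (
	"bytes"
	"runtime"
	"testing"
)

type msg struct{ data []byte }

// TestUnbufferedChanSurvivesGC hammers an unbuffered channel while forcing
// collections, and checks that every payload arrives intact.
func TestUnbufferedChanSurvivesGC(t *testing.T) {
	want := bytes.Repeat([]byte{0xab}, 64)
	ch := make(chan *msg) // unbuffered, like the failing configuration
	done := make(chan struct{})
	go func() {
		defer close(done)
		for m := range ch {
			if m == nil || !bytes.Equal(m.data, want) {
				t.Error("received a clobbered message")
				return
			}
		}
	}()
	go func() {
		for i := 0; i < 1000; i++ {
			runtime.GC() // keep collections running while sends are in flight
		}
	}()
	for i := 0; i < 100000; i++ {
		ch <- &msg{data: bytes.Repeat([]byte{0xab}, 64)}
	}
	close(ch)
	<-done
}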
Any thoughts of the fragmentation issue would be appreciated. It definitely seems to be a real issue.
We're exploring that hypothesis, because it looks suspiciously like that to us (me, Rick) too. I'm collecting 4000 failure logs tonight.
Saw this in chan.go, thought "Oh really?":
if elem.kind&kindNoPointers != 0 || size == 0 {
// Allocate memory in one call.
// Hchan does not contain pointers interesting for GC in this case:
// buf points into the same allocation, elemtype is persistent.
// SudoG's are referenced from their owning thread so they can't be collected.
@RLH
Damien's email, #11053, seems more interesting.
https://github.com/golang/go/issues/11053
I can reproduce the memory growth you are seeing (with 1.4.2). But it is not the Go heap. The Go heap remains constant-sized at around 8MB, as you can see in your gctrace=1 output. Something else is growing.
Diffing two instances of /proc/pid/maps, I see lots of instances like this:
< 7f6670000000-7f667238e000 rw-p 00000000 00:00 0
< 7f667238e000-7f6674000000 ---p 00000000 00:00 0
---
> 7f6670000000-7f66724e7000 rw-p 00000000 00:00 0
> 7f66724e7000-7f6674000000 ---p 00000000 00:00 0
which means larger portions of existing anonymous mappings are acquiring backing store. In this case, ~1.4MB extra in a 64MB mapping.
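For concreteness, the arithmetic behind those two diff lines: the whole reserved region 7f6670000000-7f6674000000 is 0x4000000 bytes = 64 MB. The readable/writable (rw-p) portion grew from 0x238e000 bytes (about 35.6 MB) to 0x24e7000 bytes (about 36.9 MB), while the inaccessible (---p) tail shrank correspondingly, so the mapping acquired 0x159000 bytes, roughly 1.4 MB, of additional backing store.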
@randall77 Thanks for the help!
I am going to investigate this today. I wrote most of the code that my system uses but I am making use of a C++ library for graph isomorphism that I did not write. My initial guess is a memory management error there.
One note: if you run the program with --cache=/path/to/tmp/dir/ it will use file-backed maps instead of anonymous maps for the B+Trees. When I look at diffs of /proc/<pid>/maps using that option, I see the memory maps for the B+Trees moving around (as I expect, because I use mremap when necessary to expand a mapping, with MREMAP_MAYMOVE). However, I see a bunch of other maps in there moving around as well. I am not allocating those maps, so there are two options:
1. Go is allocating them as part of its memory management scheme. I know that by default the allocator on Linux does this. Excerpt from man malloc:

   Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).

2. libbliss, which I use for graph isomorphism, is doing a bad job managing its memory. I think to test this I will write a C program to create lots of graphs and run it under valgrind.

Thanks for the insights.
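To make taking those /proc/<pid>/maps diffs easier to repeat, I can snapshot the maps from inside the program; a minimal sketch (my own helper, not part of fsm):

package mapsnap

import (
	"fmt"
	"io/ioutil"
	"time"
)

// Snapshot copies /proc/self/maps into a timestamped file under dir, so
// successive snapshots can be diffed to see which regions are growing.
func Snapshot(dir string) error {
	data, err := ioutil.ReadFile("/proc/self/maps")
	if err != nil {
		return err
	}
	name := fmt.Sprintf("%s/maps-%d.txt", dir, time.Now().UnixNano())
	return ioutil.WriteFile(name, data, 0644)
}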
@randall77 Looks like libbliss has a memory leak when performing canonical computations. I am going to chase that down for now. I will report back if I get any more unexpected errors in Go 1.5. Thanks for all of your help everyone!
@randall77 @dr2chase @ianlancetaylor @RLH The memory leak in libbliss has been plugged and that seems to have solved the heap growth problem! Thank you all for helping me out. I have been getting a few inconsistent panics in a GC thread today; however, they have been too inconsistent to track down. If I can find a consistent way to trigger it I will open another bug. Feel free to rename or close this issue.
Thanks again. -Tim
It's suspected that some recent changes for concurrent GC may be involved in the panic. It reproduces quite nicely on my Linux box and is robust in the face of debugging printlns; my suspicion is that the reference is sometimes dropped during an unbuffered channel send/receive.
Let's open a separate issue for the panic.
What version of Go are you using (go version)?
What operating system and processor architecture are you using?
What did you do?
run https://github.com/timtadh/fsm on a large graph (see explanation below)
What did you expect to see?
memory usage remain relatively stable
What did you see instead?
Growth of memory throughout run of program, sometimes extremely rapid growth.
Explanation
I have been working on a frequent subgraph mining algorithm in Go. In case you are not familiar with frequent subgraph mining, it involves looking in a very large graph for repeated subgraphs. This is an exponential problem and can require a lot of memory.
To solve the memory issue, I wrote a memory mapped B+Tree fs2/bptree which stores the candidate subgraphs. The B+Tree works fine and generally has very little overhead in terms of using memory.
What does have overhead is generating the candidate graphs. Basically the process goes something like this:
In theory, the overall memory usage (outside of the B+Trees) should not grow in this process. However, it does grow. I believe it is due to the heap becoming fragmented. I am willing to do considerable legwork to both show this conclusively and to help fix the problem.
This may not be fixable
Before we go into my reasons for believing this is a fragmentation issue, let me first note that this may not be fixable. The solutions I have outlined at the bottom of the post may be impractical. However, I think it would be worth trying to fix it.
In Depth
I have been profiling my program extensively for several months to try and get rid of memory allocations as much as possible. However, there are some limits to my ability to completely get rid of allocations during the mining process. It also becomes difficult to reason about ownership of particular bits of memory (like the subgraph objects) when they are flying around between different goroutines.
One piece of evidence I have gathered for this being a fragmentation problem is the statistics reported by GODEBUG=gctrace=1. Here is a snapshot after the program has been running for a while:

One thing to notice: the number of live objects stays relatively constant throughout the execution of the program. It stays in the neighborhood of 70k-80k on a 4 core machine. Inspecting the memory (as reported in /proc/<pid>/maps) shows that the Go heap does continue to grow. Allowing the program to run for a very long time essentially causes the B+Tree memory maps to be evicted from memory to make space for the ever-growing heap.
Sources of Memory Allocations
As you can see, the sources of allocations are what you would expect (for instance, computing the canonical subgraphs in canonSubGraph). I am obviously experimenting with ways to reduce the allocations performed by these operations. However, at some point I may just reach the limits of what is easily achievable in Go and I will have to switch to either C++ or Rust (which did not really exist when I started this project in 2012).
Does Go 1.5 Help?
I compiled and ran master yesterday. Unfortunately, my program does not currently execute on master. I get lots of weird errors. Maybe this is because I do a lot of unsafe things in my B+Tree implementation? I don't know. I can post those errors if you want to see them, but I think they are a separate issue.
Possible Solutions
The best idea I can think of would be a runtime routine which would stop the world and do a heap compaction. This would never be triggered automatically (that would be catastrophic in, say, an HTTP server). However, a program could call runtime.HeapCompact(), which would cause the mutator to stop and a full garbage collection plus compaction to run.

The way this could work is to have a second garbage collector which is either a copying collector or a mark-and-compact collector. The basic idea of both is to find all of the objects live at the time of collection, and then to move each object one by one to a new location, rewriting all pointers into that object. This is obviously an expensive operation.
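To make that concrete, here is roughly where such a call would sit in a program like mine (purely hypothetical: runtime.HeapCompact() does not exist, and the Miner methods shown are stand-ins):

// Purely hypothetical sketch: runtime.HeapCompact() is the proposed API, and
// Miner, hasCandidates, and extendAndCollect are stand-ins for the real code.
func mineAll(m *Miner) {
	for level := 0; m.hasCandidates(); level++ {
		m.extendAndCollect(level) // the allocation-heavy phase for this level
		// At a level boundary most candidates from the previous level are
		// dead, so a stop-the-world compaction here would defragment the
		// survivors before the next, larger round of allocations.
		runtime.HeapCompact() // the proposed API
	}
}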
Drawbacks
Benefits
Other Solutions
Incremental
Incremental compaction during GC or allocations. This would mean that the heap would never be fully compacted; however, it would limit the amount of work done at any one time and make the process fully automatic:
Drawbacks
Benefits
Improve the Allocator
The other idea is to simply improve the allocator to reduce the fragmentation in the first place. Programs, like mine, which cause fragmentation can be collected and their behavior can be studied to figure out if there are ways to reduce the fragmentation.
Drawbacks
Benefits
Conclusion
I believe Go currently has a heap fragmentation problem that impacts long-running programs which do lots of small and medium-sized allocations. A potential solution is to add a function, runtime.HeapCompact(), which would trigger a stop-the-world compaction.