Thanks for reporting this, it'll be fixed in the next release.
I've just tried v0.7.1 and I have bad news: it didn't just crash, it almost took down my entire machine. The kernel eventually killed svfs after 10 minutes of OOM condition: [3219249.984164] Killed process 20716 (svfs) total-vm:2211136kB, anon-rss:1914140kB, file-rss:4kB. I suspect a goroutine explosion caused by some global deadlock involving the new locks you added to fix this issue. I'll try to capture more details between the moment it starts to get out of hand and the moment it brings the kernel to its knees and makes the host unusable.
I managed to capture a moment with almost 2000 goroutines.
```sh
while : ; do
  echo $(ps faxu | awk '/ svfs/ {if($6>10000){print $6}}') mb
  nb=$(curl -sL localhost:12131/debug/pprof/goroutine?debug=1 | awk '/gorou/ {print $4}')
  echo $nb goroutines
  [ $nb -gt 400 ] || continue
  echo WOW $nb
  for i in block goroutine heap threadcreate ; do
    curl -sL http://localhost:12131/debug/pprof/$i?debug=1 > svfs.$i
  done
  pkill svfs
  break
done
```
Usually it sits around 30 goroutines, but sometimes the count seems to explode very quickly, along with RAM usage.
Among the 2K goroutines, more than 1800 are actually executing this:
# 0x531bc9 github.com/ovh/svfs/svfs.(*Lister).AddTask.func1+0x69 /home/admin/.gvm/pkgsets/go1.6.2/global/src/github.com/ovh/svfs/svfs/lister.go:43
pprof debug files attached.
svfs.block.txt svfs.goroutine.txt svfs.heap.txt svfs.threadcreate.txt
I don't think this behavior is caused by the previous fix. I would say the directory lister doesn't have enough workers and listing tasks are piling up because tasks are added asynchronously.
Just to get some context, what are your mount options? How many files/directories do you have in your container(s)?
Thanks for providing these pprof extracts.
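To illustrate the lister hypothesis above: if every AddTask call spawns its own goroutine while too few workers drain the queue, pending listings (and their memory) grow without bound. Below is a minimal sketch of the bounded alternative; the names (Lister, AddTask, Task) mirror the trace but this is not the actual svfs code, just the general pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// Task stands in for a directory listing request (hypothetical type).
type Task struct{ Path string }

// Lister drains tasks with a fixed pool of workers. The key point is that
// AddTask sends into a bounded channel and blocks once the queue is full,
// instead of spawning one goroutine per pending listing.
type Lister struct {
	tasks chan Task
	wg    sync.WaitGroup
}

func NewLister(workers, queueSize int) *Lister {
	l := &Lister{tasks: make(chan Task, queueSize)}
	for i := 0; i < workers; i++ {
		l.wg.Add(1)
		go func() {
			defer l.wg.Done()
			for t := range l.tasks {
				list(t) // the actual Swift listing would happen here
			}
		}()
	}
	return l
}

// AddTask applies backpressure: callers wait when the queue is full,
// so pending work (and memory) stays bounded.
func (l *Lister) AddTask(t Task) { l.tasks <- t }

// Close stops accepting tasks and waits for the workers to finish.
func (l *Lister) Close() {
	close(l.tasks)
	l.wg.Wait()
}

func list(t Task) { fmt.Println("listing", t.Path) }

func main() {
	l := NewLister(4, 64)
	for i := 0; i < 1000; i++ {
		l.AddTask(Task{Path: fmt.Sprintf("dir-%d", i)})
	}
	l.Close()
}
```

With this shape, a burst of 1000 listing requests never holds more than 4 workers plus 64 queued tasks in flight at once.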
This is an rsync of 100G of photos, with 30K files in 900 directories. There are no "huge" directories with thousands of files. Mount options from fstab:
noauto,users,hubic_auth=X,hubic_token=Y,container=default,uid=speed,extra_attr,hubic_times,profile_addr=127.0.0.1:12131,profile_cpu=/dev/shm/sfvs.cpu,profile_ram=/dev/shm/svfs.ram
The profile_* options were only added for the bug report above; the crash happens without them.
To reproduce:
rsync -rW --inplace --size-only --progress localdir/ hubicdir/
Running a concurrent du -shc hubicdir/ at the same time seems to trigger the issue more easily.
There's indeed something wrong going on.
Below is the GC activity while running while : ; do tree container; done. The heap is progressively draining a large amount of memory from the operating system, and since the heap keeps growing with allocation spikes, that memory is never returned to the OS due to the way the Go GC works.
I'll take a look at the goroutine count.
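For anyone wanting to watch the same behavior without pprof: running the process with GODEBUG=gctrace=1 prints a line per GC cycle, or you can log runtime.MemStats periodically to compare how much heap the process holds versus how much has actually been returned to the OS. A minimal standalone sketch (not svfs code):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// Periodically log heap usage: HeapSys grows with allocation spikes,
// while HeapReleased shows what the runtime has given back to the OS.
func main() {
	var m runtime.MemStats
	for range time.Tick(5 * time.Second) {
		runtime.ReadMemStats(&m)
		log.Printf("heap alloc=%d MiB sys=%d MiB released=%d MiB gc=%d",
			m.HeapAlloc>>20, m.HeapSys>>20, m.HeapReleased>>20, m.NumGC)
	}
}
```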
I tracked this down mainly to overallocation of receive message buffers in the FUSE library (16 MiB by default). With 353aba9, memory consumption drops by a factor of 20 and CPU usage by a factor of 3. Some work still needs to be done in svfs to reduce node struct sizes and optimize caching.
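For readers following along: the sketch below is not the actual change in 353aba9 or the FUSE library's API, just an illustration of the general idea. Allocating a worst-case buffer (e.g. 16 MiB) per incoming request adds up very fast, whereas reusing right-sized buffers from a sync.Pool keeps the heap flat.

```go
package main

import (
	"fmt"
	"sync"
)

// bufSize is a deliberately modest size for the sketch; the real fix is
// about not allocating the maximum message size for every request.
const bufSize = 128 << 10 // 128 KiB

var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, bufSize) },
}

// handleRequest borrows a buffer, uses it, and returns it to the pool
// instead of letting a fresh worst-case allocation die on every request.
func handleRequest(fill func([]byte) int) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf)
	n := fill(buf)
	_ = buf[:n] // process the message here
}

func main() {
	for i := 0; i < 1000; i++ {
		handleRequest(func(b []byte) int { return copy(b, "payload") })
	}
	fmt.Println("done")
}
```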
Context
Steps to reproduce this issue:
Results you expected:
no crash
Results you observed:
crash
Debug log:
This is a Go runtime trace. It looks like you're missing a mutex when writing to some map. The problem may have existed for a long time, since runtime detection of concurrent map misuse was only added in Go 1.6.
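The kind of fix that trace points at is simply serializing access to the shared map; a minimal sketch with a hypothetical cache (not the actual svfs structure) looks like this. Building or testing with the -race flag would also surface the unsynchronized writes earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCache is a stand-in for whatever map the trace points at. Without the
// mutex, concurrent writes make the Go 1.6+ runtime abort with
// "fatal error: concurrent map writes".
type nodeCache struct {
	mu    sync.RWMutex
	nodes map[string]string
}

// Set serializes writers so the map is never mutated concurrently.
func (c *nodeCache) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[k] = v
}

// Get allows concurrent readers while excluding writers.
func (c *nodeCache) Get(k string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.nodes[k]
	return v, ok
}

func main() {
	c := &nodeCache{nodes: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Set(fmt.Sprintf("node-%d", i), "meta")
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.nodes), "entries")
}
```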