sugarme / gotch

Go binding for PyTorch C++ API (libtorch)
Apache License 2.0

Cgo Memory Leak #132

Open nullbull opened 6 months ago

nullbull commented 6 months ago

We use gotch in our online services and noticed that the server's RSS (resident set size, our memory usage indicator) keeps rising. I suspected a memory leak, but Go's pprof shows the program using only a few dozen MB while the RSS continues to grow. After trying several approaches, I finally confirmed with valgrind that there is indeed a cgo memory leak.
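The gap described above (a small Go heap next to a growing RSS) can be checked from inside the process. The following is a minimal sketch, assuming a Linux host where /proc/self/status is readable; the helper name parseVmRSS is made up for illustration and is not part of gotch.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSS extracts the VmRSS value (in kB) from the text of
// /proc/<pid>/status. It is a hypothetical helper for this sketch.
func parseVmRSS(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	// Memory the Go runtime knows about (this is what pprof's heap profile covers).
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("Go heap in use: %d kB\n", ms.HeapInuse/1024)

	// Memory the kernel attributes to the whole process, including C/C++ allocations.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	rss, err := parseVmRSS(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("process RSS:    %d kB\n", rss)
}
```

Logging both numbers periodically would show whether the growth sits outside the Go heap, which is the situation where pprof's heap profile cannot help and native tools such as valgrind are needed.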

Here is my test code:

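package main

import (
    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/ts"
)

func main() {
    TestModel()
}

func TestModel() {
    N := 5
    // Load the TorchScript module; the load error is ignored in this test.
    m, _ := ts.ModuleLoad("test_full_save.pt")

    m.SetEval()
    for i := 0; i < N; i++ {
        tf := ts.MustRand([]int64{1, 7}, gotch.Float, gotch.CPU)
        res, _ := m.Forward(tf)
        // Deferred calls run when TestModel returns, not at the end of each iteration.
        defer res.MustDrop()
        defer tf.MustDrop()
    }
}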
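One detail of the loop above is worth illustrating separately from the reported leak: Go runs deferred calls when the enclosing function returns, so every tensor created in the loop stays allocated until the loop has finished. The sketch below shows that effect with a hypothetical resource type standing in for a native handle; it does not use gotch, and it makes no claim about where the leak in this issue comes from.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a handle that owns native memory.
type resource struct {
	live *int // shared counter of resources that have not been dropped yet
}

func newResource(live *int) *resource {
	*live++
	return &resource{live: live}
}

// Drop releases the resource by decrementing the shared counter.
func (r *resource) Drop() { *r.live-- }

// deferredDrops defers every Drop, as the test code does, and returns the
// highest number of resources that were alive at the same time.
func deferredDrops(n int) (peak int) {
	live := 0
	for i := 0; i < n; i++ {
		r := newResource(&live)
		defer r.Drop() // runs only when deferredDrops returns
		if live > peak {
			peak = live
		}
	}
	return peak
}

// eagerDrops releases each resource before the next iteration starts.
func eagerDrops(n int) (peak int) {
	live := 0
	for i := 0; i < n; i++ {
		r := newResource(&live)
		if live > peak {
			peak = live
		}
		r.Drop()
	}
	return peak
}

func main() {
	fmt.Println("peak live, deferred:", deferredDrops(5)) // prints 5
	fmt.Println("peak live, eager:", eagerDrops(5))       // prints 1
}
```

With N = 5 the difference is negligible, but in a long-running request loop the deferred form keeps memory alive for as long as the enclosing function runs, which can look like growth in RSS even when every handle is eventually released.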

Below is the memory leak information reported by valgrind. The command executed was:

valgrind --leak-check=full ./model_test

==2506424== Memcheck, a memory error detector
==2506424== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2506424== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==2506424== Command: ./model_test
==2506424== 
==2506424== Warning: set address range perms: large range [0x4a99000, 0x1a874000) (defined)
==2506424== Warning: set address range perms: large range [0x24ba9000, 0x44ba9000) (noaccess)
==2506424== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==2506424==          the SIGRT32 signal is used internally by Valgrind
==2506424== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==2506424==          the SIGRT32 signal is used internally by Valgrind
==2506424== Warning: client switching stacks?  SP change: 0x1fff000168 --> 0xc0000547d8
==2506424==          to suppress, use: --max-stackframe=687211890288 or greater
==2506424== Warning: client switching stacks?  SP change: 0xc000054778 --> 0x1fff0001f8
==2506424==          to suppress, use: --max-stackframe=687211890048 or greater
==2506424== Warning: client switching stacks?  SP change: 0x1fff0001f8 --> 0xc000054778
==2506424==          to suppress, use: --max-stackframe=687211890048 or greater
==2506424==          further instances of this message will not be shown.
==2506424== Conditional jump or move depends on uninitialised value(s)
==2506424==    at 0x5412B1: runtime.adjustframe (stack.go:575)
==2506424==    by 0x54B9CC: runtime.gentraceback (traceback.go:345)
==2506424==    by 0x541874: runtime.copystack (stack.go:932)
==2506424==    by 0x541DF6: runtime.newstack (stack.go:1112)
==2506424==    by 0x5554CA: runtime.morestack.abi0 (asm_amd64.s:570)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424== 
==2506424== Conditional jump or move depends on uninitialised value(s)
==2506424==    at 0x54109B: runtime.adjustpointer (stack.go:575)
==2506424==    by 0x54109B: runtime.adjustframe (stack.go:691)
==2506424==    by 0x54B9CC: runtime.gentraceback (traceback.go:345)
==2506424==    by 0x541874: runtime.copystack (stack.go:932)
==2506424==    by 0x541DF6: runtime.newstack (stack.go:1112)
==2506424==    by 0x5554CA: runtime.morestack.abi0 (asm_amd64.s:570)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424== 
==2506424== Conditional jump or move depends on uninitialised value(s)
==2506424==    at 0x5410A4: runtime.adjustpointer (stack.go:575)
==2506424==    by 0x5410A4: runtime.adjustframe (stack.go:691)
==2506424==    by 0x54B9CC: runtime.gentraceback (traceback.go:345)
==2506424==    by 0x541874: runtime.copystack (stack.go:932)
==2506424==    by 0x541DF6: runtime.newstack (stack.go:1112)
==2506424==    by 0x5554CA: runtime.morestack.abi0 (asm_amd64.s:570)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424== 
==2506424== Invalid write of size 8
==2506424==    at 0x6F2BA3: atg_rand (torch_api_generated.cpp.h:13234)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0xC000033FFF: ???
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x533537: runtime.exitsyscallfast.func1 (proc.go:3878)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /home/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424==  Address 0x1bd64750 is 0 bytes after a block of size 0 alloc'd
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0xC000033FFF: ???
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x533537: runtime.exitsyscallfast.func1 (proc.go:3878)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== Invalid read of size 8
==2506424==    at 0x637275: github.com/sugarme/gotch/ts.Rand (tensor-generated.go:36876)
==2506424==  Address 0x1bd64750 is 0 bytes after a block of size 0 alloc'd
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0xC000033FFF: ???
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x533537: runtime.exitsyscallfast.func1 (proc.go:3878)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== Invalid write of size 8
==2506424==    at 0x6F2BA3: atg_rand (torch_api_generated.cpp.h:13234)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424==  Address 0x1b08cef0 is 0 bytes after a block of size 0 alloc'd
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== Thread 4:
==2506424== Invalid write of size 8
==2506424==    at 0x6F2BA3: atg_rand (torch_api_generated.cpp.h:13234)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==  Address 0x1c623a80 is 0 bytes after a block of size 0 alloc'd
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424== 
==2506424== Conditional jump or move depends on uninitialised value(s)
==2506424==    at 0x5412B1: runtime.adjustframe (stack.go:575)
==2506424==    by 0x54B9CC: runtime.gentraceback (traceback.go:345)
==2506424==    by 0x541874: runtime.copystack (stack.go:932)
==2506424==    by 0x542745: runtime.shrinkstack (stack.go:1214)
==2506424==    by 0x511F26: runtime.scanstack (mgcmark.go:775)
==2506424==    by 0x510E44: runtime.markroot.func1 (mgcmark.go:240)
==2506424==    by 0x510AE4: runtime.markroot (mgcmark.go:213)
==2506424==    by 0x512DB1: runtime.gcDrainN (mgcmark.go:1184)
==2506424==    by 0x51186D: runtime.gcAssistAlloc1 (mgcmark.go:567)
==2506424==    by 0x511724: runtime.gcAssistAlloc.func1 (mgcmark.go:474)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424== 
==2506424== Conditional jump or move depends on uninitialised value(s)
==2506424==    at 0x5412B7: runtime.adjustframe (stack.go:575)
==2506424==    by 0x54B9CC: runtime.gentraceback (traceback.go:345)
==2506424==    by 0x541874: runtime.copystack (stack.go:932)
==2506424==    by 0x542745: runtime.shrinkstack (stack.go:1214)
==2506424==    by 0x511F26: runtime.scanstack (mgcmark.go:775)
==2506424==    by 0x510E44: runtime.markroot.func1 (mgcmark.go:240)
==2506424==    by 0x510AE4: runtime.markroot (mgcmark.go:213)
==2506424==    by 0x512DB1: runtime.gcDrainN (mgcmark.go:1184)
==2506424==    by 0x51186D: runtime.gcAssistAlloc1 (mgcmark.go:567)
==2506424==    by 0x511724: runtime.gcAssistAlloc.func1 (mgcmark.go:474)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424== 
==2506424== Invalid write of size 8
==2506424==    at 0x6F2BA3: atg_rand (torch_api_generated.cpp.h:13234)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==  Address 0x1c4e10c0 is 0 bytes after a block of size 0 alloc'd
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424== 
==2506424== 
==2506424== HEAP SUMMARY:
==2506424==     in use at exit: 44,375,527 bytes in 260,095 blocks
==2506424==   total heap usage: 1,188,890 allocs, 928,795 frees, 132,024,053 bytes allocated
==2506424== 
==2506424== Thread 1:
==2506424== 0 bytes in 1 blocks are definitely lost in loss record 3 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0xC000033FFF: ???
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x533537: runtime.exitsyscallfast.func1 (proc.go:3878)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== 0 bytes in 15 blocks are definitely lost in loss record 4 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== 0 bytes in 32 blocks are definitely lost in loss record 5 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63A9C3: _cgo_f43fd1d1fab7_Cfunc__Cmalloc (_cgo_export.c:30)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424== 
==2506424== 35 bytes in 1 blocks are definitely lost in loss record 46,245 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0x63AA73: _cgo_fbe7b6e40ad2_Cfunc__Cmalloc (_cgo_export.c:51)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x50C854: runtime.SetFinalizer.func2 (mfinal.go:450)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424==    by 0x559B84: runtime.newproc.abi0 (<autogenerated>:1)
==2506424==    by 0x94501F: ???
==2506424==    by 0x71355F: ??? (in /data00/home/niuzhenhao/git/model_test/model_test)
==2506424==    by 0x5552E4: runtime.mstart.abi0 (asm_amd64.s:390)
==2506424==    by 0x55526E: runtime.rt0_go.abi0 (asm_amd64.s:354)
==2506424== 
==2506424== 256 bytes in 1 blocks are possibly lost in loss record 150,591 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0xEE81916: mm_account_ptr_by_tid (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0xEE80E2E: mkl_serv_malloc (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0xB565E61: mkl_serv_domain_get_max_threads (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x5C41BF8: at::init_num_threads() (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x9D9FD0B: void at::native::(anonymous namespace)::batch_norm_cpu_channels_last_impl<float>(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x9DA1D44: at::native::(anonymous namespace)::batch_norm_cpu_kernel(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x61A8AF5: std::tuple<at::Tensor, at::Tensor, at::Tensor> at::native::batch_norm_cpu_transform_input_template<float, float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, at::Tensor&) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x619D7FE: at::native::batch_norm_cpu_out(at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double, at::Tensor&, at::Tensor&, at::Tensor&) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x619DF6E: at::native::batch_norm_cpu(at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x6EF1665: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__native_batch_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x69B4DB7: at::_ops::native_batch_norm::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,992 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0x5011A6: runtime.newobject (malloc.go:1202)
==2506424==    by 0x1FFF000157: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52B1E8: runtime.main.func1 (proc.go:171)
==2506424==    by 0x5553E8: runtime.systemstack.abi0 (asm_amd64.s:492)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,993 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0x105842ECD0C198DD: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52F76E: runtime.startm (proc.go:2326)
==2506424==    by 0x52FCD9: runtime.wakep (proc.go:2431)
==2506424==    by 0x533B12: runtime.newproc.func1 (proc.go:4105)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,994 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0x94501F: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52F76E: runtime.startm (proc.go:2326)
==2506424==    by 0x52FC4D: runtime.handoffp (proc.go:2361)
==2506424==    by 0x52FD54: runtime.stoplockedm (proc.go:2445)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,995 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0xFC800BFEA16DC641: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52F76E: runtime.startm (proc.go:2326)
==2506424==    by 0x52FCD9: runtime.wakep (proc.go:2431)
==2506424==    by 0x5316A4: runtime.resetspinning (proc.go:3111)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,996 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0x6BFF4AA2B6A77571: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52F76E: runtime.startm (proc.go:2326)
==2506424==    by 0x52FCD9: runtime.wakep (proc.go:2431)
==2506424==    by 0x5316A4: runtime.resetspinning (proc.go:3111)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,997 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x557263: runtime.asmcgocall.abi0 (asm_amd64.s:844)
==2506424==    by 0x7FFF: ???
==2506424==    by 0x3: ???
==2506424==    by 0xC000059167: ???
==2506424==    by 0x1FFF0001E7: ???
==2506424==    by 0x533A04: runtime.malg.func1 (proc.go:4081)
==2506424== 
==2506424== 352 bytes in 1 blocks are possibly lost in loss record 154,998 of 170,422
==2506424==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==2506424==    by 0x40116E1: allocate_dtv (dl-tls.c:286)
==2506424==    by 0x401204D: _dl_allocate_tls (dl-tls.c:532)
==2506424==    by 0x1A87CB95: allocate_stack (allocatestack.c:621)
==2506424==    by 0x1A87CB95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==2506424==    by 0x712CF0: _cgo_try_pthread_create (gcc_libinit.c:100)
==2506424==    by 0x712F1E: _cgo_sys_thread_start (gcc_linux_amd64.c:75)
==2506424==    by 0x5572A0: runtime.asmcgocall.abi0 (asm_amd64.s:874)
==2506424==    by 0x5011A6: runtime.newobject (malloc.go:1202)
==2506424==    by 0x1FFF000087: ???
==2506424==    by 0x52F1EB: runtime.newm (proc.go:2142)
==2506424==    by 0x52F76E: runtime.startm (proc.go:2326)
==2506424==    by 0x52FCD9: runtime.wakep (proc.go:2431)
==2506424== 
==2506424== 69,664 bytes in 1 blocks are possibly lost in loss record 170,408 of 170,422
==2506424==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2506424==    by 0xEE81B6C: mm_account_ptr_by_tid (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0xEE80E2E: mkl_serv_malloc (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0xB565E61: mkl_serv_domain_get_max_threads (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x5C41BF8: at::init_num_threads() (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x9D9FD0B: void at::native::(anonymous namespace)::batch_norm_cpu_channels_last_impl<float>(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x9DA1D44: at::native::(anonymous namespace)::batch_norm_cpu_kernel(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x61A8AF5: std::tuple<at::Tensor, at::Tensor, at::Tensor> at::native::batch_norm_cpu_transform_input_template<float, float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, at::Tensor&) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x619D7FE: at::native::batch_norm_cpu_out(at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double, at::Tensor&, at::Tensor&, at::Tensor&) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x619DF6E: at::native::batch_norm_cpu(at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x6EF1665: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__native_batch_norm>, std::tuple<at::Tensor, at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double> >, std::tuple<at::Tensor, at::Tensor, at::Tensor> (at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424==    by 0x69B4DB7: at::_ops::native_batch_norm::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, c10::optional<at::Tensor> const&, bool, double, double) (in /usr/local/lib/libtorch/lib/libtorch_cpu.so)
==2506424== 
==2506424== LEAK SUMMARY:
==2506424==    definitely lost: 35 bytes in 49 blocks
==2506424==    indirectly lost: 0 bytes in 0 blocks
==2506424==      possibly lost: 72,384 bytes in 9 blocks
==2506424==    still reachable: 44,303,108 bytes in 260,037 blocks
==2506424==         suppressed: 0 bytes in 0 blocks
==2506424== Reachable blocks (those to which a pointer was found) are not shown.
==2506424== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2506424== 
==2506424== For counts of detected and suppressed errors, rerun with: -v
==2506424== Use --track-origins=yes to see where uninitialised values come from
==2506424== ERROR SUMMARY: 146 errors from 23 contexts (suppressed: 0 from 0)

nullbull commented 6 months ago

[11:51:13] Top 10 stacks with outstanding allocations:
        1056 bytes in 1 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        1344 bytes in 7 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        1920 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        3520 bytes in 5 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        3520 bytes in 5 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        4560 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        5280 bytes in 55 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        5280 bytes in 5 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        19008 bytes in 99 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        23150 bytes in 125 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:19] Top 10 stacks with outstanding allocations:
        1920 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        1920 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        3840 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        7040 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        7040 bytes in 10 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        7392 bytes in 7 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        9120 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        10560 bytes in 110 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        37824 bytes in 197 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        46300 bytes in 250 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:24] Top 10 stacks with outstanding allocations:
        2880 bytes in 15 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        2880 bytes in 15 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        5760 bytes in 30 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        9504 bytes in 9 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        10560 bytes in 15 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        10560 bytes in 15 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        13680 bytes in 30 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        15840 bytes in 165 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        56832 bytes in 296 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        69450 bytes in 375 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:29] Top 10 stacks with outstanding allocations:
        3840 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        3840 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        7680 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        13728 bytes in 13 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        14080 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        14080 bytes in 20 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        18240 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        21120 bytes in 220 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        75840 bytes in 395 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        92600 bytes in 500 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:34] Top 10 stacks with outstanding allocations:
        4800 bytes in 25 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        4800 bytes in 25 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        9600 bytes in 50 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        15840 bytes in 15 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        17600 bytes in 25 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        17600 bytes in 25 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        22800 bytes in 50 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        26400 bytes in 275 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        94656 bytes in 493 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        115750 bytes in 625 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:39] Top 10 stacks with outstanding allocations:
        5760 bytes in 30 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        5776 bytes in 19 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        11520 bytes in 60 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        19008 bytes in 18 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        21120 bytes in 30 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        21120 bytes in 30 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        27360 bytes in 60 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        31680 bytes in 330 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        113664 bytes in 592 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        138900 bytes in 750 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:44] Top 10 stacks with outstanding allocations:
        6720 bytes in 35 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
        6720 bytes in 35 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        13440 bytes in 70 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        23232 bytes in 22 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        24640 bytes in 35 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        24640 bytes in 35 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        31920 bytes in 70 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        36960 bytes in 385 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        132672 bytes in 691 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        162050 bytes in 875 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:49] Top 10 stacks with outstanding allocations:
        7680 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
        7680 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        15360 bytes in 80 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        25344 bytes in 24 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        28160 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        28160 bytes in 40 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        36480 bytes in 80 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        42240 bytes in 440 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        151680 bytes in 790 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        185200 bytes in 1000 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:54] Top 10 stacks with outstanding allocations:
        8640 bytes in 45 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
        8640 bytes in 45 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
        17280 bytes in 90 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        28512 bytes in 27 allocations from stack
                c10::SmallVectorBase<unsigned int>::mallocForGrow(unsigned long, unsigned long, unsigned long&)+0x2f [libc10.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        31680 bytes in 45 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        31680 bytes in 45 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        41040 bytes in 90 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::addmm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&)+0x32 [libtorch_cpu.so]
                at::native::linear(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&)+0x45e [libtorch_cpu.so]
                [unknown]
                [unknown]
                c10::TensorImpl::~TensorImpl() [clone .localalias.356]+0x0 [libc10.so]
                [unknown]
        47520 bytes in 495 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>)+0x23 [libtorch_cpu.so]
        170688 bytes in 889 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
                [unknown]
        208350 bytes in 1125 allocations from stack
                operator new(unsigned long)+0x18 [libstdc++.so.6.0.25]
[11:51:59] Top 10 stacks with outstanding allocations:
        0 bytes in 48 allocations from stack
                [unknown]
        256 bytes in 8 allocations from stack
                [unknown]
        416 bytes in 2 allocations from stack
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
        704 bytes in 2 allocations from stack
                [unknown]
[11:52:04] Top 10 stacks with outstanding allocations:
        0 bytes in 48 allocations from stack
                [unknown]
        256 bytes in 8 allocations from stack
                [unknown]
        416 bytes in 2 allocations from stack
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
        704 bytes in 2 allocations from stack
                [unknown]
[11:52:09] Top 10 stacks with outstanding allocations:
        0 bytes in 48 allocations from stack
                [unknown]
        256 bytes in 8 allocations from stack
                [unknown]
        416 bytes in 2 allocations from stack
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
        704 bytes in 2 allocations from stack
                [unknown]
[11:52:14] Top 10 stacks with outstanding allocations:
        0 bytes in 48 allocations from stack
                [unknown]
        256 bytes in 8 allocations from stack
                [unknown]
        416 bytes in 2 allocations from stack
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
                [unknown]
        704 bytes in 2 allocations from stack
                [unknown]
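
A quick way to read the snapshots above: the largest stack grows by the same amount between every pair of consecutive reports from 11:51:19 to 11:51:54. The sketch below only does that arithmetic in Go, on the byte and allocation counts copied from the last entry of each report. The `sample` type and the function names are made up for illustration.

```go
package main

import "fmt"

// sample is one entry of the trace: the largest outstanding stack at a given time.
type sample struct {
	sec    int64 // seconds since the 11:51:19 report
	bytes  int64
	allocs int64
}

// largest holds the last entry of each report from 11:51:19 to 11:51:54.
var largest = []sample{
	{0, 46300, 250}, {5, 69450, 375}, {10, 92600, 500}, {15, 115750, 625},
	{20, 138900, 750}, {25, 162050, 875}, {30, 185200, 1000}, {35, 208350, 1125},
}

// constantStep returns the growth in bytes and allocations between the first
// two samples, and whether every later interval shows the same growth.
func constantStep(s []sample) (dBytes, dAllocs int64, ok bool) {
	if len(s) < 2 {
		return 0, 0, false
	}
	dBytes = s[1].bytes - s[0].bytes
	dAllocs = s[1].allocs - s[0].allocs
	for i := 2; i < len(s); i++ {
		if s[i].bytes-s[i-1].bytes != dBytes || s[i].allocs-s[i-1].allocs != dAllocs {
			return dBytes, dAllocs, false
		}
	}
	return dBytes, dAllocs, true
}

// bytesPerSecond is the average growth rate between the first and last sample.
func bytesPerSecond(s []sample) float64 {
	first, last := s[0], s[len(s)-1]
	return float64(last.bytes-first.bytes) / float64(last.sec-first.sec)
}

func main() {
	dB, dA, ok := constantStep(largest)
	fmt.Println(dB, dA, ok)
	fmt.Println(bytesPerSecond(largest))
}
```

With these inputs the step is 23150 bytes and 125 allocations per 5-second interval, which is 4630 bytes per second. From the 11:51:59 report onward this stack no longer appears among the outstanding allocations.
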
sugarme commented 6 months ago

@nullbull ,

Thanks for reporting. It would be much appreciated if you could provide the model test_full_save.pt used in your example. Even better if you could pinpoint which function causes the memory leak. Feel free to send a PR if you can, as I am in a very slow-response mode. Thank you!

nullbull commented 6 months ago

Can you give me an email address so I can send you the model? I tried to find which function is responsible, but it's hard for me because I am not familiar with C++.

nullbull commented 6 months ago

But I think any model may have memory leaks.

sugarme commented 6 months ago

@nullbull ,

Please fork the repo, add your model file to a release or as an entry in the examples, and share your fork. Alternatively, sharing it via Google Drive, Dropbox, or any public file-sharing service would be great. Thanks.

sugarme commented 6 months ago

@nullbull ,

I quickly tested your example, putting the forward pass inside ts.NoGrad() to completely shut down grad accumulation (because the ts.Randn() op sets grad to true by default). I also increased the tensor size to expose any modest leak, and it seems to be fine. Memory on my box seems stable for at least 1M cycles.

package main

import (
    "fmt"
    "github.com/sugarme/gotch"
    "github.com/sugarme/gotch/ts"
)

func main() {
    TestModel()
}

func TestModel() {
    N := 1_000_000_000
    m, err := ts.ModuleLoad("test_full_save.pt")
    if err != nil {
        panic(err)
    }

    m.SetEval()
    for i := 0; i < N; i++ {
        // tf := ts.MustRand([]int64{1, 7}, gotch.Float, gotch.CPU)
        tf := ts.MustRand([]int64{1024, 7}, gotch.Float, gotch.CPU)
        ts.NoGrad(func() {
            res, err := m.Forward(tf)
            if err != nil {
                panic(err)
            }
            res.MustDrop()
        })
        tf.MustDrop()

        if i%1000 == 0 {
            fmt.Printf("Done %d \n", i)
        }
    }
}

Please always handle errors as well. Let me know if that works on your box.

Note that when running forward() in a for loop, particularly with Go on CPU, we should expect some spiky fluctuation in memory consumption.
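
One difference from the original test is worth spelling out: that test called `defer res.MustDrop()` inside the loop, while the loop above drops each tensor as soon as it is done with it. In Go, a deferred call runs only when the enclosing function returns, so deferred drops pile up for the whole loop. The sketch below shows this with a fake handle counter instead of real tensors. `tracker`, `deferInLoop` and `dropInLoop` are hypothetical names, and nothing here calls gotch.

```go
package main

import "fmt"

// tracker counts how many fake native handles are live, and the peak seen so far.
type tracker struct{ live, peak int }

func (t *tracker) acquire() {
	t.live++
	if t.live > t.peak {
		t.peak = t.live
	}
}

func (t *tracker) release() { t.live-- }

// deferInLoop mimics `defer x.MustDrop()` inside a loop body.
func deferInLoop(n int) *tracker {
	t := &tracker{}
	func() {
		for i := 0; i < n; i++ {
			t.acquire()
			defer t.release() // runs only when this anonymous function returns
		}
	}()
	return t
}

// dropInLoop mimics calling x.MustDrop() at the end of each iteration.
func dropInLoop(n int) *tracker {
	t := &tracker{}
	for i := 0; i < n; i++ {
		t.acquire()
		t.release()
	}
	return t
}

func main() {
	a := deferInLoop(1000)
	b := dropInLoop(1000)
	fmt.Println("defer in loop: peak", a.peak, "live after return", a.live)
	fmt.Println("drop in loop:  peak", b.peak, "live after return", b.live)
}
```

Both variants end with zero live handles, but the deferred one holds all 1000 at once before releasing any. In a long-running loop that looks like steady growth until the function returns.
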

nullbull commented 6 months ago

@sugarme I used valgrind and it still finds a memory leak. I rewrote your code and found through stress testing that memory keeps growing even though the QPS has not increased.

nullbull commented 6 months ago

@sugarme My service handles over 5000 QPS per node, so it's easy to reach 1M cycles. It's a 20C/32G node.

image
sugarme commented 6 months ago

@nullbull ,

I would try the following things:

  1. Run a little longer (from the graph, your run was 10 minutes and memory increased by ~1%). It may plateau.
  2. Try moving the input tensor tf outside the for-loop. Something like:

tf := ts.MustRand([]int64{1024, 7}, gotch.Float, gotch.CPU)
for i := 0; i < N; i++ {
    ts.NoGrad(func() {
        res, err := m.Forward(tf)
        if err != nil {
            panic(err)
        }
        res.MustDrop()
    })
    // tf.MustDrop()

    if i%1000 == 0 {
        fmt.Printf("Done %d \n", i)
    }
}

If there is no leak, then the problem is in tensor creation (ts.MustRand).

  3. The for-loop itself could be the problem. Depending on how your server is composed, try the real use case rather than a for-loop.

nullbull commented 6 months ago

@sugarme Sorry, I was on my way home just now. The service using gotch is already online, and the cluster is now restarted on a schedule every day to make sure there is no OOM. The service code is not a for loop; the computation runs once per request. I used valgrind to run 100 loops and it detected a memory leak of 18B. I also did a stress test for 2 days last week: the service's memory usage grew to 95% and then it hit OOM and restarted. Here is my online code. The request sends a [][]float64 array and I need to convert it to a [][]float64 tensor:

    xPredict := tensors["x"].([][]float32)
    for _, v := range xPredict {
        modelInput = append(modelInput, v...)
    }
    tf := ts.MustOfSlice(modelInput).MustView([]int64{int64(len(xPredict)), int64(len(xPredict[0]))}, true)

    forward, err := model.Forward(tf)
    if err != nil {
        log.V2.Error().Str("local inf fail").With(ctx).Error(err).Emit()
    } else {
        //toString, _ := forward.ToString(10)
        log.V2.Info().With(ctx).Str("local inf success").Emit()
    }
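
The flattening step above indexes xPredict[0] directly, which would panic on an empty request, and it does not check that all rows have the same length. A minimal sketch of a checked version in plain Go follows. `flatten` is a hypothetical helper, not part of gotch, and the gotch calls are only mentioned in a comment.

```go
package main

import (
	"errors"
	"fmt"
)

// flatten copies a rectangular [][]float32 into one flat slice and returns
// the [rows, cols] shape. It rejects empty and ragged input.
func flatten(rows [][]float32) (data []float32, shape []int64, err error) {
	if len(rows) == 0 || len(rows[0]) == 0 {
		return nil, nil, errors.New("empty input")
	}
	cols := len(rows[0])
	data = make([]float32, 0, len(rows)*cols)
	for i, r := range rows {
		if len(r) != cols {
			return nil, nil, fmt.Errorf("row %d has %d values, want %d", i, len(r), cols)
		}
		data = append(data, r...)
	}
	return data, []int64{int64(len(rows)), int64(cols)}, nil
}

func main() {
	data, shape, err := flatten([][]float32{{1, 2, 3}, {4, 5, 6}})
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	// In the service, data and shape would then be passed on, for example to
	// ts.MustOfSlice(data).MustView(shape, true) as in the snippet above.
	fmt.Println(len(data), shape)
}
```

Note also that, unlike the loops posted earlier in this thread, the snippet above never calls MustDrop on tf or forward. Whether that explains the growth is not established here.
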
nullbull commented 6 months ago
image

I did a stress test for an hour, and memory went from 40.6% to 43.4%. After the requests stopped, memory dropped back to 5.6%. In theory, memory usage should be 0.x% without requests, because my service is very simple and has no local cache; it only does model inference.
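
To tell Go-heap growth apart from native (cgo/libtorch) growth in numbers like these, it can help to log the process RSS next to what the Go runtime itself reports. The sketch below is Linux-only and assumes /proc/self/status is readable. `parseVmRSS` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSS extracts the VmRSS value (in kB) from the text of a
// /proc/<pid>/status file. ok is false if the field is missing or malformed.
func parseVmRSS(status string) (kb int64, ok bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			v, err := strconv.ParseInt(fields[1], 10, 64)
			return v, err == nil
		}
	}
	return 0, false
}

func main() {
	// Memory the Go runtime has obtained from the OS (heap, stacks, metadata).
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("Go runtime Sys: %d kB, HeapAlloc: %d kB\n", ms.Sys/1024, ms.HeapAlloc/1024)

	// Resident set size of the whole process, including C/C++ allocations.
	raw, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	if kb, ok := parseVmRSS(string(raw)); ok {
		fmt.Printf("Process RSS:    %d kB\n", kb)
	}
}
```

Logged periodically from the service, these two figures show where the growth sits: if RSS keeps climbing while the Go runtime's numbers stay flat, that suggests the memory is being held outside the Go heap.
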

sugarme commented 6 months ago

@nullbull,

I suspect it's related to closed issue #102. It would be great if you could try the suggested solution:

func Rand(...) (...) {
    var untypedPtr uintptr
    ptr := (*lib.Ctensor)(unsafe.Pointer(&untypedPtr))
    // Some C call that stores an allocated tensor at *ptr.
    retVal = &Tensor{ctensor: *ptr}
    return retVal, err
} 

Thank you. 
nullbull commented 6 months ago

@sugarme I tried it, but it does not work. You can check with valgrind: it still reports a memory leak. I also ran a stress test and memory is still going up.

nullbull commented 5 months ago

@sugarme Even with the changes above, there is still a memory leak. Please help me solve it.