ionelmc / pytest-benchmark

py.test fixture for benchmarking code
BSD 2-Clause "Simplified" License

GPU memory does not seem to be freed between iterations or rounds in benchmark #215

Open howin98 opened 2 years ago

howin98 commented 2 years ago

Whether I use:

def test_alexnet_batch_size1(benchmark):
    benchmark.pedantic(run_alexnet_batch_size1, rounds=50)

or

def test_alexnet_batch_size1(benchmark):
    benchmark.pedantic(run_alexnet_batch_size1, iterations=50)

the output is:

platform linux -- Python 3.7.10, pytest-7.1.0, pluggy-1.0.0 -- /opt/conda/bin/python3
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision
plugins: benchmark-3.4.1, forked-1.4.0, xdist-2.5.0
collecting ... collected 5 items

flow_vision/benchmark/test_alexnet.py::test_alexnet_batch_size16 loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220329 05:37:29.127727 139 cuda_allocator.cpp:282] OOM error is detected, process will exit. And it will start to reset CUDA device for releasing device memory.
F20220329 05:37:30.156129 139 cuda_allocator.cpp:285] Error! : Out of memory when allocate size : 150994944. The total_memory_bytes allocated by this CudaAllocator is : 4907335680
Check failure stack trace:
    @ 0x7fdf4caaa2ea (unknown)
    @ 0x7fdf4caaa5d2 (unknown)
    @ 0x7fdf4caa9e57 (unknown)
    @ 0x7fdf4caac9c9 (unknown)
    @ 0x7fdf46ed1c3a oneflow::vm::CudaAllocator::Allocate()
    @ 0x7fdf46edfb2d oneflow::vm::ThreadSafeAllocator::Allocate()
    @ 0x7fdf44875120 oneflow::vm::EagerBlobObject::TryAllocateBlobBodyMemory()
    @ 0x7fdf4487cd5f oneflow::vm::LocalCallOpKernelUtil::AllocateOutputBlobsMemory()
    @ 0x7fdf4487d6bf oneflow::vm::LocalCallOpKernelUtil::Compute()
    @ 0x7fdf4487c68b oneflow::vm::LocalCallOpKernelInstructionType::ComputeInFuseMode()
    @ 0x7fdf46ed7ce6 oneflow::vm::FuseInstructionType<>::Compute()
    @ 0x7fdf46ed680a oneflow::vm::CudaStreamType::Compute()
    @ 0x7fdf46eebd94 oneflow::vm::VirtualMachineEngine::DispatchInstruction()
    @ 0x7fdf46eec94f oneflow::vm::VirtualMachineEngine::DispatchAndPrescheduleInstructions()
    @ 0x7fdf46ef1d18 oneflow::vm::VirtualMachineEngine::Schedule()
    @ 0x7fdf46ee2a10 oneflow::VirtualMachine::ScheduleLoop()
    @ 0x7fdf4f4bd82f (unknown)
    @ 0x7fdf8a2bc6db start_thread
    @ 0x7fdf89fe561f clone
Fatal Python error: Aborted

Thread 0x00007fdf8a6ed740 (most recent call first):
  File "/opt/conda/lib/python3.7/site-packages/oneflow/framework/tensor.py", line 985 in _numpy
  File "/home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision/benchmark/test_alexnet.py", line 22 in run_alexnet_batch_size16
  File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 97 in runner
  File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 222 in _raw_pedantic
  File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 140 in pedantic
  File "/home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision/benchmark/test_alexnet.py", line 27 in test_alexnet_batch_size16
  File "/opt/conda/lib/python3.7/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.7/site-packages/_pytest/python.py", line 1761 in runtest
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 338 in from_call
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 322 in _main
  File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.7/site-packages/_pytest/config/__init__.py", line 165 in main
  File "/opt/conda/lib/python3.7/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/opt/conda/lib/python3.7/site-packages/pytest/__main__.py", line 5 in <module>
  File "/opt/conda/lib/python3.7/runpy.py", line 85 in _run_code
  File "/opt/conda/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Error: Error: The process '/usr/bin/docker' failed with exit code 134
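
For context, here is a rough model of what `rounds` and `iterations` mean in pedantic mode. It is a simplified sketch and not the plugin's actual code; `pedantic_sketch` is a made-up name. Either way the target is called 50 times inside the same process, and nothing in the loop itself releases device memory, which may be why both variants end in the same OOM.

```python
import time


def pedantic_sketch(target, rounds=1, iterations=1, setup=None):
    """Simplified model of pedantic mode (hypothetical, not pytest-benchmark's code)."""
    timings = []
    for _ in range(rounds):
        if setup is not None:
            setup()  # runs once per round, outside the timed region
        start = time.perf_counter()
        for _ in range(iterations):
            target()  # called back to back, no cleanup in between
        # one measurement per round, averaged over its iterations
        timings.append((time.perf_counter() - start) / iterations)
    return timings
```

With `rounds=50` this produces 50 measurements of one call each; with `iterations=50` it produces one measurement averaged over 50 calls. Any memory the target or the framework's allocator keeps alive carries over from one call to the next in both cases.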

howin98 commented 2 years ago

Thanks a lot for any suggestions.