una-dinosauria / Rayuela.jl

Code for my PhD thesis. Library of quantization-based methods for fast similarity search in high dimensions. Presented at ECCV 18.
MIT License

LSQ training got stuck #38

Closed dryman closed 5 years ago

dryman commented 5 years ago

Hi, we're trying to reproduce the ECCV'18 paper.

The trainer got stuck in this stage:

Running CUDA LSQ training... 
**********************************************************************************************
Training LSQ GPU with 7 codebooks, 4 perturbations, 4 icm iterations and random order = true
**********************************************************************************************
Doing fast bin codebook update... done in 0.129 seconds.
 -2 1.913506e+04 
Creating 100000 random states... done in 0.15 seconds
^^^ stuck on this stage for 3 hours ^^^^^^

We checked the GPU utilization and found it was zero. Is this expected?

una-dinosauria commented 5 years ago

Nope, that is pretty weird. Could you please post the command that you ran? Also, if you interrupt with Ctrl+C while Julia is stuck there, what does the stack trace say?

dryman commented 5 years ago

My command was include("demos_train_query_base.jl"). Terminating Julia via Ctrl+C didn't work. When I killed it with kill, I got this:

in expression starting at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:170
clock_gettime at linux-vdso.so.1 (unknown line)
__clock_gettime at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fd1d47110dd)
unknown function (ip: 0x7fd1d47c6996)
unknown function (ip: 0x7fd1d46f979b)
unknown function (ip: 0x7fd1d46e9947)
unknown function (ip: 0x7fd1d46ea73c)
unknown function (ip: 0x7fd1d45f78dd)
unknown function (ip: 0x7fd1d45f9297)
cuMemcpyHtoD_v2 at /usr/lib/x86_64-linux-gnu/libcuda.so (unknown line)
macro expansion at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/base.jl:146 [inlined]
#upload!#10 at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:230
upload! at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:229 [inlined]
upload! at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:229 [inlined]
unsafe_copyto! at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:76 [inlined]
copyto! at /usr/local/google/home/fchern/.julia/packages/GPUArrays/AkOwl/src/abstractarray.jl:116
convert at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:99 [inlined]
convert at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:105 [inlined]
Type at /usr/local/google/home/fchern/.julia/packages/GPUArrays/AkOwl/src/construction.jl:36 [inlined]
encode_icm_cuda_single at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:72
encode_icm_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:231
macro expansion at ./printf.jl:159 [inlined]
train_lsq_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:297
experiment_lsq_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:345
unknown function (ip: 0x7fd1dc83b100)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
run_demos at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:72
top-level scope at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:171 [inlined]
top-level scope at ./none:0
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:841
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:865
include at ./boot.jl:317 [inlined]
include_relative at ./loading.jl:1038
include at ./sysimg.jl:29
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
include at ./client.jl:398
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:324
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:428
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:363 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:686
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:799
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7fd18828828f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:808
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:831
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:633
eval at ./boot.jl:319
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:85
run_backend at /usr/local/google/home/fchern/.julia/packages/Revise/EuQoV/src/Revise.jl:771
#58 at ./task.jl:262
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1538 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)
unknown function (ip: 0xffffffffffffffff)
Allocations: 1415364916 (Pool: 1414649159; Big: 715757); GC: 20296
InterruptException
atexit hook threw an error: OutOfMemoryError()
signal (11): Segmentation fault
in expression starting at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:170
throw_internal at /buildworker/worker/package_linux64/build/src/task.c:563
jl_rethrow at /buildworker/worker/package_linux64/build/src/task.c:584
unknown function (ip: 0xffffffffffffffff)

una-dinosauria commented 5 years ago

Closing, since this seems to have been caused by the GPU running out of memory (OOM) and the host machine not killing Julia. Cross-reference: https://github.com/una-dinosauria/Rayuela.jl/issues/40#issuecomment-532334347
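
For anyone hitting a similar hang, here is a minimal sketch for checking the symptom described in this thread: GPU memory nearly full while utilization sits at zero. It is written in Python and is not part of Rayuela.jl. It assumes `nvidia-smi` is on the PATH and reports plain integers for every field; the helper names (`parse_gpu_stats`, `looks_memory_bound`, `query_gpu_stats`) and the 95% threshold are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    Each line is expected to look like "used, total, util" with memory
    in MiB and utilization in percent. Fields reported as "[N/A]" are
    not handled here and would raise ValueError.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        stats.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return stats


def looks_memory_bound(stat, threshold=0.95):
    """Hypothetical heuristic: memory almost full while the GPU is idle."""
    full = stat["used_mib"] / stat["total_mib"] >= threshold
    return full and stat["util_pct"] == 0


def query_gpu_stats():
    """Run nvidia-smi if it is installed; return [] otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for index, stat in enumerate(query_gpu_stats()):
        flag = "possible OOM stall" if looks_memory_bound(stat) else "ok"
        print(
            f"GPU {index}: {stat['used_mib']}/{stat['total_mib']} MiB, "
            f"{stat['util_pct']}% util -> {flag}"
        )
```

Running this in a second terminal while training appears stuck shows whether the device is out of memory. It does not diagnose why the host-to-device copy in the stack trace above fails to return.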