New error : line 5: 2686631 Killed LabGym

umyelab / LabGym

Quantify user-defined behaviors.

GNU General Public License v3.0


New error : line 5: 2686631 Killed LabGym #156

Open apenemo opened 1 month ago

apenemo commented 1 month ago

I hit this error while training a categorizer with augmented data: /data/labgym/bin/labgym: line 5: 2686631 Killed LabGym

I have not seen it while training categorizers without augmenting data. Do you think augmentation could be the cause, or does it come from somewhere else?

Many thanks !

yujiahu415 commented 1 month ago

Hi, this is simply because your system ran out of memory. If adding more memory to your system is not feasible, you can select fewer augmentation methods to reduce memory consumption.

vincent-legoll commented 1 month ago

@yujiahu415 are we talking about main system RAM, or GPU VRAM? The former can be increased, but not the GPU's...

The GPU card used by @apenemo is an NVIDIA Tesla T4 with 16GB VRAM.

yujiahu415 commented 1 month ago

What I meant was RAM (CPU memory), not VRAM. If there is not enough RAM, you can use paging (swap) to back virtual memory with a drive / hard disk, although this will be slower than simply having more RAM.
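To check how much RAM and swap the machine actually has before changing anything, a minimal sketch follows. It assumes a Linux host that exposes `/proc/meminfo`; the helper names `parse_meminfo` and `summarize` are made up for illustration and are not part of LabGym.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" layout
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize(info):
    """Convert the RAM and swap fields from kB to GiB."""
    kib_per_gib = 1024 ** 2
    return {
        "ram_total_gib": info.get("MemTotal", 0) / kib_per_gib,
        "ram_available_gib": info.get("MemAvailable", 0) / kib_per_gib,
        "swap_total_gib": info.get("SwapTotal", 0) / kib_per_gib,
        "swap_free_gib": info.get("SwapFree", 0) / kib_per_gib,
    }


meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():  # only present on Linux
    for key, value in summarize(parse_meminfo(meminfo_path.read_text())).items():
        print(f"{key}: {value:.1f}")
```

If `swap_total_gib` is 0, no swap is configured, so the kernel has nothing to fall back on once RAM is exhausted.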

vincent-legoll commented 1 month ago

We can easily increase CPU RAM in the VM without resorting to swap (which would kill performance). The VM currently has 64GB RAM, and I think OP's problem really was about GPU RAM: the process was killed upon requesting a 16GB allocation, which would likely fit in RAM. See the warning below, which was logged just before the process was killed.

2024-05-30 14:24:57.652486: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at transpose_op.cc:184 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[240,45,45,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.

More complete error output available if you need it.
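As a sanity check on the warning quoted above, the size of the tensor it names can be computed from its shape. This is plain arithmetic, assuming TensorFlow's `float` type means 4-byte float32; `tensor_bytes` is a throwaway helper, not a TensorFlow or LabGym function.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    return math.prod(shape) * bytes_per_element


shape = (240, 45, 45, 64)  # shape reported in the RESOURCE_EXHAUSTED warning
size = tensor_bytes(shape)  # 124416000 bytes
print(f"{size} bytes = {size / 2**20:.1f} MiB")
```

That single tensor comes to roughly 119 MiB, far below the T4's 16GB. This would be consistent with the GPU allocator already being nearly full when the request arrived, although the fuller log may show other, larger requests.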

yujiahu415 commented 1 month ago

I see. Thanks for this info! Can you please provide more training details? For example, what is the duration of a behavior example? What Categorizer type did you use? What is the input size of the Categorizer? And how many training examples were there in total before augmentation? Thanks!
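These details matter because they drive the size of the training set held in memory. A back-of-envelope sketch follows. It assumes each augmentation method adds one full copy of every example and that examples are stored as float32 frame stacks. Both are assumptions for illustration, not a description of how LabGym stores or augments data, and the numbers in the usage line are hypothetical.

```python
def augmented_dataset_bytes(n_examples, n_augmentations, frames, height, width,
                            channels=3, bytes_per_value=4):
    """Rough size of a training set if every augmentation adds one copy per example."""
    copies = 1 + n_augmentations  # the original plus one copy per method (assumption)
    per_example = frames * height * width * channels * bytes_per_value
    return n_examples * copies * per_example


# Hypothetical inputs: 1000 examples, 9 augmentation methods, 15 frames of 64x64 RGB.
total = augmented_dataset_bytes(1000, 9, frames=15, height=64, width=64)
print(f"{total / 2**30:.1f} GiB")
```

Under these assumptions the footprint grows linearly with the number of augmentation methods, which is why selecting fewer of them lowers memory use.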