Running trainGMMOT.py, the process always terminates unexpectedly without any error information

shuaihuachen commented 2 years ago

Hi!

When I run trainGMMOT.py, the process terminates unexpectedly every time. There is no error message; the terminal only prints 'Killed'. What might cause this?

The log output is as follows:

MOT17-04
210
/usr/local/lib/python3.6/site-packages/torch_geometric/deprecation.py:13: UserWarning: 'data.DataLoader' is deprecated, use 'loader.DataLoader' instead
  warnings.warn(out)
Start training...
Epoch 0/1
----------
lr = 1.00e-05
Killed

jiaweihe1996 commented 2 years ago

This may be caused by running out of memory. Please monitor memory usage while the code is running to check whether that is the problem.
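A bare 'Killed' with no Python traceback is commonly what the Linux out-of-memory killer leaves behind, so checking host RAM (not GPU memory) is a reasonable first step. The sketch below is one possible way to log the peak resident memory of a Python process using only the standard library; `peak_rss_mib` is a hypothetical helper name and is not part of GMTracker. It assumes a Unix-like system, since the `resource` module is not available on Windows.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Allocate some memory so the reading visibly reflects the allocation.
    buffer = bytearray(50 * 1024 * 1024)
    print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Calling such a helper once per training iteration (for example, next to the `lr = ...` log line) would show whether host memory climbs steadily before the process is killed.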

shuaihuachen commented 2 years ago

Only the CPU memory usage is very high, but I still can't tell whether the problem was caused by running out of memory.

Anyway, that environment was temporary; it ran on a paid cloud server. I have now successfully installed all the packages GMTracker needs on the server in my lab (with an NVIDIA 2080 Ti GPU). When I ran trainGMMOT.py there, I got the following error log:

Traceback (most recent call last):
  File "trainGMMOT.py", line 175, in <module>
    model = train_model(
  File "trainGMMOT.py", line 92, in train_model
    s_pred = model(graph_tracks, graph_dets,iou)
  File "/root/.pyenv/versions/3.8.3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/workspace/code/GMTracker-main/GMMOT/model.py", line 155, in forward
    K1Me = a2*K_G
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 29578379040 bytes. Error code 12 (Cannot allocate memory)

This error is obviously caused by running out of memory. But why does it need so much memory, and how can I fix it? I really need your help. qaq

By the way, thanks for all your replies to the issues I have posted. :)

shuaihuachen commented 2 years ago

It's really weird, because Volatile GPU-Util and GPU Memory-Usage are both pretty low.

jiaweihe1996 commented 2 years ago

This is because the matrices involved in the Kronecker product operation (Eq. 8 in our paper) become very large in crowded scenes (such as MOT17-04 in the training stage). Another variant that uses batch operations can greatly reduce the memory overhead (refer to the inference code). The code will be updated soon.
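To give a feel for the growth, here is a rough back-of-the-envelope sketch. It relies only on the general fact that the Kronecker product of an (m x n) matrix and a (p x q) matrix has shape (m*p) x (n*q). The helper name `kron_bytes`, the square shapes and the float32 element size are assumptions for illustration; they are not the actual shapes used in GMTracker's model.py.

```python
def kron_bytes(shape_a, shape_b, itemsize=4):
    """Bytes needed to store a dense Kronecker product of two matrices.

    shape_a and shape_b are (rows, cols) tuples; itemsize=4 assumes float32.
    """
    rows = shape_a[0] * shape_b[0]
    cols = shape_a[1] * shape_b[1]
    return rows * cols * itemsize


if __name__ == "__main__":
    # Hypothetical sizes: two square n x n matrices, so the product has n**4 entries.
    for n in (50, 100, 200, 300):
        gib = kron_bytes((n, n), (n, n)) / 2**30
        print(f"n = {n:3d}: dense Kronecker product needs {gib:8.2f} GiB")

    # The allocation size reported in the traceback above, converted to GiB.
    print(f"reported allocation: {29578379040 / 2**30:.1f} GiB")
```

Under these assumptions memory grows with the fourth power of n, so doubling the number of objects in a frame multiplies the requirement by 16. The 29578379040 bytes in the traceback come to roughly 27.5 GiB, which is more than a 16 GB machine can provide.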

OnurSelim commented 2 years ago

Hi!

I am having the same problem. Sometimes the execution is simply killed by the operating system, and sometimes I get the same error as @EstellalovesElk. Were you able to find a solution? Thanks in advance.

Note: I have 16 GB of RAM, a GTX 1080 and an Intel i7-8700 CPU.

shuaihuachen commented 2 years ago

Hi! The contributor said they have a version of the code that solves this problem, but it seems they have not released it yet, so please wait. @OnurSelim

jiaweihe1996 commented 2 years ago

The new training code has been released on the dev branch. Please refer to https://github.com/jiaweihe1996/GMTracker/blob/dev/GMMOT/model.py

jiaweihe1996 / GMTracker, issue #11