CL_MEM_OBJECT_ALLOCATION_FAILURE after long run

lightvector / KataGo

GTP engine and self-play learning in Go

https://katagotraining.org/

Other

3.47k stars 563 forks source link

CL_MEM_OBJECT_ALLOCATION_FAILURE after long run #406

Open kcwu opened 3 years ago

kcwu commented 3 years ago

After running ./katago contribute -config contribute.cfg 6.5 days, katago crashed with

terminate called after throwing an instance of 'StringError'                                                                                              
  what():  OpenCL error at /home/kcwu/src/katago/cpp/neuralnet/openclbackend.cpp, func err, line 1188, error CL_MEM_OBJECT_ALLOCATION_FAILURE

At the crash time, I didn't use the computer and didn't run other programs. This is just one time event. Katago can continue to run by invoking the same command line.

This katago is tag v1.8.0, build with cmake . -DUSE_BACKEND=EIGEN -DUSE_TCMALLOC=1 -DCMAKE_CXX_FLAGS='-march=native' -DUSE_AVX2=1 -DBUILD_DISTRIBUTED=1

lightvector commented 3 years ago

Could you attach your log file from that run? That would help with debugging. Thanks!

kcwu commented 3 years ago

Which log file do you want?

lightvector commented 3 years ago

The log file corresponding to that run? Each time you start contribute it should make a file in the contribute katago_contribute directory, under logs/, with date based on the timestamp of the starting time of that instance.

kcwu commented 3 years ago

log20210125-203927-B29F88B22CC56B99.log attached

kcwu commented 3 years ago

I guess I know what happened.

Findings:

This GPU has only 4GB ram. The system and chrome already ate roughly 0.9GB. Only 3.1GB is available to katago.
1 model takes about 0.75GB
At the crashing time,
- kata1-b40c256-s5942610176-d1431052518 is training
- kata1-b40c256-s5942610176-d1431052518 vs kata1-b40c256-s5464776448-d1316114216 are just finished (12 minutes before crash) and not unloaded
- kata1-b40c256-s5805907200-d1398250937 vs kata1-b40c256-s5153959424-d1242653604 started and crashed
- 0.9+0.75*4=3.9 which is too close the limit.

lightvector commented 3 years ago

Ah. Okay, that sounds like that might explain it, thanks for the investigation! I was worried there was a memory leak or something in KataGo.

Any thoughts on a good option for fixing it? I guess I'll need to add some logic to instrument memory usage.

kcwu commented 3 years ago

Some ideas:

At least, we can improve the error message for CL_MEM_OBJECT_ALLOCATION_FAILURE. Tell users it might be GPU memory not enough.
Is it possible to query available GPU ram and estimate the required size for models? If yes, disable rating games temporarily if the memory is not enough to load more. If the GPU is only capable to load one model, warn users and don't get more tasks before all games are finished.

lightvector commented 3 years ago

Made the error messages slightly nicer: https://github.com/lightvector/KataGo/commit/0e75f8c82397f1b2380c4b9c581313e65fb1a9bf

Estimating required size for models and comparing it to query turns out to be tricky, so this is not going to be implemented in the near future. A workaround, for now, would be just for users on memory-limited GPUs to disable rating games if necessary (maxRatingMatches=0). This is not advertised particularly loudly right now, since it is undesirable in general for people to be using it, but in individual cases it could be a workaround.

kcwu commented 3 years ago

One idea: when CL_MEM_OBJECT_ALLOCATION_FAILURE occurs, don't interrupt the whole program. Instead, bypass the rating task, sleep a longer while, and continue to request another task from the server. Give up (terminate the program) only if no other threads are running.

lightvector commented 3 years ago

If you have an idea of how to implement that, let me know. This error is buried VERY deep, and is not even directly associated with any particular task, due to being in a different thread.