Open kcwu opened 3 years ago
Could you attach your log file from that run? That would help with debugging. Thanks!
Which log file do you want?
The log file corresponding to that run? Each time you start contribute, it should create a file in the katago_contribute directory, under logs/, with a date based on the start time of that instance.
I guess I know what happened.
Findings:
Ah. Okay, that sounds like that might explain it, thanks for the investigation! I was worried there was a memory leak or something in KataGo.
Any thoughts on a good option for fixing it? I guess I'll need to add some logic to instrument memory usage.
Some ideas:
CL_MEM_OBJECT_ALLOCATION_FAILURE. Tell users it might be that there is not enough GPU memory.
Made the error messages slightly nicer: https://github.com/lightvector/KataGo/commit/0e75f8c82397f1b2380c4b9c581313e65fb1a9bf
Estimating the required size for models and comparing it to the queried memory turns out to be tricky, so this is not going to be implemented in the near future. A workaround, for now, would be for users on memory-limited GPUs to disable rating games if necessary (maxRatingMatches=0). This is not advertised particularly loudly right now, since it is undesirable in general for people to use it, but in individual cases it could be a workaround.
One idea: when CL_MEM_OBJECT_ALLOCATION_FAILURE occurs, don't interrupt the whole program. Instead, bypass the rating task, sleep a longer while, and continue to request another task from the server. Give up (terminate the program) only if no other threads are running.
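A rough sketch of that idea, not KataGo's actual code. Everything here is hypothetical: GpuAllocFailure stands in for however CL_MEM_OBJECT_ALLOCATION_FAILURE would be surfaced, and Task, runWorker, fetchTask and runTask are made-up names. It also assumes the failure reaches the thread that runs the task, which may not hold in the real code base.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical exception standing in for an OpenCL allocation failure.
struct GpuAllocFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical task handed out by the server.
struct Task { int id = 0; };

enum class LoopResult { Finished, GaveUp };

// Counts this worker as active for as long as the guard is alive.
struct ActiveGuard {
  std::atomic<int>& count;
  explicit ActiveGuard(std::atomic<int>& c) : count(c) { ++count; }
  ~ActiveGuard() { --count; }
};

// fetchTask returns false when there is no more work to request.
// runTask may throw GpuAllocFailure.
LoopResult runWorker(
  const std::function<bool(Task&)>& fetchTask,
  const std::function<void(const Task&)>& runTask,
  std::atomic<int>& activeWorkers,
  std::chrono::milliseconds backoff
) {
  ActiveGuard guard(activeWorkers);
  Task task;
  while(fetchTask(task)) {
    try {
      runTask(task);
    }
    catch(const GpuAllocFailure&) {
      // Give up only if this is the last worker still running.
      if(activeWorkers.load() <= 1)
        return LoopResult::GaveUp;
      // Otherwise skip the failed task, wait, and ask for another one.
      std::this_thread::sleep_for(backoff);
    }
  }
  return LoopResult::Finished;
}
```

Each worker thread would call runWorker with a shared counter, and the program would terminate only when a worker returns LoopResult::GaveUp. The check on the counter is not synchronized with other workers failing at the same moment, so a real version would need more care there.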
If you have an idea of how to implement that, let me know. This error is buried VERY deep, and is not even directly associated with any particular task, due to being in a different thread.
After running ./katago contribute -config contribute.cfg for 6.5 days, katago crashed with
At the time of the crash, I wasn't using the computer or running other programs. This was a one-time event. KataGo can continue to run by invoking the same command line.
This katago is tag v1.8.0, built with
cmake . -DUSE_BACKEND=EIGEN -DUSE_TCMALLOC=1 -DCMAKE_CXX_FLAGS='-march=native' -DUSE_AVX2=1 -DBUILD_DISTRIBUTED=1