Closed esevan closed 4 years ago
Hi Evan, might the missing kernel-id stem from it being culled? I suspect the client "has record" of the kernel, but EG has since cleaned it up.
This side effect from the previous change is a little concerning. Does the retry happen every 5-6 milliseconds if not backed off? I.e., was your EG log filled with these stack traces in mere seconds?
So if we back-off, is it only decreasing the frequency of this log dumping?
If nb2kg gets the 404, I wonder if we should use a couple of failed retries to trigger cleanup of the orphaned id on the client?
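To make the 404 idea concrete, here is a minimal sketch of counting consecutive 404 responses per kernel id and signalling cleanup after a couple of misses. `OrphanTracker`, `record`, and `max_not_found` are hypothetical names for illustration only; none of this is existing nb2kg or EG code.

```python
from collections import defaultdict


class OrphanTracker:
    """Counts consecutive 404s per kernel id (hypothetical helper, not nb2kg API)."""

    def __init__(self, max_not_found=2):
        # Assumed threshold: "a couple" of failed retries.
        self.max_not_found = max_not_found
        self._misses = defaultdict(int)

    def record(self, kernel_id, status_code):
        """Return True when the client should drop its record of kernel_id."""
        if status_code != 404:
            # Any non-404 response resets the streak for this kernel.
            self._misses.pop(kernel_id, None)
            return False
        self._misses[kernel_id] += 1
        if self._misses[kernel_id] >= self.max_not_found:
            # Threshold reached: forget the counter and signal cleanup.
            del self._misses[kernel_id]
            return True
        return False
```

The caller would feed each gateway response status into `record` and remove its local kernel entry once it returns `True`, so a kernel that EG no longer knows about stops being retried.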
@kevin-bates Hi! I suspected a culling event too, but that was not the case. After the kernel was culled, the client accepted it and abandoned the kernel.
I think it happened in either the kernel restart logic or the kernel session recovery logic, because I observed that EG had been restarted before it happened and there were a few orphaned kernels.
As for my PR, I set the back-off to decrease the load on EG and also set a retry limit so that cleanup happens after a couple of retries. As you mentioned, the EG log was completely filled with those stack traces within a second.
If the root cause had been found, I could have posted another PR to handle the 404 response or to do more exact exception handling. This was the only option to fix the problem quickly: fault tolerance.
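For readers following along, the back-off plus retry limit can be sketched roughly as below. This is an illustration under assumptions, not the code from the PR: `backoff_delays`, `connect_with_retry`, and all parameter names and default values are hypothetical, and `connect` stands in for whatever call re-establishes the kernel connection.

```python
import time


def backoff_delays(initial=1.0, factor=2.0, max_delay=30.0, max_retries=5):
    """Yield at most max_retries delays, growing by factor and capped at max_delay."""
    delay = initial
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= factor


def connect_with_retry(connect, sleep=time.sleep, **backoff):
    """Try connect() once, then retry with exponential back-off.

    connect is a hypothetical callable returning True on success.
    Returns False once the retry limit is exhausted, so the caller can clean up.
    """
    if connect():
        return True
    for delay in backoff_delays(**backoff):
        sleep(delay)  # wait longer before each successive retry
        if connect():
            return True
    return False
```

With the assumed defaults, the waits are 1, 2, 4, 8, and 16 seconds before giving up, instead of a retry every few milliseconds.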
@esevan - thank you for the additional information. It sure would be nice to understand when this occurs. That said, I think we should probably move forward with handling the bursting and I've submitted a review for the PR.
When this happened, what kinds of process proxies were in use - k8s, yarn, distributed?
@kevin-bates Thanks, I'll keep digging into this problem. I used the k8s process proxy with the session persistence configuration. FWIW, I modified EG for integration with other services in our env, so that is another suspect, I guess.
Thanks Evan - I'll keep on the lookout as well.
Since #42 does not limit the retry count, I found that EG was OOM-killed by too many retry requests to connect to an unknown `kernel_id`. I couldn't find out why it requested an unknown `kernel_id`, but I think this can be prevented by adding an exponential backoff algorithm and limiting the retry count.
I lost the nb2kg error log, so I'm attaching the following EG log instead. As I remember, nb2kg was writing logs like `Attempting to re-establish...` over and over.