substratusai / lingo

Lightweight ML model proxy and autoscaler for kubernetes
https://www.substratus.ai
Apache License 2.0

lingo messenger crash causes restart of lingo #100

Open samos123 opened 2 months ago

samos123 commented 2 months ago

Error observed which triggered a restart:

2024-04-25T00:55:18Z    ERROR   setup   starting messenger  {"error": "pubsub (code=InvalidArgument): rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata = map[UTcZCGhRDk9eIz81IChFFgQIFAV8fXdTW3VdWhoHUQ0ZcnxpI2tYQQRTFAF6VVkeDGJcTkQHSaHm5PxXdabc3NvcRHFfXlsSCGpVXncBVAQadnRkcGhy-rjIwfD1jHsBNlDxo-OGZy2fhpgyZis9XxJLLD5-PTxFQV5AEkw2CURJUytDCypYEU4EISE-MD5FU0RQBhYsXUZI:PERMANENT_FAILURE_INVALID_ACK_ID]", "errorVerbose": "pubsub (code=InvalidArgument):\n    gocloud.dev/pubsub.newAckBatcher.func1\n        /go/pkg/mod/gocloud.dev@v0.37.0/pubsub/pubsub.go:794\n  - rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata = map[UTcZCGhRDk9eIz81IChFFgQIFAV8fXdTW3VdWhoHUQ0ZcnxpI2tYQQRTFAF6VVkeDGJcTkQHSaHm5PxXdabc3NvcRHFfXlsSCGpVXncBVAQadnRkcGhy-rjIwfD1jHsBNlDxo-OGZy2fhpgyZis9XxJLLD5-PTxFQV5AEkw2CURJUytDCypYEU4EISE-MD5FU0RQBhYsXUZI:PERMANENT_FAILURE_INVALID_ACK_ID]"}

I think this should be solved by restarting the messenger automatically, so that the autoscaler doesn't scale everything back to 0.

samos123 commented 3 days ago

This is happening frequently, pretty much during every large batch:

2024-07-05T18:07:15Z    ERROR    setup    starting messenger    {"error": "pubsub (code=InvalidArgument): rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata = PERMANENT_FAILURE_INVALID_ACK_ID]", "errorVerbose": "pubsub (code=InvalidArgument):\n    gocloud.dev/pubsub.newAckBatcher.func1\n        /go/pkg/mod/gocloud.dev@v0.37.0/pubsub/pubsub.go:794\n  - rpc error: code = InvalidArgument desc = Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.\nerror details: name = ErrorInfo reason = EXACTLY_ONCE_ACKID_FAILURE domain = pubsub.googleapis.com metadata =..

samos123 commented 1 day ago

I reverted the fix because it caused hangs. The correct fix will be to recreate the subscription instead.