Nothing in the code immediately stands out to me as the culprit. I have a few probing questions:
* How is Hubot hosted? i.e. in Kubernetes, on an EC2 instance, etc.? A Docker container in ECS.
* How many instances of Hubot are running? 1.
* Does a single instance of Hubot have access to Prod, Dev, and Stage? Yes, read-only access. And importantly, it's to the ECS clusters, not separate accounts/environments/etc.
* What version of Hubot is running? 11.1.
Does it only respond 4 times when the value is "Production"?
Or when I leave it at the default. So when `cluster === Production`.
What chat adapter are you using? Does it respond 4 times with the same answer?
https://github.com/hubot-friends/hubot-slack Yep. Exact same response, 4 times. Also takes about 20 minutes to get all four replies.
(Updated all that in the initial question, too)
Ok. I've seen this behavior before during development. The issue was that the code failed to acknowledge the message. In that situation, the Slack system will "retry sending the message" (the Events API retries up to three times, which would line up with exactly four identical replies). Here's where the code is supposed to acknowledge the message.
I also see an issue in the Slack Adapter: it's not awaiting `robot.receive`. I'm unsure what that will cause, but I'll have to push a fix for that.
Can you start Hubot with `HUBOT_LOG_LEVEL=debug` to see what line of code the execution is getting to?
I've also added the `await` call in the Slack Adapter.
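For reference, here's a minimal sketch of the acknowledgement pattern in a Socket Mode listener. The argument shape is assumed from @slack/socket-mode, and `toHubotMessage` is a hypothetical mapping helper, not the adapter's actual code:

```
socket.on('message', async ({ event, ack }) => {
  // Ack the envelope first; if Slack never receives the ack,
  // it treats delivery as failed and resends the event.
  await ack()
  // Then hand the message to Hubot, awaited so errors surface here.
  await robot.receive(toHubotMessage(event))
})
```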
Seems to just...receive the message multiple times? To be clear...I definitely only typed it once, but this pattern (and I'm hesitant to give you full log messages...) looks like it's just...getting the message again.
Updated to the new adapter and I still get the duplicate messages. :(
Another thought is to await `res.send`, because it's async.
`await res.send()` also doesn't help.
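For context, a minimal sketch of what that attempt looks like in the handler (`buildStaleTaskReport` is a hypothetical stand-in for the real AWS-querying body):

```
robot.respond(/ecs list stale tasks( in )?([A-Za-z0-9-]+)?/i, async res => {
  // The slow part: several sequential AWS calls (hypothetical helper).
  const report = await buildStaleTaskReport(res.match[2])
  // send is async, so await it; on its own this didn't stop the retries.
  await res.send(report)
})
```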
Is it odd that the `envelope_id` is different for each of those messages?
Can you run a Hubot instance locally on your machine and replicate the behavior?
It sounds like you might have a plausible cause, so add several grains of salt to anything in this comment :)
When I've observed Hubot get into a repeats-replies state, I had a suspicion it related to functionality such as remind-her or polling plugins (e.g. watch statuspage, report when status changes). It seemed like the use of `setTimeout()` or `setInterval()` could create concurrent threads. (The fact that you see it reply four times specifically suggests to me this doesn't quite fit... but maybe there's a magic number in that system I don't know about.)
If the current best theory doesn't pan out, maybe consider which plugins could be disabled to isolate the behaviour?
There is a timeout in the Slack response! Because this query to AWS is relatively slow, that doesn't entirely surprise me:
{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Text = @hubot ecs list stale tasks"}
{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Event subtype = undefined"}
{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Received generic message: message"}
{"level":20,"time":1709307623090,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Message '@hubot ecs list stale tasks' matched regex //^\\s*[@]?Hubot[:,]?\\s*(?:ecs list stale tasks( in )?([A-Za-z0-9-]+)?)/i/; listener.options = { id: null }"}
{"level":20,"time":1709307626395,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"eventHandler {
\"envelope_id\": \"bd22596e-ee19-4201-8250-792f91fc96d7\",
\"body\": {
\"token\": \"<>\",
\"team_id\": \"<>\",
\"context_team_id\": \"<>\",
\"context_enterprise_id\": null,
\"api_app_id\": \"<>\",
\"event\": {
\"client_msg_id\": \"<>\",
\"type\": \"message\",
\"text\": \"<@hubot> ecs list stale tasks\",
\"user\": \"<>\",
\"ts\": \"1709307622.850469\",
\"blocks\": [
{
\"type\": \"rich_text\",
\"block_id\": \"5X8EE\",
\"elements\": [
{
\"type\": \"rich_text_section\",
\"elements\": [
{
\"type\": \"user\",
\"user_id\": \"<>\"
},
{
\"type\": \"text\",
\"text\": \" ecs list stale tasks\"
}
]
}
]
}
],
\"team\": \"<>\",
\"channel\": \"<>\",
\"event_ts\": \"1709307622.850469\",
\"channel_type\": \"channel\"
},
\"type\": \"event_callback\",
\"event_id\": \"<>\",
\"event_time\": 1709307622,
\"authorizations\": [
{
\"enterprise_id\": null,
\"team_id\": \"<>\",
\"user_id\": \"<>\",
\"is_bot\": true,
\"is_enterprise_install\": false
}
],
\"is_ext_shared_channel\": false,
\"event_context\": \"<>\"
},
\"event\": {
\"client_msg_id\": \"<>\",
\"type\": \"message\",
\"text\": \"<@hubot> ecs list stale tasks\",
\"user\": \"<>\",
\"ts\": \"1709307622.850469\",
\"blocks\": [
{
\"type\": \"rich_text\",
\"block_id\": \"5X8EE\",
\"elements\": [
{
\"type\": \"rich_text_section\",
\"elements\": [
{
\"type\": \"user\",
\"user_id\": \"<>\"
},
{
\"type\": \"text\",
\"text\": \" ecs list stale tasks\"
}
]
}
]
}
],
\"team\": \"<>\",
\"channel\": \"<>\",
\"event_ts\": \"1709307622.850469\",
\"channel_type\": \"channel\"
},
\"retry_num\": 1,
\"retry_reason\": \"timeout\",
\"accepts_response_payload\": false
}"
}
```
{"level":20,"time":1709307626395,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"event {
\"envelope_id\": \"<>\",
\"body\": {
\"token\": \"<>\",
\"team_id\": \"<>\",
\"context_team_id\": \"<>\",
\"context_enterprise_id\": null,
\"api_app_id\": \"<>\",
\"event\": {
\"client_msg_id\": \"<>\",
\"type\": \"message\",
\"text\": \"<@hubot> ecs list stale asks\",
\"user\": \"<>\",
\"ts\": \"1709307622.850469\",
\"blocks\": [
{
\"type\": \"rich_text\",
\"block_id\": \"5X8EE\",
\"elements\": [
{
\"type\": \"rich_text_section\",
\"elements\": [
{
\"type\": \"user\",
\"user_id\": \"<>\"
},
{
\"type\": \"text\",
\"text\": \" ecs list stale tasks\"
}
]
}
]
}
],
\"team\": \"<>",
\"channel\": \"<>\",
\"event_ts\": \"1709307622.850469\",
\"channel_type\": \"channel\"
},
\"type\": \"event_callback\",
\"event_id\": \"<>\",
\"event_time\": 1709307622,
\"authorizations\": [
{
\"enterprise_id\": null,
\"team_id\": \"<>\",
\"user_id\": \"<>\",
\"is_bot\": true,
\"is_enterprise_install\": false
}
],
\"is_ext_shared_channel\": false,
\"event_context\": \"<>\"
},
\"event\": {
\"client_msg_id\": \"<>\",
\"type\": \"message\",
\"text\": \"<@hubot> ecs list stale tasks\",
\"user\": \"<>",
\"ts\": \"1709307622.850469\",
\"blocks\": [
{
\"type\": \"rich_text\",
\"block_id\": \"5X8EE\",
\"elements\": [
{
\"type\": \"rich_text_section\",
\"elements\": [
{
\"type\": \"user\",
\"user_id\": \"<>\"
},
{
\"type\": \"text\",
\"text\": \" ecs list stale tasks\"
}
]
}
]
}
],
\"team\": \"<>\",
\"channel\": \"<>\",
\"event_ts\": \"1709307622.850469\",
\"channel_type\": \"channel\"
},
\"retry_num\": 1,
\"retry_reason\": \"timeout\",
\"accepts_response_payload\": false}
user = <>"
}
```
It's definitely me racing a timeout! I changed the code to batch AWS requests more efficiently and I'm no longer getting duplicate messages!
Relevant code:
```
/*
 * Stale Deploys
 */
robot.respond(/ecs list stale tasks( in )?([A-Za-z0-9-]+)?/i, async res => {
  const cluster = res.match[2] || defaultCluster
  const services = await paginateServices(ecsClient, cluster)
  // no need to sort these results
  const serviceNames = services.map((x) => x.split('/')[x.split('/').length - 1])
  const staleDateShort = new Date(Date.now() - shortExpireSecs)
  const staleDateLong = new Date(Date.now() - longExpireSecs)
  const expiredDate = new Date(Date.now() - expiredSecs)
  let ignored = []
  let shortExp = []
  let longExp = []
  let exp = []
  /*
   * Collect service data
   */
  const chunkSize = 10
  for (let i = 0; i < serviceNames.length; i += chunkSize) {
    let chunk = serviceNames.slice(i, i + chunkSize)
    const ignoredFromChunk = chunk.filter((service) => ignoredServices.includes(service))
    ignored.push.apply(ignored, ignoredFromChunk)
    chunk = chunk.filter((service) => !ignoredServices.includes(service))
    if (chunk.length < 1) {
      continue
    }
    let input = {
      cluster,
      services: chunk,
      include: []
    }
    let command = new DescribeServicesCommand(input)
    let serviceData
    try {
      serviceData = await ecsClient.send(command)
      serviceData = serviceData.services
    } catch (err) {
      robot.logger.error(`Request to AWS failed: ${err}`)
      // skip this chunk: serviceData is undefined after a failed request
      continue
    }
    for (let idx = 0; idx < serviceData.length; idx++) {
      const deployDate = new Date(serviceData[idx].deployments[0].createdAt)
      // skip any service newer than our longest expiration window
      if (deployDate > staleDateLong) {
        continue
      }
      const servString = `\`${serviceData[idx].serviceName}\` (deployed ${deployDate.toISOString()})`
      if (deployDate < expiredDate) {
        exp.push(servString)
      } else if (deployDate < staleDateShort) {
        shortExp.push(servString)
      } else {
        longExp.push(servString)
      }
    }
  }
```
Well done tracking down this bug.
I don't see the code that "batches the AWS requests". Would you mind pointing it out for me? I'd love to see how you solved it.
I'm also curious if there's a move I can make to the Slack Adapter to either not let this situation happen or make it very visible that it's happening.
Sure. The batch happens here:
```
const chunkSize = 10
for (let i = 0; i < serviceNames.length; i += chunkSize) {
  let chunk = serviceNames.slice(i, i + chunkSize)
  const ignoredFromChunk = chunk.filter((service) => ignoredServices.includes(service))
  ignored.push.apply(ignored, ignoredFromChunk)
  chunk = chunk.filter((service) => !ignoredServices.includes(service))
  if (chunk.length < 1) {
    continue
  }
  let input = {
    cluster,
    services: chunk,
    include: []
  }
  let command = new DescribeServicesCommand(input)
  let serviceData
  try {
    serviceData = await ecsClient.send(command)
    serviceData = serviceData.services
  } catch (err) {
    robot.logger.error(`Request to AWS failed: ${err}`)
    // skip this chunk: serviceData is undefined after a failed request
    continue
  }
```
Let's expand that a bit. Instead of doing:
```
for (let i = 0; i < serviceNames.length; i++) {
  const service = serviceNames[i]
  if (ignoredServices.includes(service)) {
    ignored.push(service)
    continue
  }
  let input = {
    cluster,
    services: [service],
    include: []
  }
  let command = new DescribeServicesCommand(input)
  let serviceData
  try {
    serviceData = await ecsClient.send(command)
    serviceData = serviceData.services[0]
  } catch (err) {
    robot.logger.error(`Request to AWS failed: ${err}`)
  }
```
I now loop through the list of `serviceArns` in groups of 10 (and do some filtering). This means that I send a request like `['service1', 'service2', ..., 'service10']` instead of `['service1']`, `['service2']`, etc., reducing the time taken collecting data from AWS by a factor of 10. (ECS's DescribeServices accepts at most 10 services per call, which is why the chunks are 10.)
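To put rough numbers on it (illustrative values, not measurements):

```
// With N services, the old code made N DescribeServices calls;
// the new code makes ceil(N / 10) calls at up to 10 services each.
const serviceCount = 120                              // hypothetical fleet size
const oldCalls = serviceCount                         // 120 sequential calls
const newCalls = Math.ceil(serviceCount / 10)         // 12 batched calls
// At an assumed ~250ms per round trip, that's ~30s of AWS time
// down to ~3s, which keeps the handler inside Slack's ack window.
```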
I think perhaps surfacing the request timeout (somehow) would be amazing. Just so we know it's there.
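One hedged idea, going off the `retry_num`/`retry_reason` fields visible in the debug logs above: the adapter could log loudly whenever Slack redelivers an envelope, e.g.:

```
// Sketch only: assumes the adapter can see the envelope fields
// that appear in the debug output above (retry_num, retry_reason).
if (envelope.retry_num > 0) {
  robot.logger.info(
    `Slack redelivered envelope ${envelope.envelope_id} ` +
    `(retry ${envelope.retry_num}, reason: ${envelope.retry_reason})`
  )
}
```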
I see. The new code changes

```
let input = {
  cluster,
  services: [service],
  include: []
}
```

to

```
let input = {
  cluster,
  services: services,
  include: []
}
```

where `services` is an array of service names without the ignored ones.
Closing this as it seems to be more an issue with timeouts within adapters. Thanks for the help!
We've written a custom hubot script to interact with AWS and report back to users. One of our responses will respond 4 times (with the exact same data, over ~20 minutes) in certain circumstances, and I can't understand what's happening.
So it starts normal enough:
So then I can say something like `@hubot ecs list stale tasks in Production` and Hubot will come back with data about what we consider stale tasks. What's interesting is that depending on the cluster I select, Hubot will either reply once (expected) or 4 times (far less expected):

* `@hubot ecs list stale tasks in Production` responds 4 times with the exact same data, over about 20 minutes.
* `@hubot ecs list stale tasks in Development` responds once.
* `@hubot ecs list stale tasks in Staging` responds once.

We're using https://github.com/hubot-friends/hubot-slack as our adapter.
We only run a single instance of Hubot (because otherwise it can go split-brain) in a Docker container in ECS. We are on the most recent release of Hubot (11.1.1).
My instinct is that it's something in the data, or how I'm handling pagination with AWS requests. But I'm honestly not sure. So any hints that y'all could provide would be amazing.
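For reference, since pagination comes up here: a minimal sketch of what a `paginateServices` helper could look like with the SDK v3 paginator. This is an assumption about the helper's shape, not the script's actual implementation:

```
import { paginateListServices } from '@aws-sdk/client-ecs'

// Collect every service ARN in a cluster, following pagination.
async function paginateServices (ecsClient, cluster) {
  const arns = []
  const paginator = paginateListServices({ client: ecsClient }, { cluster })
  for await (const page of paginator) {
    arns.push(...page.serviceArns)
  }
  return arns
}
```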
I'll provide the entire script below, just please don't judge my JS too harshly, I've never been good at it.
Actual script
```
"use strict";
// Description:
//   Tool to introspect into ECS
//   Upstream docs: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/ecs/
//
// Dependencies:
//   "@aws-sdk/client-ecs": "^3"
//
// Configuration:
//   HUBOT_AWS_REGION
//   HUBOT_AWS_ACCESS_KEY_ID
//   HUBOT_AWS_SECRET_ACCESS_KEY
//
// Commands:
//   hubot ecs list clusters - returns all ECS clusters in the defined region
//   hubot ecs describe cluster
```