JellyWX closed this issue 3 years ago.
Log from it going down a couple of minutes ago:
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::client::bridge::gateway::shard_runner] handle_voice_event; event=VoiceStateUpdate(VoiceStateUpdateEvent { guild_id: Some(GuildId(829618232096587796)), voice_state: VoiceState { channel_id: Some(ChannelId(829618232306171956)), deaf: false, guild_id: Some(GuildId(829618232096587796)), member: Some(Member { deaf: false, guild_id: GuildId(829618232096587796), joined_at: Some(2021-04-08T09:49:09.745312Z), mute: false, nick: Some("Padre Buon Signore"), roles: [RoleId(829657588404584518), RoleId(829668076835307542)], user: User { id: UserId(766392003381690369), avatar: None, bot: false, discriminator: 2821, name: "Simone", public_flags: None }, pending: false, premium_since: None }), mute: false, self_deaf: false, self_mute: true, self_stream: None, self_video: false, session_id: "8be6a7d229377b203f71b3fe09b509d8", suppress: false, user_id: UserId(766392003381690369), request_to_speak_timestamp: None } })
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::client::dispatch] handle_event; runner_tx=UnboundedSender(Some(UnboundedSenderInner { inner: UnboundedInner { state: 9223372036854775808, message_queue: Queue { head: 0x7f68f7e9e460, tail: UnsafeCell }, num_senders: 3, recv_task: AtomicWaker } })) shard_id=5
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::gateway::shard] handle_event; event=Err(Tungstenite(Protocol(ResetWithoutClosingHandshake)))
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z WARN serenity::gateway::shard] [Shard [6, 8]] Websocket error: Protocol(ResetWithoutClosingHandshake)
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::gateway::shard] [Shard [6, 8]] Will attempt to auto-reconnect
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::internal::ws_impl] create_rustls_client; url=Url { scheme: "wss", username: "", password: None, host: Some(Domain("gateway.discord.gg")), port: None, path: "/", query: Some("v=8"), fragment: None }
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::gateway::shard] handle_event; event=Ok(Dispatch(8811, MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })))
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::gateway::shard] handle_gateway_dispatch; seq=8811 event=MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })
Apr 29 12:31:09 jellywx.com soundfx-rs[3875847]: [2021-04-29T12:31:09Z INFO serenity::client::bridge::gateway::shard_runner] handle_voice_event; event=MessageCreate(MessageCreateEvent { message: Message { id: MessageId(837304999943733268), attachments: [], author: User { id: UserId(424410634919084032), avatar: Some("7c7bc69d13d92a3097a8ffee5c88d2ea"), bot: false, discriminator: 2453, name: "Adokus", public_flags: Some((empty)) }, channel_id: ChannelId(705957876232880174), content: "https://youtu.be/neqa3ih2KHY", edited_timestamp: None, embeds: [Embed { author: Some(EmbedAuthor { icon_url: None, name: "\u{64b}", proxy_icon_url: None, url: Some("https://www.youtube.com/channel/UCw7CjJ7lFfpEC5FKmIS1ySw") }), colour: Colour(16711680), description: Some("\u{200f}\u{200f}\u{200e}"), fields: [], footer: None, image: None, kind: "video", provider: Some(EmbedProvider { name: "YouTube", url: Some("https://www.youtube.com") }), thumbnail: Some(EmbedThumbnail { height: 720, proxy_url: "https://images-ext-1.discordapp.net/external/QMI7MZcMl1jzCRCl7wdB2ANZjm4cBhT_-Pf9fsuPbfo/https/i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", url: "https://i.ytimg.com/vi/neqa3ih2KHY/maxresdefault.jpg", width: 1280 }), timestamp: None, title: Some("\u{200f}\u{200f}\u{200e}"), url: Some("https://www.youtube.com/watch?v=neqa3ih2KHY"), video: Some(EmbedVideo { height: 720, url: "https://www.youtube.com/embed/neqa3ih2KHY", width: 1280 }) }], guild_id: Some(GuildId(705957876232880169)), kind: Regular, member: Some(PartialMember { deaf: false, joined_at: Some(2020-05-02T01:45:50.897Z), mute: false, nick: Some("The Sad Kid"), roles: [RoleId(707072429935296564)], pending: false, premium_since: None }), mention_everyone: false, mention_roles: [], mention_channels: [], mentions: [], nonce: String("837304994687352832"), pinned: false, reactions: [], timestamp: 2021-04-29T12:31:08.361Z, tts: false, webhook_id: None, activity: None, application: None, message_reference: None, flags: Some((empty)), stickers: [], referenced_message: None } })
perf top:
I'm still none too sure as to the cause, but I have a gut feeling it's due to ShardHandle (de)registration in songbird::shards::SerenityShardHandle. From your trace there, it looks as though Shard 6's reconnect triggers it, but the surrounding events should be hitting shard 5...
I'm wondering if it's related to the OOM issue I had the other day on a non-songbird serenity bot, and whether songbird is just exacerbating it (i.e. songbird having a higher CPU footprint).
Either way, the issue seems to come and go a lot. Last week it was down almost daily, but I'm yet to run into an issue this week and the issue never seems to happen on the weekends. Possibly it's something to do with when Discord restarts their gateways?
I shoved some extra logging around the suspected issue spot in both v0.2.0-beta.1 (crates.io) and next. I haven't seen any issues on my side; I'll try this and force explicit sharding to see if I can repro.
Alright, thank you. I'll try running with that and see if I get any useful output
Thanks; did you filter these logs to only have songbird entries? Some of what is relevant might have ended up in serenity:
May 21 09:49:49.771 DEBUG run:recv_event:handle_voice_event{event=Ready(ReadyEvent { ready: Ready { application: PartialCurrentApplicationInfo { id: UserId(458276777832480769), flags: 262144 }, guilds: [Offline(GuildUnavailable { id: GuildId(192582352067624960), unavailable: true })], presences: {}, private_channels: {}, session_id: "4c3b49f3d950cc8d0e1de623779a2fba", shard: Some([0, 1]), trace: ["[\"gateway-prd-main-fw5s\",{\"micros\":37058,\"calls\":[\"discord-sessions-green-prd-2-100\",{\"micros\":33584,\"calls\":[\"start_session\",{\"micros\":29090,\"calls\":[\"api-prd-main-xngt\",{\"micros\":26241,\"calls\":[\"get_user\",{\"micros\":2332},\"add_authorized_ip\",{\"micros\":1230},\"get_guilds\",{\"micros\":4172},\"coros_wait\",{\"micros\":0}]}]},\"guilds_connect\",{\"micros\":2,\"calls\":[]},\"presence_connect\",{\"micros\":3952,\"calls\":[]}]}]}]"], user: CurrentUser { id: UserId(458276777832480769), avatar: None, bot: true, discriminator: 7265, email: None, mfa_enabled: false, name: "Test-small", verified: Some(true), public_flags: None }, version: 8 } })}: songbird::shards: Cleared queued messages.
May 21 09:49:49.772 DEBUG run:recv_event:handle_voice_event{event=Ready(ReadyEvent { ready: Ready { application: PartialCurrentApplicationInfo { id: UserId(458276777832480769), flags: 262144 }, guilds: [Offline(GuildUnavailable { id: GuildId(192582352067624960), unavailable: true })], presences: {}, private_channels: {}, session_id: "4c3b49f3d950cc8d0e1de623779a2fba", shard: Some([0, 1]), trace: ["[\"gateway-prd-main-fw5s\",{\"micros\":37058,\"calls\":[\"discord-sessions-green-prd-2-100\",{\"micros\":33584,\"calls\":[\"start_session\",{\"micros\":29090,\"calls\":[\"api-prd-main-xngt\",{\"micros\":26241,\"calls\":[\"get_user\",{\"micros\":2332},\"add_authorized_ip\",{\"micros\":1230},\"get_guilds\",{\"micros\":4172},\"coros_wait\",{\"micros\":0}]}]},\"guilds_connect\",{\"micros\":2,\"calls\":[]},\"presence_connect\",{\"micros\":3952,\"calls\":[]}]}]}]"], user: CurrentUser { id: UserId(458276777832480769), avatar: None, bot: true, discriminator: 7265, email: None, mfa_enabled: false, name: "Test-small", verified: Some(true), public_flags: None }, version: 8 } })}: songbird::manager: Registered shard handle 0.
Sorry, the serenity logging was off at the time since my journal was being annoying, but I've got it running with it now, so I'll just wait for it to die again.
Apologies for that; hopefully those logs will be a bit more revealing. It looks as though there were a few channel update events near the end of the logs you've submitted which might be the root cause.
Fingers crossed. Thank you for your patience with this, it's really appreciated!
dead.log
Here's 1,000 lines of dead log from after it went down, with both serenity and songbird in debug mode.
I can't find any mention of shard handle (de/re-)registration, which I thought were the most likely causes (i.e., in and around complete shard reconnect events). There are a few resumes, but nothing weird there (and nothing indicating a shard death due to that).
What's puzzling me is that the last events are all MessageCreates, which won't ever call anything to do with songbird (tracing pings them up, but they obviously fail to match Ready/VoiceStateUpdate/VoiceServerUpdate). There are some voice events nearby, but they seem to match other users/bots, and almost all shards are fine in those last few seconds. Maybe I'll spot something tomorrow. Might be something after serenity's shard_runner.rs:119, but before the next iteration begins.
dead.log
Another one here. Interestingly, another bot panicked at the exact same time with a BrokenPipe error (a bot running serenity 0.10.5), and a service that interacts with Discord hit a 'Connection reset by peer'.
I've also been hitting this issue and might be able to shed some light on it.
It varies a little by environment, but I'll focus on my Mac since that's where I'm able to gather the most info. The memory pressure on my system will spike, eventually causing it to freeze if I don't kill the bot fast enough.
While it was running on my live server I noticed the process was switching between the D and R states, and I was able to use strace to figure out what was going on. In the 1.5 seconds I traced it for, it allocated over 600 MB of memory without exceeding the 180 MB memory limit. Then it finally did hit that limit, at which point the OOM killer restarted the process.
Now that I knew it was an allocation issue, I used the wonderful cargo-instruments package with the macOS Instruments tool to profile the allocations while running it locally. The resulting trace from my first successful capture (after freezing my computer a few times) revealed the issue was an allocation of 384 bytes being allocated and freed super fast. If you want to see the trace, let me know; I'll have to find a place to upload it, since it's over 4 GB though.
Compiling in debug mode, I was able to get a useful stack trace, and since I knew the allocation size I was able to filter out irrelevant allocations. The problematic ones came from an __rdl_alloc primitive called from somewhere in tokio, and I was able to figure out that the only place that many bytes is allocated on a regular basis via the __rdl_alloc primitive is here.
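If anyone wants to sanity-check this without Instruments, something like a counting global allocator would probably work too. This is just a sketch, not what I actually ran; the 384-byte size comes from my trace, everything else here is made up:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Wrap the system allocator and count the suspicious 384-byte allocations,
// so a hot allocation path shows up without a profiler attached.
struct CountingAlloc;

static HITS_384: AtomicU64 = AtomicU64::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.size() == 384 {
            HITS_384.fetch_add(1, Ordering::Relaxed);
        }
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Call this periodically (e.g. from a background task) to watch the counter.
fn report_allocation_hits() {
    println!("384-byte allocations so far: {}", HITS_384.load(Ordering::Relaxed));
}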
Searching through the Songbird code, the only place where a delay_until call is made (sleep_until in tokio 1.x) is here: https://github.com/serenity-rs/songbird/blob/7e47d8bcc95d00cad4109dd48bb4e43e6a7e9477/src/driver/tasks/ws.rs#L72, which you'll notice happens inside a loop.
My hypothesis is that somehow one of those branches is triggering the loop to run with very little to no delay, causing the memory and CPU issues.
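To make that concrete, here's a toy version of the shape of the problem as I understand it. This is not songbird's actual code, just an illustrative loop; recv_maybe is a made-up stand-in:

use tokio::time::{sleep_until, Duration, Instant};

// Toy reproduction of the suspected failure mode: if the non-heartbeat branch
// resolves instantly on every pass (e.g. a dead socket being reported as
// "nothing to read"), the loop never waits, a fresh timer is built each
// iteration, and CPU/allocation usage explodes.
async fn spinny_loop() {
    let heartbeat_interval = Duration::from_secs(45);
    let mut next_heartbeat = Instant::now() + heartbeat_interval;

    loop {
        let hb = sleep_until(next_heartbeat);

        tokio::select! {
            _ = hb => {
                // Normal path: fires once per heartbeat interval.
                next_heartbeat = Instant::now() + heartbeat_interval;
            },
            msg = recv_maybe() => {
                // If this arm keeps winning with None and we neither wait
                // nor break, the loop spins flat out.
                if msg.is_none() {
                    // A dead connection should break out or reconnect here.
                }
            },
        }
    }
}

// Hypothetical stand-in for a receive call that returns immediately once the
// underlying connection has died.
async fn recv_maybe() -> Option<String> {
    None
}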
Hopefully this is enough detail to chase down the problem on your side. If not, I think the next move would be to add a bunch of logging/tracing to figure out which path in that loop is causing the issue and wait for it to happen again. I don't understand enough of what's going on in that loop to spot a possible problematic path, but it looks super stateful, so logging might be needed anyway.
Reproducing seems to depend on something kinda rare happening on the Discord side, which means I've had times where it failed in a few minutes and times where it ran with no issues for several days.
The timing of this issue seems to strongly indicate that. The issue has never happened for me on a weekend, and only very rarely in the early morning, which suggests that something the system admins at Discord are doing is causing this to happen.
Thanks very much for looking into this; I still haven't had it happen to me yet, so this extra input is a godsend.
There are two potential delay_until sites:
1) In the WS task, as you suggest. It could be a reconnect setting the heartbeat value to something insane? This also opens up the possibility that something subtle could be going wrong in the select statement.
2) The UDP TX task, as your logs show (which uses the timeout future). The only issue is that the heartbeat timing there is a sane compile-time constant (see the sketch after this list).
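A rough sketch of the two patterns, for reference. The names and the keepalive value are illustrative only, not the real songbird internals:

use std::future::Future;
use tokio::time::{sleep_until, timeout, Duration, Instant};

// Site 1 (WS task): the wait depends on a runtime value. If a reconnect ever
// set heartbeat_interval to something tiny or zero, this branch would fire
// continuously.
async fn ws_style_wait(heartbeat_interval: Duration) {
    let next_heartbeat = Instant::now() + heartbeat_interval;
    sleep_until(next_heartbeat).await;
}

// Site 2 (UDP TX task): the wait is bounded by a compile-time constant, so it
// cannot degenerate the same way.
const UDP_KEEPALIVE: Duration = Duration::from_millis(5000);

async fn udp_style_wait<F: Future>(work: F) {
    // Polls `work`, but gives up after the fixed keepalive period either way.
    let _ = timeout(UDP_KEEPALIVE, work).await;
}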
Ah, you're right, I missed that there were two call sites. Looking back at my traces I see both locations, so either call site is possible. Here is the stack trace for the other one. Unfortunately, since my release-mode traces are useless at identifying the issue, it seems I can't tell you which one is at fault (at least not yet).
Given that the UDP TX task is using a compile-time timeout, it seems likely the WS task is to blame.
Had a chance to look at this some more. I downloaded songbird and added a bunch of logging to the loop in question. Running it this morning, I managed to capture a 139,411,450-line (14 GB) log file from the run. Here's the relevant function with the extra logging; I've also clipped the relevant section of the logs.
short.txt.zip
#[instrument(skip(self))]
async fn run(&mut self, interconnect: &mut Interconnect) {
    let mut next_heartbeat = Instant::now() + self.heartbeat_interval;
    info!("Starting main ws loop");

    loop {
        let mut ws_error = false;
        let mut should_reconnect = false;
        let hb = sleep_until(next_heartbeat);
        info!("created heartbeat timeout");

        select! {
            _ = hb => {
                info!("heartbeat timeout reached, sending heartbeat");
                ws_error = match self.send_heartbeat().await {
                    Err(e) => {
                        error!("heartbeat error: {:?}", e);
                        should_reconnect = ws_error_is_not_final(e);
                        info!("should reconnect is {}", should_reconnect);
                        true
                    },
                    _ => {
                        info!("heartbeat successful");
                        false
                    },
                };
                info!("ws_error is {}", ws_error);
                next_heartbeat = self.next_heartbeat();
                info!("next_heartbeat is {:?}", next_heartbeat);
                info!("done sending heartbeat");
            }
            ws_msg = self.ws_client.recv_json_no_timeout(), if !self.dont_send => {
                info!("in recv_json_no_timeout, dont_send={}", self.dont_send);
                ws_error = match ws_msg {
                    Err(WsError::Json(e)) => {
                        error!("Unexpected JSON {:?}.", e);
                        false
                    },
                    Err(e) => {
                        error!("Got web socket error: {:?}", e);
                        should_reconnect = ws_error_is_not_final(e);
                        info!("should_reconnect={}", should_reconnect);
                        true
                    },
                    Ok(Some(msg)) => {
                        info!("got valid ws message, processing");
                        self.process_ws(interconnect, msg);
                        false
                    },
                    Ok(None) => {
                        info!("got empty ws message");
                        false
                    },
                };
                info!("ws_error={}", ws_error);
                info!("done handling json receive no timeout");
            }
            inner_msg = self.rx.recv_async() => {
                info!("handling inner_msg");
                match inner_msg {
                    Ok(WsMessage::Ws(data)) => {
                        info!("received data");
                        self.ws_client = *data;
                        next_heartbeat = self.next_heartbeat();
                        info!("next_heartbeat={:?}", next_heartbeat);
                        self.dont_send = false;
                        info!("dont_send={}", self.dont_send);
                    },
                    Ok(WsMessage::ReplaceInterconnect(i)) => {
                        info!("replacing interconnect");
                        *interconnect = i;
                    },
                    Ok(WsMessage::SetKeepalive(keepalive)) => {
                        info!("handling keepalive");
                        self.heartbeat_interval = Duration::from_secs_f64(keepalive / 1000.0);
                        info!("heartbeat_interval={:?}", self.heartbeat_interval);
                        next_heartbeat = self.next_heartbeat();
                        info!("next_heartbeat={:?}", next_heartbeat);
                    },
                    Ok(WsMessage::Speaking(is_speaking)) => {
                        info!("handling speaking");
                        if self.speaking.contains(SpeakingState::MICROPHONE) != is_speaking && !self.dont_send {
                            self.speaking.set(SpeakingState::MICROPHONE, is_speaking);
                            info!("Changing to {:?}", self.speaking);
                            let ssu_status = self.ws_client
                                .send_json(&GatewayEvent::from(Speaking {
                                    delay: Some(0),
                                    speaking: self.speaking,
                                    ssrc: self.ssrc,
                                    user_id: None,
                                }))
                                .await;
                            info!("sent speaking update");
                            ws_error |= match ssu_status {
                                Err(e) => {
                                    error!("Error while trying to send speaking state: {:?}", e);
                                    should_reconnect = ws_error_is_not_final(e);
                                    info!("should_reconnect={}", should_reconnect);
                                    true
                                },
                                _ => false,
                            };
                            info!("ws_error={}", ws_error);
                        }
                    },
                    Err(e) => {
                        error!("inner message error: {:?}", e);
                    },
                    Ok(WsMessage::Poison) => {
                        error!("got poison message, breaking");
                        break;
                    },
                }
                info!("done handling inner_msg");
            }
        }

        if ws_error {
            self.dont_send = true;
            if should_reconnect {
                let _ = interconnect.core.send(CoreMessage::Reconnect);
            } else {
                break;
            }
        }
    }
}
The logs change at timestamp 2021-06-01T10:59:05, and from what I can tell, recv_json_no_timeout repeatedly returns Ok(None) with no delay. I don't know what that case signifies or what the right way of handling it would be, but hopefully that helps narrow down the issue. Is there a way to trigger a reset of the voice connections/songbird state and restart as though it were a fresh start? Maybe that would fix it?
Also, from a general safety perspective, it seems like this loop (and maybe a few others) can potentially be problematic like this. Even if they're bug-free, Discord won't be, and could still cause some undesired behavior. It seems like it would be good to define some generic safety parameters (must run no more than 100 times per second, or if there's a vector or queue, check that it isn't getting really large when it shouldn't be, etc.) and bail back out to the initial state and reconnect from there. That might help keep these issues from becoming such a big pain.
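Something along these lines is what I have in mind. It's a hypothetical sketch; the names and the 100-per-second threshold are just placeholders:

use tokio::time::{Duration, Instant};

// Hypothetical guard: if a driver loop runs far more often than it plausibly
// should, assume something upstream is broken and bail back out to a full
// reconnect instead of spinning forever.
struct LoopGuard {
    window_start: Instant,
    iterations: u32,
    max_per_second: u32,
}

impl LoopGuard {
    fn new(max_per_second: u32) -> Self {
        Self {
            window_start: Instant::now(),
            iterations: 0,
            max_per_second,
        }
    }

    /// Call once per loop iteration; returns true if the loop is running
    /// suspiciously hot and should bail out/reconnect.
    fn tick(&mut self) -> bool {
        if self.window_start.elapsed() >= Duration::from_secs(1) {
            self.window_start = Instant::now();
            self.iterations = 0;
        }
        self.iterations += 1;
        self.iterations > self.max_per_second
    }
}

So at the top of the loop you'd call guard.tick() and treat a true result like any other fatal error: break out and reconnect from a clean state.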
Thanks very much. I think the fault may lie here: https://github.com/serenity-rs/songbird/blob/current/src/ws.rs#L103-L105. I think self.try_next().await.ok() is silently converting WS receive failures into None; self.try_next().await? is likely the correct substitution here. This is an educated guess, but it seems to be the most likely explanation given all your debugging work.
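As a standalone illustration of why that matters (a toy stream standing in for the websocket; this is not the actual songbird source):

use futures::stream::{self, TryStreamExt};

#[tokio::main]
async fn main() {
    // A stream that yields one value and then an error, standing in for a
    // websocket whose connection has been reset.
    let make_stream = || stream::iter(vec![Ok::<u32, &str>(1), Err("connection reset")]);

    // Lossy handling: .ok() swallows the error and turns it into None, which
    // the caller can easily misread as "nothing to receive right now".
    let mut lossy = make_stream();
    assert_eq!(lossy.try_next().await.ok().flatten(), Some(1));
    assert_eq!(lossy.try_next().await.ok().flatten(), None); // the error is gone

    // Strict handling: the error surfaces, so the driver can reconnect or
    // shut down instead of polling a dead connection in a tight loop.
    let mut strict = make_stream();
    assert_eq!(strict.try_next().await, Ok(Some(1)));
    assert_eq!(strict.try_next().await, Err("connection reset"));
}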
Would you be able to test out a fix? I have a branch here that applies the above modification; ideally, whatever WS issue is cropping up should actually prompt a reconnect now.
Yeah, happy to try it out. I actually added the fix to my branch yesterday and left it running overnight. What I found was that it does correctly return a Disconnected error, though due to a missed break statement when I split this branch into two cases, it ran away in an infinite loop again.
Err(e) => {
    error!("inner message error: {:?}", e);
    // missing break here
},
Ok(WsMessage::Poison) => {
    error!("got poison message, breaking");
    break;
},
That said, in your code it would have hit the break statement. Trying it again as we speak, and assuming it works for a while, I'll run on your branch without the logging as well, just to be doubly sure I didn't miss something else.
I've rebuilt on that branch and pushed it now. Will see how it goes for the next week or so. Thank you!
...A full week now with no issues, looking very good. Thank you very much to both @jtscuba, for the far superior debugging, and @FelixMcFelix, for sorting out a fix for this and having patience with the little information I could gather. Much appreciated!
Thanks immensely to you both, and my apologies that work's been too hectic for me to seriously investigate. I'll merge it in shortly.
An issue I keep running into.

Information:
- next
- next

What happens:
After running for a while, the bot goes offline and begins to consume the maximum CPU allotted to it. Using perf top, this consumption can be traced back to a mutex and a number of other calls (one called Sleep::new_timeout, from tokio). The memory also spikes a little, but there appears to be no active leaking occurring.

The bot is in around 8,000 guilds now and runs autosharded. It usually goes down once every 12-36 hours. The server it is running on has more than enough memory/CPU available to the bot (average total system load 0.5-2.0 out of 12.0) and only runs other Discord bots and a couple of websites.

My bot that is having this issue is at https://github.com/jellywx/soundfx-rs