noodlefrenzy / node-amqp10

amqp10 is a promise-based, AMQP 1.0 compliant node.js client
MIT License

Decode message failures crash node process #301

Open iamwillbar opened 7 years ago

iamwillbar commented 7 years ago

We're using node-amqp10 3.4.2 in GHCrawler and have noticed that our node process crashes if there's an issue decoding a message. The exception thrown from decodeMessage bubbles up to _processFrameEH, which is an event handler; the exception isn't caught there, so the process exits.

To avoid the process exiting, it would be good if this exception were handled and an 'error' event were raised from node-amqp10.
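Something along these lines inside the receiver link is what we have in mind (a rough sketch only, not amqp10's actual code: the module paths and method names come from the stack trace below, but the argument names, payload field, and normal-path emit are assumptions):

    // Hypothetical patch sketch: catch decode failures in ReceiverLink._messageReceived
    // and surface them as an 'error' event instead of letting the exception escape the
    // frame handler and kill the process.
    var ReceiverLink = require('amqp10/lib/receiver_link');   // path as in the stack trace
    var message = require('amqp10/lib/types/message');        // exposes decodeMessage

    ReceiverLink.prototype._messageReceived = function (transferFrame) {
      var decoded;
      try {
        decoded = message.decodeMessage(transferFrame.payload);  // throws today on bad data
      } catch (err) {
        this.emit('error', err);   // give the application a chance to recover
        return;
      }
      this.emit('message', decoded);   // normal path, greatly simplified
    };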

at Object.module.exports.decodeMessage (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/types/message.js:138:13)
at ReceiverLink._messageReceived (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/receiver_link.js:186:19)
at Session._processTransferFrame (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:358:45)
at Session._processFrame (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:304:63)
at Connection._processFrameEH (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:186:49)
at emitOne (events.js:101:20)
at Connection.emit (events.js:188:7)
at Connection._receiveAny (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:426:12)
at Connection._receiveData (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:357:8)
at TlsTransport.<anonymous> (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:515:38)
at emitOne (events.js:96:13)
at TlsTransport.emit (events.js:188:7)
at TLSSocket.<anonymous> (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/transport/tls_transport.js:29:49)
at emitOne (events.js:96:13)
at TLSSocket.emit (events.js:188:7)
at readableAddChunk (_stream_readable.js:176:18)
jeffmcaffer commented 7 years ago

FYI, another very similar case we just saw (throwing in the event handler) had this stack trace:

AmqpMalformedPayloadError: malformed payload: Unknown code: 0x5c
    at Codec._asMostSpecific (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/codec.js:113:11)
    at Codec.decode (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/codec.js:138:22)
    at Object.module.exports.decodeMessage (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/types/message.js:136:25)
    at ReceiverLink._messageReceived (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/receiver_link.js:186:19)
    at Session._processTransferFrame (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:358:45)
    at Session._processFrame (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:304:63)
    at Connection._processFrameEH (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/session.js:186:49)
    at emitOne (events.js:101:20)
    at Connection.emit (events.js:188:7)
    at Connection._receiveAny (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:426:12)
    at Connection._receiveData (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:357:8)
    at TlsTransport.<anonymous> (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/connection.js:515:38)
    at emitOne (events.js:96:13)
    at TlsTransport.emit (events.js:188:7)
    at TLSSocket.<anonymous> (/home/ospo/prod/ospo-ghcrawler/node_modules/amqp10/lib/transport/tls_transport.js:29:49)
    at emitOne (events.js:96:13)
    at TLSSocket.emit (events.js:188:7)
mbroadst commented 7 years ago

@willbar hmm, do you have the actual hex that caused the error to occur? It's unclear what we should be doing per the spec in this case, since everything is pretty well defined. We could explicitly shut down the connection and emit an error message; however, if we emitted 'error' itself, that would also cause the app to crash if you weren't listening for the event.
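For reference, that's just standard Node EventEmitter behavior: an 'error' emitted with no listener attached is rethrown, so an unhandled 'error' still takes the process down. A minimal illustration:

    var EventEmitter = require('events').EventEmitter;

    // With a listener registered, an 'error' emission is simply delivered and the
    // application can react to it.
    var conn = new EventEmitter();
    conn.on('error', function (err) {
      console.error('recoverable amqp10 error:', err.message);
    });
    conn.emit('error', new Error('malformed payload'));

    // With no 'error' listener, emit('error', ...) throws the error synchronously;
    // if nothing catches it, the process goes down -- the same end result as the
    // uncaught exception today. Uncomment to see the crash:
    // new EventEmitter().emit('error', new Error('malformed payload'));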

@jeffmcaffer that's as strange as @willbar's case... do you have any more info on the raw data that triggered this? It sounds like corruption in the incoming stream, or, equally likely, we aren't properly decoding something and the internal buffer index got corrupted.

The spec indicates that:

When a session is unable to process input, it MUST indicate this by issuing an END with an appropriate error indicating the cause of the problem. It MUST then proceed to discard all incoming frames from the remote endpoint until receiving the remote endpoint's corresponding end frame.

However, in this particular case it would seem like you guys probably aren't even able to process frames anymore, so it's unclear what should be done.

jeffmcaffer commented 7 years ago

Yeah, we were wondering about the best course of action as well. Shutting down the connection and restarting might be the best thing. Crashing the service is less than optimal.

As for the data, there was no data in the dump for the error I saw, just the 0x5c. For the one @willbar reported there is some (somewhere in the dumps), but we don't know what the data is so we can't really send it around. I do know that the problem happened 5-6 times, and in 3-4 of the cases the (large) data dump ended in the same 7d7d7d hex sequence, so it may have been the same data over and over. Interestingly, all the queues we are working on did eventually get to 0, so I'm not sure whether that message eventually got processed or what.

iamwillbar commented 7 years ago

I was actually thinking the behavior of 'error' (crashing the app if you don't specifically handle it) would be good here. As you mention, the correct recovery is unclear, so not handling it should probably crash, but having it as an 'error' gives us the option of handling it and recovering (for example, by tearing down the client and reconstructing it). However, if you think there's a spec-compatible way of recovering, that would be even better, as long as there's some way of communicating to the consumer of the library that the issue occurred.
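If the failure were surfaced as an 'error' event, that tear-down-and-reconstruct recovery could look roughly like this on the consumer side (a sketch under that assumption; the connection string, queue name, and processMessage are placeholders for our application code, not amqp10 APIs):

    var AMQPClient = require('amqp10').Client;

    function processMessage(msg) {
      // application-specific handling would go here
      console.log('received', msg.body);
    }

    function start() {
      var client = new AMQPClient();

      // Assumed behavior: the client emits 'error' on decode failures (the event this
      // issue asks for). On such an error, discard the broken client and build a new one.
      client.on('error', function (err) {
        console.error('amqp10 error, rebuilding client:', err.message);
        client.disconnect().then(start, start);
      });

      return client.connect(process.env.AMQP_URL)                         // placeholder URI
        .then(function () { return client.createReceiver('normal'); })    // placeholder queue
        .then(function (receiver) {
          receiver.on('message', processMessage);
          receiver.on('errorReceived', function (err) {
            console.error('receiver error:', err);
          });
        });
    }

    start();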

jeffmcaffer commented 7 years ago

FWIW, I just got a mess of these. No idea what it means... The system was pushing a fair bit (~100/s) to a queue.

Transfer frame received when no incoming window remaining, should shut down session but for now being tolerant.
[I] Traversed event_trigger@https://api.github.com/repos/Microsoft/pxt/events {"loopName":"3","cid":"74a","time":93}
Sat, 04 Mar 2017 07:41:56 GMT amqp10:link:sender setting credits (216,4916)
Sat, 04 Mar 2017 07:41:56 GMT amqp10:link:sender dispatching pending sends 
Transfer frame received when no incoming window remaining, should shut down session but for now being tolerant.
Transfer frame received when no incoming window remaining, should shut down session but for now being tolerant.
Transfer frame received when no incoming window remaining, should shut down session but for now being tolerant.
jeffmcaffer commented 7 years ago

I caught a hang while debug was on. See the attached log. Note that this is running the code in my fork, which has an extra try/catch around the _sendMessage call in send() and @willbar's changes to debug so it doesn't dump message bodies.
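Roughly, the guard is this shape (a reconstruction from the description above, not the exact diff in the fork; the owning prototype and argument list are assumptions):

    // Assumed sketch of the fork's workaround: catch a synchronous throw from inside
    // send() (including the internal _sendMessage call) and turn it into a rejected
    // promise instead of an uncaught exception.
    var SenderLink = require('amqp10/lib/sender_link');   // internal path, assumed

    var originalSend = SenderLink.prototype.send;
    SenderLink.prototype.send = function (msg, options) {
      try {
        return originalSend.call(this, msg, options);
      } catch (err) {
        return Promise.reject(err);
      }
    };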

In this run there are 10 concurrent loops pulling messages, processing them, and potentially pushing more messages. There are 4 queues (normal, immediate, soon and later) and one subscription (webhookevents/Subscriptions/crawlerprod). These are pulled from on a weighted basis, and most sends go to the normal queue.

Not sure what to make of the credit situation. At some points the credit seems to be negative on some queues (e.g. New values: link(-341), session(138)), and at the end the credit on the subscription is 0.

hang.log.txt

mbroadst commented 7 years ago

@jeffmcaffer whaaaaat, can you provide a full debug (including raw data) for the cases where you are receiving negative link credit? Are you possibly using any custom credit functions in your policy?

jeffmcaffer commented 7 years ago

Here is the full log for that run. Interestingly, looking back at the full log, there is more in it than I first posted. It seems that 10 minutes after the hang, it reattached to the subscription and started getting more messages. I know I kept checking for a while, but I likely gave up and killed the process.

Note this run has an additional "deadletter" queue but otherwise should be as described in the previous comments. I can't provide data payloads; they're just way too big. Some payloads are 200+K (we try to avoid that, but in some cases it is not feasible).

23.zip