mojaloop / project

Repo to track product development issues for the Mojaloop project.

Parsing of message body of messages in the ML Switch, JWS implications #755

Closed elnyry-sam-k closed 5 years ago

elnyry-sam-k commented 5 years ago

Summary: The body parsing in ml-api-adapter is breaking JWS! It does not preserve the body byte-for-byte: a string containing an escaped unicode character is being unescaped, which breaks the JWS spec.

For example Payer sends this message: {"errorInformation":{"errorCode":"5200","errorDescription":"Generic limit error, amount \u0026 payments threshold."}}

Payee receives this message: {"errorInformation":{"errorCode":"5200","errorDescription":"Generic limit error, amount & payments threshold."}} See the escaped unicode sequence in the incoming message, this is allowed as per the JSON RFC.

The issue is that the JSON RFC allows escaped unicode characters, and these get unescaped by the default body parsing mechanism in ml-api-adapter (whatever body parser HAPI uses). The unescaping breaks JWS, as forwarded message bodies must be byte-for-byte identical to the incoming bodies. The only way mojaloop can be sure not to break JWS is to retain the incoming byte stream for forwarding at a later point. This is a mojaloop-wide issue, across the entire surface: ml-api-adapter, all the kafka handlers, etc. The switch is not allowed to alter the bodies of any message it forwards, but it does under certain circumstances, so this will require a fix.

Severity: High

Priority: Critical

Expected Behavior: the message body forwarded by the switch is byte-for-byte identical to the incoming body, so the JWS remains valid.

Steps to Reproduce

Incoming message: {"errorInformation":{"errorCode":"5200","errorDescription":"Generic limit error, amount \u0026 payments threshold."}}

Outgoing message: {"errorInformation":{"errorCode":"5200","errorDescription":"Generic limit error, amount & payments threshold."}}

See the escaped unicode sequence in the incoming message, this is allowed as per the JSON RFC.
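The round trip can be reproduced in a few lines of Node.js (a hypothetical illustration, not the actual adapter code):

```javascript
// Incoming body, exactly as received on the wire (note the \u0026 escape).
const incoming =
  '{"errorInformation":{"errorCode":"5200","errorDescription":' +
  '"Generic limit error, amount \\u0026 payments threshold."}}';

// Default body parsing unescapes "\u0026" to "&".
const parsed = JSON.parse(incoming);

// Re-serializing for the outgoing request emits the literal "&",
// so the forwarded bytes no longer match the signed bytes.
const outgoing = JSON.stringify(parsed);

console.log(incoming === outgoing); // false -> JWS signature breaks
```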


adrianhopebailie commented 5 years ago

Canonical encoding of JSON is a known issue. This is why all JOSE specs require payloads to be base64url encoded and the signatures computed against the raw bytes of that encoding.

My suggestion is to preserve the incoming byte stream (probably as a base64-encoded string) along with the payload in all the kafka messages, and use that to reconstruct the body when making the callback, rather than the "payload" kafka message property that is currently used. That would require a coordinated change to the kafka message "protocol"/structure across all the handlers and ml-api-adapter.

This is the approach taken by JOSE but it also introduces a potential security foot-gun. Implementors need to ensure that they validate the signature against the binary version of the data but ALSO ensure that the binary data matches the decoded data used for processing. Just something to be aware of.

The current Signature spec does 90% of JWT but then throws away the base64url encoded payload.

It puts the following header into the request:

{
  "signature": "dz2ntyS0_rDyA0pLeWluG--tBcYYrlvG99ffkXcEBuve5Qzvzyn0ZUi82J7h17RsdfHPuTnbEGvCeU9Y4Bg0nIZHGL4icswaaO09T5hPPYKBTzVQeHkokLmL4dXpHdr1ggSEpu3WEU3nfgOFGGAdOq355i1iGuDbhqm_lSfVHaqdVCEhkJ2Y_r2glO2QpdZrcbvsBV39derj_PlfISBBGjdh0dIPxnFIVcZuPHiq9Ha2MslrBHfqwFfNeU_xhErBd2PywkDQJbKOlfqdkmFC9bS8Ofx0O6Mg7qdFGwQkseJTfp0HMbH1d9e6H0cocY8xfuDNGaZpOJhxiYtiPLg",
  "protectedHeader": "eyJhbGciOiJSUzI1NiIsIkZTUElPUC1EZXN0aW5hdGlvbiI6IjU2NzgiLCJGU1BJT1AtVVJJIjoiL3F1b3RlcyIsIkZTUElPUC1IVFRQLU1ldGhvZCI6IlBPU1QiLCJEYXRlIjoiVHVlLCAyMyBNYXkgMjAxNyAyMToxMjozMSBHTVQiLCJGU1BJT1AtU291cmNlIjoiMTIzNCJ9"
}

If it followed the JWT spec fully, this header would simply be a JWT (i.e. the protected header, payload, and signature, each base64url encoded and separated by a `.`).

Something like:

eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsIkZTUElPUC1EZXN0aW5hdGlvbiI6IjU2NzgiLCJGU1BJT1AtVVJJIjoiL3F1b3RlcyIsIkZTUElPUC1IVFRQLU1ldGhvZCI6IlBPU1QiLCJEYXRlIjoiVHVlLCAyMyBNYXkgMjAxNyAyMToxMjozMSBHTVQiLCJGU1BJT1AtU291cmNlIjoiMTIzNCJ9.eyJhbW91bnQiOnsiYW1vdW50IjoiMTUwIiwiY3VycmVuY3kiOiJVU0QifSwidHJhbnNhY3Rpb25UeXBlIjp7InNjZW5hcmlvIjoiVFJBTlNGRVIiLCJpbml0aWF0b3IiOiJQQVlFUiIsInN1YlNjZW5hcmlvIjoiUDJQIFRyYW5zZmVyIGFjcm9zcyBNTSBzeXN0ZW1zIiwiaW5pdGlhdG9yVHlwZSI6IkNPTlNVTUVSIn0sInRyYW5zYWN0aW9uSWQiOiIzNjYyOWE1MS0zOTNhLTRlM2MtYjM0Ny1jMmNiNTdlMWUxZmMiLCJxdW90ZUlkIjoiNTllMzMxZmEtMzQ1Zi00NTU0LWFhYzgtZmNkODgzM2Y3ZDUwIiwiZXhwaXJhdGlvbiI6IjIwMTctMDUtMjRUMDg6NDA6MDAuMDAwLTA0OjAwIiwicGF5ZXIiOnsicGVyc29uYWxJbmZvIjp7ImRhdGVPZkJpcnRoIjoiMTk4Ni0wMi0xNCIsImNvbXBsZXhOYW1lIjp7Im1pZGRsZU5hbWUiOiJCZW4iLCJMYXN0TmFtZSI6IkxlZSIsImZpcnN0TmFtZSI6IkJpbGwifX0sIm5hbWUiOiJCaWxsIExlZSIsInBhcnR5SWRJbmZvIjp7ImZzcElkIjoiMTIzNCIsInBhcnR5SWRUeXBlIjoiTVNJU0ROIiwicGFydHlTdWJJZE9yVHlwZSI6IlJlZ2lzdGVyZWRDdXN0b21lciIsInBhcnR5SWRlbnRpZmllciI6IjE2MTM1NTUxMjEyIn19LCJwYXllZSI6eyJwYXJ0eUlkSW5mbyI6eyJmc3BJZCI6IjU2NzgiLCJwYXJ0eUlkVHlwZSI6Ik1TSVNETiIsInBhcnR5SWRlbnRpZmllciI6IjE1Mjk1NTU4ODg4In19LCJmZWVzIjp7ImFtb3VudCI6IjEuNSIsImN1cnJlbmN5IjoiVVNEIn0sImV4dGVuc2lvbkxpc3QiOlt7InZhbHVlIjoidmFsdWUxIiwia2V5Ijoia2V5MSJ9LHsidmFsdWUiOiJ2YWx1ZTIiLCJrZXkiOiJrZXkyIn0seyJ2YWx1ZSI6InZhbHVlMyIsImtleSI6ImtleTMifV0sIm5vdGUiOiJ0aGlzIGlzIGEgc2FtcGxlIGZvciBQT1NUIC9xdW90ZXMiLCJnZW9Db2RlIjp7ImxvbmdpdHVkZSI6IjEyNS41MjAwMDEiLCJsYXRpdHVkZSI6IjU3LjMyMzg4OSJ9LCJhbW91bnRUeXBlIjoiUkVDRUlWRSJ9.Xfs9A2tB2XDIUqBfewt8L19y4rD3NyLFJi-y2tnOUcy9vpz8WSXW-3yQU7RiOdjmSYhcpaM6wSVssaimgCSa-ufuFc1eUQEUdB-kiEkh9-_qZHJXOKQRDRV8cz5QoIMsFnGv8sddE2jwE79qOyOblkbOsoaC-yeWb7OQV2i2AaagcpINbInuo9sUVIXzNLso-WqGBHLyZzjw9U0X57TboR076Mjj2W0k8Xk5uz7z2aLBh_IN5yEYJaEXVw0XgekTHpMmZE3R9Ji1hbwf3k63vtaYYGdXClxHTF8Jkbzkm-ESBt61AlZH2IbzuzBEGlV-VAdXFwR-V5SfOOHWVl72PA

See on JWT.io

One thing to note is that we MAY hit limits on header sizes. The HTTP spec doesn't specify limits, but some servers do implement them (the lowest I am aware of is 8 KB), so we're still well within that limit.

mdebarros commented 5 years ago

Currently the ML-API-Adapter parses the incoming message into a JSON Object via the Hapijs library, which is then sent to the node-rdkafka streaming library, and subsequently, each handler either decodes or re-encodes the message from kafka using the same node-rdkafka library.

Messages are being stored on Kafka based on the Lime protocol (https://github.com/mojaloop/central-services-stream/blob/master/src/kafka/protocol/readme.md).

The headers & payload are stored in the content field as per the following example:

{
    "id": "f2f038cc-b749-464d-a364-c24acad58ef0",
    "to": "mockfsp02",
    "from": "mockfsp01",
    "type": "application/json",
    "content": {
        "headers": {
            "accept": "application/vnd.interoperability.transfers+json;version=1",
            "content-type": "application/vnd.interoperability.transfers+json;version=1",
            "date": "2019-01-22T21:27:55.000Z",
            "fspiop-source": "mockfsp01",
            "fspiop-destination": "mockfsp02",
            "content-length": 437
        },
        "payload": {
            "transferId": "f2f038cc-b749-464d-a364-c24acad58ef0",
            "payerFsp": "mockfsp01",
            "payeeFsp": "mockfsp02",
            "amount": {
                "amount": "1",
                "currency": "USD"
            },
            "ilpPacket": "eyJxdW90ZUlkIjoiZThmYzFlYzEtZTM5Yy00NDBjLWJjNmQtZjg1NjM2MzYwMTBiIiwidHJhbnNhY3Rpb25JZCI6ImU4ZmMxZWMxLWUzOWMtNDQwYy1iYzZkLWY4NTYzNjM2MDEwYiJ9",
            "condition": "ta5inyYarm7thvY3DPHZgSDK0ml_kuGOPi-qhiF1SCs",
            "expiration": "2019-12-10T12:07:47.349Z"
        }
    },
    "metadata": {
        "event": {
            "id": "25240fa4-da6a-4f18-8b42-e391fde70817",
            "type": "prepare",
            "action": "prepare",
            "createdAt": "2019-05-06T08:53:16.996Z",
            "state": {
                "status": "success",
                "code": 0
            }
        }
    }
}

An additional field payloadType could be added under the content field to provide the following information:

  1. content type <-- the original representation of the data stored based on Mime type
  2. transfer encoding <-- the encoding used for the message transport which is an additional parameter

payloadType = '<Content-Type>;encode=<Transfer-Encoding>'
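A possible parser for such a value (a hypothetical helper, not part of the current codebase):

```javascript
// Splits e.g. 'application/json;encode=base64' into its two parts:
// the original MIME content type and the transfer encoding.
function parsePayloadType(payloadType) {
  const [contentType, ...params] = payloadType.split(';').map((s) => s.trim());
  const encodeParam = params.find((p) => p.startsWith('encode='));
  return {
    contentType,
    encoding: encodeParam ? encodeParam.slice('encode='.length) : 'none'
  };
}
```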

Here is an example where the payload contains the above JSON which has been encoded as base64:

{
    "id": "f2f038cc-b749-464d-a364-c24acad58ef0",
    "to": "mockfsp02",
    "from": "mockfsp01",
    "type": "application/json",
    "content": {
        "headers": {
            "accept": "application/vnd.interoperability.transfers+json;version=1",
            "content-type": "application/vnd.interoperability.transfers+json;version=1",
            "date": "2019-01-22T21:27:55.000Z",
            "fspiop-source": "mockfsp01",
            "fspiop-destination": "mockfsp02",
            "content-length": 437
        },
        "payload": "ewogICAgICAgICAgICAidHJhbnNmZXJJZCI6ICJmMmYwMzhjYy1iNzQ5LTQ2NGQtYTM2NC1jMjRhY2FkNThlZjAiLAogICAgICAgICAgICAicGF5ZXJGc3AiOiAibW9ja2ZzcDAxIiwKICAgICAgICAgICAgInBheWVlRnNwIjogIm1vY2tmc3AwMiIsCiAgICAgICAgICAgICJhbW91bnQiOiB7CiAgICAgICAgICAgICAgICAiYW1vdW50IjogIjEiLAogICAgICAgICAgICAgICAgImN1cnJlbmN5IjogIlVTRCIKICAgICAgICAgICAgfSwKICAgICAgICAgICAgImlscFBhY2tldCI6ICJleUp4ZFc5MFpVbGtJam9pWlRobVl6RmxZekV0WlRNNVl5MDBOREJqTFdKak5tUXRaamcxTmpNMk16WXdNVEJpSWl3aWRISmhibk5oWTNScGIyNUpaQ0k2SW1VNFptTXhaV014TFdVek9XTXRORFF3WXkxaVl6WmtMV1k0TlRZek5qTTJNREV3WWlKOSIsCiAgICAgICAgICAgICJjb25kaXRpb24iOiAidGE1aW55WWFybTd0aHZZM0RQSFpnU0RLMG1sX2t1R09QaS1xaGlGMVNDcyIsCiAgICAgICAgICAgICJleHBpcmF0aW9uIjogIjIwMTktMTItMTBUMTI6MDc6NDcuMzQ5WiIKfQ==",
        "payloadType": "application/json;encode=base64"
    },
    "metadata": {
        "event": {
            "id": "25240fa4-da6a-4f18-8b42-e391fde70817",
            "type": "prepare",
            "action": "prepare",
            "createdAt": "2019-05-06T08:53:16.996Z",
            "state": {
                "status": "success",
                "code": 0
            }
        }
    }
}

payloadType examples:

Note that the performance overhead of decoding/encoding should also be taken into consideration here. I.e. UTF-8 encode/decode is probably more efficient than base64, since we can take the string and directly parse it into a JSON object as required, whereas base64 requires an additional decode.

This would ensure that the payload could then be validated against the JWS Signature as normal.

We could enhance the existing central-services-stream library to provide the capability to encode/decode the payload based on the payloadType. This would be a minimal change to the existing code-base if this solution is accepted.
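The encode/decode capability could look something like this (a sketch under the proposed payloadType convention, not the actual central-services-stream API):

```javascript
// Store the raw body bytes base64 encoded, tagged with the proposed payloadType.
function encodePayload(rawBody, contentType) {
  return {
    payload: Buffer.from(rawBody).toString('base64'),
    payloadType: `${contentType};encode=base64`
  };
}

// Recover the original bytes intact; callers parse them into an object
// only when needed, keeping the byte stream available for forwarding.
function decodePayload({ payload, payloadType }) {
  return payloadType.endsWith(';encode=base64')
    ? Buffer.from(payload, 'base64')
    : Buffer.from(payload, 'utf8');
}
```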

Let me know what you guys think.

mdebarros commented 5 years ago

Alternatively, we could encode the complete content field and move the encode parameter up to the `type` field, as per the following example:

{
    "id": "f2f038cc-b749-464d-a364-c24acad58ef0",
    "to": "mockfsp02",
    "from": "mockfsp01",
    "type": "application/json;encode=base64",
    "content": "ewogICAgICAgICJoZWFkZXJzIjogewogICAgICAgICAgICAiYWNjZXB0IjogImFwcGxpY2F0aW9uL3ZuZC5pbnRlcm9wZXJhYmlsaXR5LnRyYW5zZmVycytqc29uO3ZlcnNpb249MSIsCiAgICAgICAgICAgICJjb250ZW50LXR5cGUiOiAiYXBwbGljYXRpb24vdm5kLmludGVyb3BlcmFiaWxpdHkudHJhbnNmZXJzK2pzb247dmVyc2lvbj0xIiwKICAgICAgICAgICAgImRhdGUiOiAiMjAxOS0wMS0yMlQyMToyNzo1NS4wMDBaIiwKICAgICAgICAgICAgImZzcGlvcC1zb3VyY2UiOiAibW9ja2ZzcDAxIiwKICAgICAgICAgICAgImZzcGlvcC1kZXN0aW5hdGlvbiI6ICJtb2NrZnNwMDIiLAogICAgICAgICAgICAiY29udGVudC1sZW5ndGgiOiA0MzcKICAgICAgICB9LAogICAgICAgICJwYXlsb2FkIjogewogICAgICAgICAgICAidHJhbnNmZXJJZCI6ICJmMmYwMzhjYy1iNzQ5LTQ2NGQtYTM2NC1jMjRhY2FkNThlZjAiLAogICAgICAgICAgICAicGF5ZXJGc3AiOiAibW9ja2ZzcDAxIiwKICAgICAgICAgICAgInBheWVlRnNwIjogIm1vY2tmc3AwMiIsCiAgICAgICAgICAgICJhbW91bnQiOiB7CiAgICAgICAgICAgICAgICAiYW1vdW50IjogIjEiLAogICAgICAgICAgICAgICAgImN1cnJlbmN5IjogIlVTRCIKICAgICAgICAgICAgfSwKICAgICAgICAgICAgImlscFBhY2tldCI6ICJleUp4ZFc5MFpVbGtJam9pWlRobVl6RmxZekV0WlRNNVl5MDBOREJqTFdKak5tUXRaamcxTmpNMk16WXdNVEJpSWl3aWRISmhibk5oWTNScGIyNUpaQ0k2SW1VNFptTXhaV014TFdVek9XTXRORFF3WXkxaVl6WmtMV1k0TlRZek5qTTJNREV3WWlKOSIsCiAgICAgICAgICAgICJjb25kaXRpb24iOiAidGE1aW55WWFybTd0aHZZM0RQSFpnU0RLMG1sX2t1R09QaS1xaGlGMVNDcyIsCiAgICAgICAgICAgICJleHBpcmF0aW9uIjogIjIwMTktMTItMTBUMTI6MDc6NDcuMzQ5WiIKICAgICAgICB9CiAgICB9",
    "metadata": {
        "event": {
            "id": "25240fa4-da6a-4f18-8b42-e391fde70817",
            "type": "prepare",
            "action": "prepare",
            "createdAt": "2019-05-06T08:53:16.996Z",
            "state": {
                "status": "success",
                "code": 0
            }
        }
    }
}

With the content field containing the following once it has been decoded:

"content": {
    "headers": <headers as received by hapijs>,
    "payload": <raw payload as received by hapijs>
}

I feel like my initial suggested approach is more explicit and would remove any confusion on how to handle the contents, and how the internal payload field should be handled.

adrianhopebailie commented 5 years ago

Note that the performance overhead of decoding/encoding should also be taking into consideration here

I doubt this will be an issue but even if it is I would suggest getting this working first before worrying about optimising the encoding/decoding.

In terms of the suggestions:

If we follow your second suggestion, content will always be raw binary data encoded using base64. If you are just going to put binary data in the LIME envelope, why not encode the whole raw HTTP request/response as the content (i.e. don't parse the headers and body and put them in a new JSON object)? My guess is that this won't be trivial using any HTTP libraries on Node and you will need to use raw TCP sockets, so that may be a good reason, but it seems pointless to parse the HTTP request/response only to re-encode it as a binary blob.

For that reason I prefer your first suggestion, however it would be useful to have some explicit header normalisation rules (I didn't see any in the Signature spec) to ensure that the decoded headers are still used correctly when validating the signature.

Then I would do the following:

  1. Treat all HTTP request/response bodies as binary data and base64url encode these as the value of content.
  2. Look at the content-type header to determine the MIME type of the payload rather than repeat it in a payloadType property.
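Sketched with hypothetical values, suggestion (1) amounts to the following:

```javascript
// The raw HTTP body bytes, exactly as received, become the content value.
const rawBody = Buffer.from(
  '{"errorInformation":{"errorCode":"5200","errorDescription":' +
  '"Generic limit error, amount \\u0026 payments threshold."}}'
);
const content = rawBody.toString('base64url');

// The callback body is later rebuilt from the exact original bytes, so
// the JWS stays valid; the MIME type still comes from the content-type header.
const restored = Buffer.from(content, 'base64url');
console.log(restored.equals(rawBody)); // true
```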
mdebarros commented 5 years ago

Note that the performance overhead of decoding/encoding should also be taking into consideration here

I doubt this will be an issue but even if it is I would suggest getting this working first before worrying about optimising the encoding/decoding.

Agreed.

I prefer your first suggestion, however it would be useful to have some explicit header normalisation rules (I didn't see any in the Signature spec) to ensure that the decoded headers are still used correctly when validating the signature.

That is correct. I do not know of any header normalization rules in the spec other than what is validated against the API specification explicitly, nor do the headers form part of the calculated signature (other than the header that stores the signature itself). Do you have an example of how we could bridge this gap?

Then I would do the following:

  1. Treat all HTTP request/response bodies as binary data and base64url encode these as the value of content.

Just to clarify, do you mean encode the entire value of the content field or each of the fields within the content separately (i.e. headers & payload)?

  1. Look at the content-type header to determine the MIME type of the payload rather than repeat it in a payloadType property.

Fair enough. If your 1st point above applies to both the headers and payload fields, then I would suggest that this field is more appropriately named encodeType, which would be applicable to both fields.

bushjames commented 5 years ago

All, I think it would be a mistake to prematurely generalise. Let us not worry about building a solution to any possible incoming MIME type. The API specifies UTF-8 encoded JSON only and that is highly unlikely to change any time soon. Let us worry about fixing the situation for the current spec and stack first.

We have a node.js front-end that can be configured to not parse incoming payloads using the default node JSON parser and simply give our code the raw bytes. So, we could stop parsing the bodies in the HAPI framework and do it explicitly in our code; that gives us more control of the parsing. It also allows us to store the bytestream intact before parsing.
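In Hapi this is route-level configuration; roughly (a sketch, assuming Hapi's documented payload options, with a hypothetical handler helper):

```javascript
// Route options telling hapi to hand over the raw bytes instead of parsing.
const routeOptions = {
  payload: {
    parse: false,  // do not run the default JSON body parser
    output: 'data' // deliver the body as a Buffer in request.payload
  }
};

// Parsing then becomes explicit in our code, and the byte stream survives.
function handleRawBody(rawPayload /* Buffer */) {
  const body = JSON.parse(rawPayload.toString('utf8')); // for processing
  return { rawPayload, body }; // rawPayload kept for byte-identical forwarding
}
```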

All that may be academic, however, as @adrianhopebailie raises a very serious security concern. Storing and re-transmitting the incoming bytestream opens the door to us possibly passing on a message that is not the same as our parsed interpretation. If our parser has such a flaw then we are already very broken indeed, but at least the eventual receiver would know, given that our current approach of reconstructing the outgoing message as JSON from our js objects would invalidate the signature in this situation.

If my interpretation of @millerabel 's recent observations is correct, he suggests a change to the spec to eliminate the cases that cause signature breaks during payload reconstruction e.g. whitespace, ordering of lists and object properties, and unicode escaping. I suggest we give this suggestion a significant amount of credence and consideration as I believe it is the only solution that prevents a possible parser defect being exploited.

lewisdaly commented 5 years ago

It looks like we could also build our own body parser middleware if needed, or take this one: https://github.com/nasa8x/hapi-bodyparser and customize it as needed.

vgenev commented 5 years ago

elnyry-sam-k commented 5 years ago

This is now implemented and deployed on AWS. To keep it manageable, closing this out - for any new issue or regression, please log a new bug