MTOM in SOAP in production

RasmusThernoe commented 5 years ago

We are experiencing: "Unable to get certificate from dom"

This is the exact same issue as we experienced in GitHub issue https://github.com/scandihealth/lpr3-docs/issues/48 in the test environment.

Our request was sent at: 2019-01-24 12:30:09,956

TueCN commented 5 years ago

@RasmusThernoe I don't know if this is your request, but I just saw the following exception in our logs:

2019-01-24 13:52:16,932 WARNING [...] ID kortet har timeout.

This fault occurs when IssueInstant + TimeOut < now, i.e. the ID card has timed out.
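For illustration, the expiry condition can be sketched as below. This is only a minimal sketch: the function name, the 30-minute timeout and the timestamps are hypothetical and are not taken from the service or from the SOSI library.

```python
from datetime import datetime, timedelta, timezone


def id_card_expired(issue_instant, timeout, now=None):
    """Return True when IssueInstant + TimeOut < now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return issue_instant + timeout < now


# Hypothetical values, for illustration only.
issued = datetime(2019, 1, 24, 12, 0, 0, tzinfo=timezone.utc)
timeout = timedelta(minutes=30)
checked_at = datetime(2019, 1, 24, 12, 52, 16, tzinfo=timezone.utc)

print(id_card_expired(issued, timeout, checked_at))
```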

RasmusThernoe commented 5 years ago

We have not sent any requests since around 12:30.

That sounds like the Capital Region, who have had problems with the ID card timing out.

RasmusThernoe commented 5 years ago

Any news?

We need to get reports to LPR3 today as we need the accumulated error list tonight.

TueCN commented 5 years ago

I'm not sure the cause of the error message "Unable to get certificate from dom" is the same as in #48. Our signature validation in production is exactly the same as on https://lprws-test.sds.dsdn.dk

Can you get one of your requests signed by both STS test and STS prod and compare the two outbound messages to see if there are any glaring differences? Is your formatting/processing of the signed requests identical for prod and test?

Maybe this stack trace from our log can help you identify the issue?

dk.sosi.seal.model.ModelException: Unable to get certificate from dom
        at dk.sosi.seal.model.SignatureUtil.resolveCertificate(SignatureUtil.java:695)
...
Caused by: org.apache.xml.security.keys.keyresolver.KeyResolverException: Could not parse certificate: java.io.IOException: Invalid BER/DER data (too huge?)
Original Exception was java.security.cert.CertificateException: Could not parse certificate: java.io.IOException: Invalid BER/DER data (too huge?)
        at org.apache.xml.security.keys.keyresolver.implementations.X509CertificateResolver.engineLookupResolveX509Certificate(X509CertificateResolver.java:110)
        at org.apache.xml.security.keys.KeyInfo.applyCurrentResolver(KeyInfo.java:870)
        at org.apache.xml.security.keys.KeyInfo.getX509CertificateFromStaticResolvers(KeyInfo.java:853)
        at org.apache.xml.security.keys.KeyInfo.getX509Certificate(KeyInfo.java:819)
        at dk.sosi.seal.model.SignatureUtil.resolveCertificate(SignatureUtil.java:693)

RasmusThernoe commented 5 years ago

We will compare the two outbound messages for glaring differences.

We can provide the input messages if you have a secure channel?

The input messages contain both patient data and the production certificate.

JacobBangSSE commented 5 years ago

I guess the error comes from parsing the X509Certificate part of the message. However, I captured the message we are trying to send (using ncat) and extracted the certificate part of the output, and openssl parses it as DER without any problems.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 1457554774 (0x56e08556)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=DK, O=TRUST2408, CN=TRUST2408 OCES CA II
        Validity
            Not Before: May 12 08:55:19 2016 GMT
            Not After : May 12 08:55:02 2019 GMT
        Subject: C=DK, O=Sundhedsdatastyrelsen // CVR:33257872/serialNumber=CVR:33257872-FID:55008930, CN=SOSI Federation 2 (funktionscertifikat)
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
            RSA Public Key: (2048 bit)
                Modulus (2048 bit):
                    00:90:51:f5:fe:23:2d:bf:8d:e7:d3:05:ac:37:82:
                    ab:e9:fd:f2:34:b6:a0:90:64:38:ec:f1:fa:46:b3:
                    58:08:67:08:5a:ff:45:78:53:d5:54:79:c9:76:fc:
                    3d:41:db:65:c6:3a:76:09:67:05:9c:b4:de:f3:92:
                    4f:0a:44:6d:bc:07:6e:33:d5:0d:3f:7e:b9:ce:06:
                    d1:2b:5e:93:25:5b:d2:02:24:8f:ed:b1:da:eb:59:
                    da:eb:7a:1e:34:5f:d8:2b:68:af:8a:d0:4a:d8:b3:
                    80:96:6d:db:63:03:34:83:1c:55:09:56:ff:ca:63:
                    92:25:86:ed:bc:f2:91:4b:d9:e4:77:2a:e5:1b:ef:
                    62:15:13:41:a9:eb:22:cb:a6:f0:87:19:44:1e:19:
                    bf:96:93:2f:a0:c6:00:f1:11:3c:ed:d4:b8:14:e8:
                    97:b0:76:c1:41:c8:3f:bc:28:c0:d2:04:e7:2f:85:
                    57:11:11:1a:df:bc:36:56:5f:77:84:56:e4:fd:35:
                    c5:9e:0c:5f:6c:34:25:b0:a6:10:4a:f3:07:d5:f8:
                    c5:a6:44:71:60:1f:a6:50:ba:69:a6:7d:8b:6e:98:
                    f5:c5:a9:41:59:8a:16:a8:d0:72:86:fc:28:61:d7:
                    4d:58:05:c8:4a:0a:5e:90:b7:e2:30:82:69:f9:b5:
                    7d:7b
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Data Encipherment, Key Agreement
            Authority Information Access:
                OCSP - URI:http://ocsp.ica02.trust2408.com/responder
                CA Issuers - URI:http://f.aia.ica02.trust2408.com/oces-issuing02-ca.cer

            X509v3 Certificate Policies:
                Policy: 1.2.208.169.1.1.1.4.2
                  CPS: http://www.trust2408.com/repository
                  User Notice:
                    Organization: TRUST2408
                    Number: 1
                    Explicit Text: For anvendelse af certifikatet gælder OCES vilkår, CPS og OCES CP, der kan hentes fra www.trust2408.com/repository. Bemærk, at TRUST2408 efter vilkårene har et begrænset ansvar ift. professionelle parter.

            X509v3 CRL Distribution Points:
                URI:http://crl.ica02.trust2408.com/ica02.crl
                DirName:/C=DK/O=TRUST2408/CN=TRUST2408 OCES CA II/CN=CRL4079

            X509v3 Authority Key Identifier:
                keyid:99:8F:BA:0D:89:AE:21:1A:42:7A:0A:AE:1A:4C:4E:22:FF:10:EB:8C

            X509v3 Subject Key Identifier:
                82:21:F4:6B:02:19:DE:7F:61:03:4F:DC:30:9C:24:CC:A8:19:A3:D2
            X509v3 Basic Constraints:
                CA:FALSE
    Signature Algorithm: sha256WithRSAEncryption
        7c:6a:bc:17:2b:14:42:e7:73:26:63:e9:86:6f:a6:9c:e4:e0:
        df:fa:48:06:60:e3:b9:d9:38:c6:88:a3:81:5f:03:dc:7c:17:
        ff:2a:79:86:49:60:74:dd:77:e7:bd:c2:23:c4:01:f6:ee:21:
        d9:84:aa:ad:0d:2d:59:4b:67:86:6d:8e:36:82:c9:04:ca:5c:
        f4:d2:ca:44:84:e0:a5:21:c3:6e:2b:c8:d9:5e:9d:dd:38:9b:
        9b:c8:bb:39:5e:94:82:2b:02:a0:03:15:38:b8:5a:31:9c:ba:
        30:04:3f:b3:da:4f:9d:df:6b:de:b0:49:46:2c:9b:a2:49:a7:
        c1:a2:3a:e3:28:08:62:66:a0:90:ce:de:f1:b9:7e:50:8f:22:
        46:b2:e5:7c:2b:63:d6:75:74:ee:a3:35:75:60:aa:19:54:02:
        4e:5a:4b:2b:89:aa:3b:56:35:62:7a:17:4e:61:fd:7d:e0:a2:
        d5:43:ab:dc:d1:6c:0e:4a:2e:54:7a:dc:15:ad:ab:63:3d:0e:
        44:e4:93:99:f4:24:13:dd:00:f5:ff:d3:a8:70:31:e9:f4:4c:
        ed:b0:fc:05:53:17:21:e2:88:44:da:73:39:69:5f:03:e3:71:
        1c:be:2c:55:70:f0:5f:8b:3a:9a:c6:33:d1:84:68:e1:de:b5:
        e1:11:97:56

I would really like to share a TCP dump, but it contains data that should not be posted on a public GitHub issue. Do you have a secure channel where we can send the dump?

TueCN commented 5 years ago

Hmm we could enable "raw" logging of all requests to the service for a brief period to catch your request and see what it looks like when we receive it, but I don't know if that would violate GDPR or other rules and regulations.

Your requests must differ somehow from the requests of the other reporters, since they do not have the same problem (which is not to say that your requests are non-compliant).

I am afraid I can't provide much assistance on debugging this issue in prod tonight. I could work on giving you access to our internal load-balanced test environment in DXC (should be reachable from Sundhedsdatanettet) if you would like to try getting a request accepted there (it must be signed by STS test). There we can also enable logging with impunity, and you can send scrambled/fake data.

jonigkeit commented 5 years ago

We suggest you create a submission but with test patient data. You sign the request as you do in production, i.e. with production STS. Instead of sending it to the production endpoint of lpr you send it to test, i.e. https://lprws-test.sds.dsdn.dk/cda-ws/DocumentRepository_Service/PatientHealthcareValidateReportingService

TueCN commented 5 years ago

If you would like to transfer one of your prod signed requests that are rejected, you can upload it to LPR's FTP server (which is secured).

We have created a new folder named "gh288" in the "LPRRM" users root folder, which the user should have write access to. Is this an acceptable secure channel?

JacobBangSSE commented 5 years ago

We have now sent the same message (extracted from production) to the test service (PatientHealthcareReportingService) using SoapUI. So the message is signed and contains the production STSIDCARD, but is sent to the test service.

The timestamp for this attempt is: Fri, 25 Jan 2019 10:11:32 GMT (so 11:11:32 in Denmark)

I got the following response from the service:

<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
   <env:Header xmlns:env="http://www.w3.org/2003/05/soap-envelope"/>
   <soap:Body>
      <soap:Fault>
         <soap:Code>
            <soap:Value>soap:Receiver</soap:Value>
         </soap:Code>
         <soap:Reason>
            <soap:Text xml:lang="en">The certificate that signed the security token is not trusted!</soap:Text>
         </soap:Reason>
      </soap:Fault>
   </soap:Body>
</soap:Envelope>

So it seems the test service understands the certificate part of the message. But I cannot rule out that SoapUI is doing something to how we send the data that makes a difference.

I am now trying to send the raw message from production to the test endpoint using a combination of stunnel and ncat. This is going to take some time, since I have some technical problems with my setup.

jonigkeit commented 5 years ago

@RasmusThernoe and @JacobBangSSE We would very much appreciate if you uploaded your request to the aforementioned folder on the SFTP server, so that we can work concurrently on resolving this issue.

finnha commented 5 years ago

I have uploaded the NCAT dump to the folder on the SFTP server.

jonigkeit commented 5 years ago

We have started logging requests in production. Please try to submit another request

finnha commented 5 years ago

Now I have submitted a new request

TueCN commented 5 years ago

We have reproduced the "Unable to get certificate from dom" error with the request uploaded to FTP. Sending the exact same request to https://lprws-test.sds.dsdn.dk yields the exact same error response.

We are now verifying the test method by checking that one of our own generated requests does not yield the "Unable to get certificate from dom" error when sent in the same way.

However, we also just saw a new type of error in our logs we haven't seen before at all: 2019-01-25 12:41:13,154 WARNING [...]: Couldn't find MIME boundary: --_NextPart_000_0002_01C3E1CC.3BB37320

Is this from one of your requests or someone else?

finnha commented 5 years ago

My requests were from 12:42:xx. I don't think the error is from my requests.

TueCN commented 5 years ago

There were many warnings of this type between 12:41 and 12:44:

2019-01-25 12:44:27,674 WARNING [...]: Couldn't find MIME boundary: --_NextPart_000_0002_01C3E1CC.3BB37320

You should probably see a similar error in your responses? We are not actively pursuing this new warning now, it was just to let you know that there is a new type of error we haven't seen before.

TueCN commented 5 years ago

@JacobBangSSE @RasmusThernoe we have now verified our test method. We do not get "Unable to get certificate from dom" when sending one of our own generated requests in the same way (via curl).

This implies that the requests you generate for prod must be different in some way to the requests you generate for test (https://lprws-test.sds.dsdn.dk) as you normally have no problem getting requests accepted on test.

We are in the process of investigating the differences between our and your request to see what could cause the issue.

TueCN commented 5 years ago

Can you attach here a dump of one of your "test" requests that you normally send to https://lprws-test.sds.dsdn.dk and which does not give the "Unable to get certificate from dom" error?

RasmusThernoe commented 5 years ago

We will perform that task.

Since we have real patient data in our "outbox" we would like to use the endpoints that validate without persisting.

First we will send our request to the test endpoint using a test certificate to:

https://lprws-test.sds.dsdn.dk/cda-ws/DocumentRepository_Service/PatientHealthcareValidateReportingService

Then we will send our request to the production endpoint using a production certificate to:

https://lprws.sds.dsdn.dk/cda-ws/DocumentRepository_Service/PatientHealthcareValidateReportingService

Will this give you the right comparison?

Do you have logging applied to both endpoints?

jonigkeit commented 5 years ago

When we inline the XOP-included certificate, we get the following invalid PEM:

MIIGHjCCBQagAwIBAgIEVuCFVjAKBgkqhkiG9woBAQsFADBAMQswCQYDVQQGEwJE
SzESMBAGA1UECgwJVFJVU1QyNDA4MR0wGwYDVQQDDBRUUlVTVDI0MDggT0NFUyBD
QSBJSTAeFwoxNjA1MTIwODU1MTlaFwoxOTA1MTIwODU1MDJaMIGRMQswCQYDVQQG
EwJESzEuMCwGA1UECgwlU3VuZGhlZHNkYXRhc3R5cmVsc2VuIC8vIENWUjozMzI1
Nzg3MjFSMCAGA1UEBRMZQ1ZSOjMzMjU3ODcyLUZJRDo1NTAwODkzMDAuBgNVBAMM
J1NPU0kgRmVkZXJhdGlvbiAyIChmdW5rdGlvbnNjZXJ0aWZpa2F0KTCCASIwCgYJ
KoZIhvcKAQEBBQADggEPADCCAQoCggEBAJBR9f4jLb+N59MFrDeCq+n98jS2oJBk
OOzx+kazWAhnCFr/RXhT1VR5yXb8PUHbZcY6dglnBZy03vOSTwpEbbwHbjPVCj9+
uc4G0StekyVb0gIkj+2x2utZ2ut6HjRf2Ctor4rQStizgJZt22MDNIMcVQlW/8pj
kiWG7bzykUvZ5Hcq5RvvYhUTQanrIsum8IcZRB4Zv5aTL6DGAPERPO3UuBTol7B2
wUHIP7wowNIE5y+FVxERGt+8NlZfd4RW5P01xZ4MX2w0JbCmEErzB9X4xaZEcWAf
plC6aaZ9i26Y9cWpQVmKFqjQcob8KGHXTVgFyEoKXpC34jCCafm1fXsCAwEAAaOC
AswwggLIMA4GA1UdDwEB/wQEAwIDuDCBiQYIKwYBBQUHAQEEfTB7MDUGCCsGAQUF
BzABhilodHRwOi8vb2NzcC5pY2EwMi50cnVzdDI0MDguY29tL3Jlc3BvbmRlcjBC
BggrBgEFBQcwAoY2aHR0cDovL2YuYWlhLmljYTAyLnRydXN0MjQwOC5jb20vb2Nl
cy1pc3N1aW5nMDItY2EuY2VyMIIBQwYDVR0gBIIBOjCCATYwggEyBgoqgVCBKQEB
AQQCMIIBIjAvBggrBgEFBQcCARYjaHR0cDovL3d3dy50cnVzdDI0MDguY29tL3Jl
cG9zaXRvcnkwge4GCCsGAQUFBwICMIHhMBAWCVRSVVNUMjQwODADAgEBGoHMRm9y
IGFudmVuZGVsc2UgYWYgY2VydGlmaWthdGV0IGfmbGRlciBPQ0VTIHZpbGvlciwg
Q1BTIG9nIE9DRVMgQ1AsIGRlciBrYW4gaGVudGVzIGZyYSB3d3cudHJ1c3QyNDA4
LmNvbS9yZXBvc2l0b3J5LiBCZW3mcmssIGF0IFRSVVNUMjQwOCBlZnRlciB2aWxr
5XJlbmUgaGFyIGV0IGJlZ3LmbnNldCBhbnN2YXIgaWZ0LiBwcm9mZXNzaW9uZWxs
ZSBwYXJ0ZXIuMIGXBgNVHR8EgY8wgYwwLqAsoCqGKGh0dHA6Ly9jcmwuaWNhMDIu
dHJ1c3QyNDA4LmNvbS9pY2EwMi5jcmwwWqBYoFakVDBSMQswCQYDVQQGEwJESzES
MBAGA1UECgwJVFJVU1QyNDA4MR0wGwYDVQQDDBRUUlVTVDI0MDggT0NFUyBDQSBJ
STEQMA4GA1UEAwwHQ1JMNDA3OTAfBgNVHSMEGDAWgBSZj7oKia4hGkJ6Cq4aTE4i
/xDrjDAdBgNVHQ4EFgQUgiH0awIZ3n9hA0/cMJwkzKgZo9IwCQYDVR0TBAIwADAK
BgkqhkiG9woBAQsFAAOCAQEAfGq8FysUQudzJmPphm+mnOTg3/pIBmDjudk4xoij
gV8D3HwX/yp5hklgdN13573CI8QB9u4h2YSqrQotWUtnhm2ONoLJBMpc9NLKRITg
pSHDbivI2V6d3Tibm8i7OV6UgisCoAMVOLhaMZy6MAQ/s9pPnd9r3rBJRiybokmn
waI64ygIYmagkM7e8bl+UI8iRrLlfCtj1nV07qM1dWCqGVQCTlpLK4mqO1Y1YnoX
TmH9feCi1UOr3NFsDkouVHrcFa2rYz0OROSTmfQkE90A9f/TqHAx6fRM7bD8BVMX
IeKIRNpzOWlfA+NxHL4sVXDwX4s6msYz0YRo4d614RGXVgo=

Either we decode wrong, you encode wrong or the input is invalid.
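One quick way to narrow down which of the three it is would be to decode the Base64 body and check whether the length declared in the outer DER SEQUENCE header matches the number of bytes actually present. The sketch below is only an illustration: the helper names are made up, it inspects the outermost header only, and the sample data is synthetic rather than the certificate above.

```python
import base64


def decode_pem_body(pem_body: str) -> bytes:
    """Strip all whitespace from a PEM body and Base64-decode it."""
    return base64.b64decode("".join(pem_body.split()))


def outer_der_length_matches(der: bytes) -> bool:
    """Check that the outer SEQUENCE length equals the bytes actually present."""
    if len(der) < 2 or der[0] != 0x30:  # 0x30 = SEQUENCE tag
        return False
    first = der[1]
    if first < 0x80:
        # Short form: the length is the byte itself.
        header_len, content_len = 2, first
    else:
        # Long form: the low 7 bits give the number of length bytes.
        n = first & 0x7F
        if n == 0 or len(der) < 2 + n:
            return False
        header_len = 2 + n
        content_len = int.from_bytes(der[2:2 + n], "big")
    return len(der) == header_len + content_len


# Synthetic example: a 200-byte SEQUENCE, then the same data with one stray byte.
good = b"\x30\x81\xc8" + bytes(200)
damaged = good + b"\n"

print(outer_der_length_matches(good), outer_der_length_matches(damaged))
```

A mismatch only shows that bytes were added or lost somewhere between the sender and the point of measurement; it does not say where.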

JacobBangSSE commented 5 years ago

You have encoded it wrong. If you remove all newlines and decode the Base64, you can see that every "newline" in the file ends with CRLF bytes. This problem is introduced when you copy binary data on Windows and the data contains mixed types of line endings: Windows tries to replace all the inconsistent line endings and writes CRLF at the end.

I found that problem yesterday and spent several hours debugging before I opened a hex editor and found the difference.
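To illustrate why this kind of line-ending rewrite is fatal for binary data, here is a minimal sketch. The normalization function is an assumption about what a text-mode copy might do, not a reconstruction of what actually happened to the file; the sample bytes are the DER content of the sha256WithRSAEncryption OID, which happens to contain a 0x0D (CR) byte.

```python
import re


def normalize_to_crlf(data: bytes) -> bytes:
    """Simulate a text-mode tool that rewrites every line ending as CRLF."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


# OID 1.2.840.113549.1.1.11 (sha256WithRSAEncryption), content bytes only.
oid = bytes.fromhex("2a864886f70d01010b")
rewritten = normalize_to_crlf(oid)

print(oid.hex())
print(rewritten.hex())
print(len(oid), len(rewritten))
```

The rewritten value is one byte longer, so every enclosing DER length field is now wrong. Reading and writing such files in binary mode avoids the rewrite.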

finnha commented 5 years ago

Done this:

Then we will send our request to the production endpoint using a production certificate to:

https://lprws.sds.dsdn.dk/cda-ws/DocumentRepository_Service/PatientHealthcareValidateReportingService

finnha commented 5 years ago

I was a little in a hurry, trying again; sorry

TueCN commented 5 years ago

@RasmusThernoe @finnha Sorry you misunderstood me. We want you to upload a dump to the FTP site in the "gh288" folder, like you did before (we are not currently monitoring all TCP traffic on prod and test).

What we want is a dump of one of your TEST messages that gets accepted on test (we already have one of your PROD messages)

RasmusThernoe commented 5 years ago

OK. We will create a dump with test certificate and upload to the FTP site.

jonigkeit commented 5 years ago

> You have encoded it wrong. If you remove all newlines and decode the Base64, you can see that every "newline" in the file ends with CRLF bytes. This problem is introduced when you copy binary data on Windows and the data contains mixed types of line endings: Windows tries to replace all the inconsistent line endings and writes CRLF at the end.
>
> I found that problem yesterday and spent several hours debugging before I opened a hex editor and found the difference.

I will download the file directly from the sftp site onto linux, hoping that the error did not occur prior to uploading it to the sftp site 🙏

jonigkeit commented 5 years ago

The uploaded file contains DOS endings

jonigkeit commented 5 years ago

We will try the soap-ui project to see whether we are able to XOP/MTOM encode the STS test certificate and send it to production. If we get rejected because of an invalid certificate (i.e. test vs. prod), the likelihood of the error being on the receiving end should be minimal.

jonigkeit commented 5 years ago

We have found the problem

Load balancing manipulates the content in a way that corrupts non-utf-8 data, e.g. the XOP/MTOM encoded certificate.
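As an illustration of how this class of bug can arise, the sketch below pushes raw bytes through a UTF-8 decode/encode round trip. Whether the load balancer did exactly this is an assumption; the thread only states that non-UTF-8 data was corrupted. The sample bytes are the first four bytes of a DER structure in the style of a certificate header.

```python
import base64


def utf8_round_trip(raw: bytes) -> bytes:
    """Treat binary data as UTF-8 text and re-encode it (a lossy operation)."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


der_prefix = bytes([0x30, 0x82, 0x06, 0x1E])
mangled = utf8_round_trip(der_prefix)

print(der_prefix.hex())
print(mangled.hex())

# Base64 text is plain ASCII, so the same round trip leaves it untouched.
inline = base64.b64encode(der_prefix)
print(utf8_round_trip(inline) == inline)
```

The byte 0x82 is not valid UTF-8 on its own and is replaced by the three-byte replacement character, whereas a Base64-inlined certificate is pure ASCII and passes through unchanged.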

Workaround: set the HTTP header X-Record-Target to 1

POST /cda-ws/DocumentRepository_Service/PatientHealthcareValidateReportingService HTTP/1.1\r\n
X-Record-Target: 1\r\n
...
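For clients that can set custom headers, the workaround might look like the sketch below, which uses Python's standard urllib. It only builds the request and does not send it; the body and Content-Type value are placeholders, and `build_request` is a hypothetical helper name.

```python
import urllib.request

URL = (
    "https://lprws.sds.dsdn.dk/cda-ws/DocumentRepository_Service/"
    "PatientHealthcareValidateReportingService"
)


def build_request(body: bytes, content_type: str) -> urllib.request.Request:
    """Build a POST request that carries the X-Record-Target workaround header."""
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "X-Record-Target": "1",  # fixed value, as suggested in the thread
        },
    )


# Placeholder body and content type, for illustration only.
req = build_request(b"...", 'multipart/related; type="application/xop+xml"')

# urllib normalizes header names to "X-record-target"; HTTP header names
# are case-insensitive, so this is equivalent on the wire.
print(req.header_items())
```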

RasmusThernoe commented 5 years ago

Could it be a temporary workaround to bypass the load balancer?

We would like to send our CDA documents and get the accumulated error list.

jonigkeit commented 5 years ago

yes if you set the header to a fixed value you should be fine

JacobBangSSE commented 5 years ago

We are not able to set custom HTTP header values in our integration platform (TIBCO). So this workaround is going to be rather complicated for us, and not something we can do directly at our customer right now.

Therefore we hope it is possible to make a workaround at your end.

TueCN commented 5 years ago

Hmm we can deploy a bugfix that should fix it. Just need to get clearance for an unscheduled deployment activity

Update: We got go ahead, preparing deployment, will write once deployed

TueCN commented 5 years ago

Deployed. Please try sending a new request now.

Note that if your certificate is validated, the first call to the web service can take up to 1 minute due to "cold start" of caches etc (something we will fix in the future).

jonigkeit commented 5 years ago

Btw: we have updated the soap-ui project to send an XOP/MTOM encoded certificate

jonigkeit commented 5 years ago

@RasmusThernoe @JacobBangSSE @finnha any news?

RasmusThernoe commented 5 years ago

In order to send to the production endpoint we need to sign with a production certificate.

@finnha is currently in transit between the office and home, hence the wait. He will try later tonight.

jonigkeit commented 5 years ago

ok

TueCN commented 5 years ago

Good news: I can see you have gotten through to the service! Bad news: I see some very strange behavior... Extremely long response times and lots of exceptions..

I am looking into it. I might need to restart the servers

finnha commented 5 years ago

Yes, it is good news. And 26 requests per minute is not good.

TueCN commented 5 years ago

Sorry this might take a few more minutes. The distributed cache is freaking out like I've not seen before. I will try and shut down all servers and only start one server to begin with, to avoid synchronization issues. Stand by.

TueCN commented 5 years ago

Ok we have normal service with 1 server now :) Response times look good on our end. I will add additional servers so you can crank up concurrency for extra throughput if desired

finnha commented 5 years ago

Yes, I can see it rolling :-)

finnha commented 5 years ago

I have now sent approx. 2500 requests

TueCN commented 5 years ago

All servers are up and everything looks normal.

Sorry for the holdup; you are apparently the only ones (among those who have tested so far) that encode the certificate as non-UTF-8 data.

Good luck with the execution of the rest of your tests :)

finnha commented 5 years ago

Thank you; we are moving on. I expect to throw 250,000 requests at you :-) Good weekend to you; so far ;-)

KirstenLHSDS commented 5 years ago

Good work! Thank you all of you :-)

RasmusThernoe commented 5 years ago

Closing.

https://github.com/scandihealth/lpr3-docs/issues/291:

Tomorrow we will deploy our load-balancing middleware on TEST (even though there is only 1 back-end server) to make it as similar to PROD as possible.

scandihealth / lpr3-docs

MTOM in SOAP in production #288