stripe / stripe-terminal-ios

Stripe Terminal iOS SDK
https://stripe.com/docs/terminal/sdk/ios
Other
104 stars 65 forks source link

WisePOS Won't Stay Connected #284

Closed benshepherd-ps closed 6 months ago

benshepherd-ps commented 9 months ago

Summary

Since upgrading to the version 3.2.1 SDK (from version 2.20), our kiosk applications won't stay connected to their WisePOS readers. Our app connects to the reader when the app launches and stays connected. Since the device is running in single-app-mode, the app typically isn't backgrounded or restarted very often. This hasn't changed. What has changed is we are seeing repeated iOS network Task errors in the console: Task <3BEEAFA6-3FAF-4303-8EE0-E6030423EB2E>.<31> finished with error [-1003] Error Domain=NSURLErrorDomain Code=-1003 "A server with the specified hostname could not be found." However, the SDK doesn't register a true disconnection until our app attempts to interact with the reader in some way.

Code to reproduce

Generally, the reader will stay connected for roughly 30minutes before we start seeing repeated NSURL errors in the console: Task <3BEEAFA6-3FAF-4303-8EE0-E6030423EB2E>.<31> finished with error [-1003] Error Domain=NSURLErrorDomain Code=-1003 "A server with the specified hostname could not be found." UserInfo={_kCFStreamErrorCodeKey=8, NSUnderlyingError=0x280834060 {Error Domain=kCFErrorDomainCFNetwork Code=-1003 "(null)" UserInfo={_kCFStreamErrorDomainKey=12, _kCFStreamErrorCodeKey=8, _NSURLErrorNWResolutionReportKey=Resolved 0 endpoints in 39ms using udp from cache, _NSURLErrorNWPathKey=satisfied (Path is satisfied), interface: en0[802.11], ipv4, dns, uses wifi}}, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask <3BEEAFA6-3FAF-4303-8EE0-E6030423EB2E>.<31>, _NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalDataTask <3BEEAFA6-3FAF-4303-8EE0-E6030423EB2E>.<31>" ), NSLocalizedDescription=A server with the specified hostname could not be found., NSErrorFailingURLStringKey=https://gator.stripe.com:443/protojsonservice/GatorService, NSErrorFailingURLKey=https://gator.stripe.com:443/protojsonservice/GatorService, _kCFStreamErrorDomainKey=12} Timestamp: 2024-02-08 14:19:19.801086-06:00

Followed by: Background Task 25 ("stripe-terminal.scplog.flushtraces"), was created over 30 seconds ago. In applications running in the background, this creates a risk of termination. Remember to call UIApplication.endBackgroundTask(_:) for your task in a timely manner to avoid this. Timestamp: 2024-02-08 14:19:50.929380-06:00 | Category: BackgroundTask

Followed by: The same NSURL errors

Followed by: An actual unexpected disconnect from the SDK approximately 5min after the above errors start.

iOS version

iPadOS 17.3

Installation method

SPM

SDK version

3.2.1

Other information

This happens while the kiosk is idle typically. We have a status icon that shows the reader status, and it shows "connected" until the guest gets to checkout and starts the payment process. At this point, the reader will register the official disconnect and payment collection will fail. In the background, we are responding to the unexpected disconnect and re-connecting to the reader. So the 2nd attempt typically works.

Should we NOT attempt to stay connected to the reader indefinitely and instead connect when the checkout process starts?

We were excited to upgrade to 3.x to get around the occasional DNS issues that we've seen, but this issue has now shown up.

benshepherd-ps commented 9 months ago

Some other console logs are shown as well. The problems started about 42min after the last reader connection: Connection 861: received failure notification Connection 861: failed to connect 3:-9816, reason -1 Connection 861: encountered error(3:-9816) nw_connection_copy_connected_local_endpoint_block_invoke [C861] Client called nw_connection_copy_connected_local_endpoint on unconnected nw_connection nw_connection_copy_connected_remote_endpoint_block_invoke [C861] Client called nw_connection_copy_connected_remote_endpoint on unconnected nw_connection nw_connection_copy_protocol_metadata_internal_block_invoke [C861] Client called nw_connection_copy_protocol_metadata_internal on unconnected nw_connection tcp_output [C863:3] flags=[R.] seq=1694198032, ack=3555165622, win=2048 state=CLOSED rcv_nxt=3555165622, snd_una=1694198032 Connection 863: received failure notification Connection 863: failed to connect 3:-9816, reason -1 Connection 863: encountered error(3:-9816)

These seem interesting: boringssl_context_handle_fatal_alert(2072) [C867:2][0x1350d38d0] read alert, level: fatal, description: inappropriate fallback boringssl_session_handshake_incomplete(210) [C867:2][0x1350d38d0] SSL library error boringssl_session_handshake_error_print(44) [C867:2][0x1350d38d0] Error: 5121355136:error:1000043e:SSL routines:OPENSSL_internal:TLSV1_ALERT_INAPPROPRIATE_FALLBACK:/Library/Caches/com.apple.xbs/Sources/boringssl/ssl/tls_record.cc:592:SSL alert number 86 nw_protocol_boringssl_handshake_negotiate_proceed(779) [C867:2][0x1350d38d0] handshake failed at state 12288: not completed Connection 867: received failure notification Connection 867: failed to connect 3:-9860, reason -1 Connection 867: encountered error(3:-9860) nw_connection_copy_connected_local_endpoint_block_invoke [C867] Client called nw_connection_copy_connected_local_endpoint on unconnected nw_connection nw_connection_copy_connected_remote_endpoint_block_invoke [C867] Client called nw_connection_copy_connected_remote_endpoint on unconnected nw_connection nw_connection_copy_protocol_metadata_internal_block_invoke [C867] Client called nw_connection_copy_protocol_metadata_internal on unconnected nw_connection Task .<1> HTTP load failed, 0/0 bytes (error code: -1200 [3:-9860]) Task .<1> finished with error [-1200] Error Domain=NSURLErrorDomain Code=-1200 "An SSL error has occurred and a secure connection to the server cannot be made." UserInfo={NSErrorFailingURLStringKey=https://192.168.7.216:4443/protojsonservice/JackRabbitService, NSLocalizedRecoverySuggestion=Would you like to connect to the server anyway?, _kCFStreamErrorDomainKey=3, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask .<1>, _NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalDataTask .<1>" ), NSLocalizedDescription=An SSL error has occurred and a secure connection to the server cannot be made., NSErrorFailingURLKey=https://192.168.7.216:4443/protojsonservice/JackRabbitService, NSUnderlyingError=0x2809df330 {Error Domain=kCFErrorDomainCFNetwork Code=-1200 "(null)" UserInfo={_kCFStreamPropertySSLClientCertificateState=0, _kCFNetworkCFStreamSSLErrorOriginalValue=-9860, _kCFStreamErrorDomainKey=3, _kCFStreamErrorCodeKey=-9860, _NSURLErrorNWPathKey=satisfied (Path is satisfied), viable, interface: en0[802.11], ipv4, dns, uses wifi}}, _kCFStreamErrorCodeKey=-9860}

bric-stripe commented 8 months ago

sorry for the delay responding here.

Should we NOT attempt to stay connected to the reader indefinitely and instead connect when the checkout process starts?

nope, staying connected to a WisePOS E like you are should be fine.

We have a status icon that shows the reader status, and it shows "connected" until the guest gets to checkout and starts the payment process

and is it that a collectPaymentMethod is called and that fails with a disconnected reader error? (the SDK is sending heartbeat requests to the reader and should indeed disconnect on its own when those start failing...not sure yet what's going on)

The fact that requests to https://gator.stripe.com/ is getting A server with the specified hostname could not be found. as well as the requests to the reader getting SSL errors makes it sound like the device itself is in an unhappy state. The gator requests should be simple NSURLSession requests to a Stripe server. Do you always see the stripe.com errors in addition to the local network failures or was that possibly a one-off?

You're able to reproduce this 100% of the time after leaving the app idle? Aside from kiosk mode is there anything else we should know to try to reproduce? When idle the app isn't interacting with the SDK at all?

can you also include a serial number of a reader that hit this in the last 30 days along with rough time that it happened? (+/- a few hours is fine) and if you have the matching console logs from that instance that would be great too but not necessary.

benshepherd-ps commented 8 months ago

Sorry - we don't have any logs from 3 weeks ago. We were able to consistently reproduce the original issue and we were able to see it happening in production across multiple sites and kiosks (we use the Embrace.io platform for logging). Because of this, we changed the architecture to only connect to the reader when the checkout flow is started, and disconnect when the session ends. We are now on 3.3.1 as well. When our app starts (which can sometimes be only once a week), we start discovery and cache the reader object that was previously connected to. Since the new SDK specifies that we don't need to start discovery to connect, we simply connect directly to this reader when the guest checkout starts and disconnect when checkout is complete and the session ends. Our error rates have dropped pretty significantly with this approach. We have seen the occasional "reader disconnected unexpectedly" but so far these are minimal.

The original issue and the unexpected disconnects could be as a result of something on our Wi-FI networks (IP lease renewals, access point reboots, etc) but when it occurs it tends to not correct itself, as we do try to re-connect automatically when this happens.

I can provide reader serial numbers and timestamps for recent "unexpected disconnects" if that helps, but not for the original issue which was a "silent" disconnection due to an SSL error and the SDK didn't fire the delegate to let us know.

Our app doesn't need to interact much with the Stripe SDK until checkout, especially now that we aren't staying connected to the reader at all times. At that point, we start the payment flow by creating an intent and collecting payment. We put in a guard at this point to ensure we are connected to the reader and if not, we show an alert to ask an associate for help.

With the original issue, I don't believe we ALWAYS saw failures with https://gator.stripe.com/ - many times we'd just see multiple connection errors when calling the reader directly via IP (Embrace logs all network traffic so we can see this in production).

bric-stripe commented 8 months ago

The original issue and the unexpected disconnects could be as a result of something on our Wi-FI networks (IP lease renewals, access point reboots, etc)

I was just discussing this with a colleague and the IP address changing came up. Going to try to force that to see if we have an issue there where the SDK doesn't handle that correctly. Though if that were the case I would expect the caching of SCPReader instances to have a similar issue where the IP cached in the reader instance is no longer correct and so the connect would fail and require a new discover call. So if that approach is working well it may not be that.

I can provide reader serial numbers and timestamps for recent "unexpected disconnects" if that helps

yeah, I'd still like to take a look at a few just to see if anything jumps out.

and just out of curiosity: have you at least seen any DNS issues have gone away since upgrading to 3.x?

benshepherd-ps commented 8 months ago

I was just discussing this with a colleague and the IP address changing came up. Going to try to force that to see if we have an issue there where the SDK doesn't handle that correctly. Though if that were the case I would expect the caching of SCPReader instances to have a similar issue where the IP cached in the reader instance is no longer correct and so the connect would fail and require a new discover call. So if that approach is working well it may not be that.

Agree that would be an issue if the IP addresses were changing. We do log the IP address of the reader upon connection to Embrace. We haven't seen it often, but pretty sure it has happened. As long as the iPad restarts following any sort of network event, it should re-discover and pick up the correct reader object and IP.

I can provide reader serial numbers and timestamps for recent "unexpected disconnects" if that helps

yeah, I'd still like to take a look at a few just to see if anything jumps out.

Here's one example from earlier today: 12:35:21 PM central Reader Serial#: WSC513142045154 The logs show successful heartbeat connections to the reader followed by an unexpected disconnect. Embrace screenshot:

Screenshot 2024-02-29 at 1 26 58 PM

and just out of curiosity: have you at least seen any DNS issues have gone away since upgrading to 3.x?

Yes, we believe so but it was sometimes hard to tell what the actual issue was. We previously set up a separate network at each site with the Stripe-specified custom DNS settings. Now, we are moving the readers and iPads back to the core network and seeing good success. Previously, we would have rampant issues whenever some event happened (like a power outage or other network event) and we'd have to manually flush DNS caches when the network came back up to get the readers to connect.

bric-stripe commented 6 months ago

@benshepherd-ps I'm going to close this out since you've got a workaround and we haven't been able to reproduce and haven't had other reports of this. But please re-open if this is still causing you trouble or have any more info that may help us debug this.