minvws / nl-covid19-coronacheck-app-android

European Union Public License 1.2
44 stars 28 forks source link

Adjust the time used for domestic QR code generation with clock skew #75

Closed hvisser closed 2 years ago

hvisser commented 3 years ago

For various reasons the device clock might be set incorrectly, whilst verification of the QR code by the scanner app requires an accurate timestamp for the challenge. To make generation of the code more resilient, the server time is used to adjust the timestamp used for generation to the "actual" timestamp.

As this timestamp is determined when the app configuration is retrieved only and not persisted, the adjustment currently requires a the app to be online when the app config is checked. This is probably OK for the common case in NL where most phones will be online.

During testing I noticed that the determined clock difference is larger than it actually is (around ~4 sec on my device). This might be caused by processing the response that seems to take quite some time, but in practice that extra offset doesn't affect this functionality, as long as it's under the threshold of maximum configured time skew.

This PR depends on changes to the nl-covid19-coronacheck-mobile-core; as this project uses a binary (unversioned) blob of that library from a private repository the version included with this PR most likely must be updated after merging both change to mobile core (https://github.com/minvws/nl-covid19-coronacheck-mobile-core/pull/4) and this PR.

This fixes #67 though the message has not been removed or disabled; it is probably still useful for cases where the clock skew is not automatically corrected and to alert a user that the device time in general might be incorrect.

hvisser commented 3 years ago

Nice!

Not sure why aar file is changed.

Yes, this is what I meant with "This PR depends on changes to the nl-covid19-coronacheck-mobile-core; as this project uses a binary (unversioned) blob of that library from a private repository the version included with this PR most likely must be updated after merging both change to mobile core (minvws/nl-covid19-coronacheck-mobile-core#4) and this PR." in the description. In order to actually make this change I need to depend on changes in the mobile-core PR and as this library is not deployed or built as part of the app build there's no other option but to include a binary version of it.

My suggestion would be to (internally) merge this PR after merging the mobile-core PR and then update the aar using the script that is in this repo. I can't do that myself because the repository referenced is private.

Not sure if PR requires new test written.

Nothing substantially has changed here, and I think the code calling the mobile core here is already under test as far as I can tell.

emartynov commented 3 years ago

Nice! Not sure why aar file is changed.

Yes, this is what I meant with "This PR depends on changes to the nl-covid19-coronacheck-mobile-core; as this project uses a binary (unversioned) blob of that library from a private repository the version included with this PR most likely must be updated after merging both change to mobile core (minvws/nl-covid19-coronacheck-mobile-core#4) and this PR." in the description. In order to actually make this change I need to depend on changes in the mobile-core PR and as this library is not deployed or built as part of the app build there's no other option but to include a binary version of it.

My suggestion would be to (internally) merge this PR after merging the mobile-core PR and then update the aar using the script that is in this repo. I can't do that myself because the repository referenced is private.

Not sure if PR requires new test written.

Nothing substantially has changed here, and I think the code calling the mobile core here is already under test as far as I can tell.

Sorry, I tend to read PR description to quick.

About the test - you're returning Clock with the new method. That didn't have test, and that should be possible to cover with unit test.

confiks commented 3 years ago

I understand that this PR solves the problem for your situation, where you have an online phone, and apparently the phone automatic time is skewed consistently for too many seconds. Unfortunately, Android prefers to use phone network time instead of NTP time, as iOS uses. At the moment, we're only seeing some Fairphones and Nokia's having the skewing issue consistently, but please let us know if other models are also affected. Users should see the clock deviation warning in that case (if they don't, it's probably not the same issue).

This leads leads to a red screen in combination with a maximum difference of 60 seconds between the verifier time and the time embedded in the QR-code, and the fact that only every 30 seconds a new QR-code is generated at the holder side.

The solution of this PR is not sufficient for a large group of users who don't use CoronaCheck in combination with an internet connection on location. Even if the caching issues with this PRs getAdjustedClock are fixed. CoronaCheck is meant to be usable in an offline setting. Caching the server time leads to problems in edge cases where the user then manually sets the time, reboots the phone, etc, potentially during 28 days of not using the app with an internet connection.

However, you're right that the current behavior is not sufficient with phones that consistently skew the time on automatic settings. We're currently proposing to solve this in a different way, namely to increase the validity time of the QR code from 60 to 90 seconds, and to regenerate the QR-code every 15 seconds. This allows the time to be set manually as well. The texts in the app should then note this, so affected users are able to correct it by manually setting the time to the minute.

Additionally, the server time could be used as a primary method, with the phone time as backup mechanism, but there need to be lots of guards around that to detect manual intervention and prevent scenario's where users can't recover themselves.

hvisser commented 3 years ago

The solution of this PR is not sufficient for a large group of users who don't use CoronaCheck in combination with an internet connection on location.

I find that an interesting statement, as you also imply that only a small fraction of phones are affected. Doesn't every bit help? Especially in NL it's likely that most interactions with the CoronaCheck app will be in a connected scenario IMHO, and there would be definitely be a nice improvement for those on affected phones, further decreasing the amount of failures. Wouldn't you agree that in stead of striving for the perfect solution, this is already a big improvement? Syncing the time against some kind of source whilst offline is obviously not trivial and probably not very feasible.

I'm kind of confused that on the one hand the argument is that not a lot of phones are affected and therefore this fix is overkill, but on the other hand it doesn't work whilst offline and thus the fix is not sufficient.

Even if the caching issues with this PRs getAdjustedClock are fixed. CoronaCheck is meant to be usable in an offline setting. Caching the server time leads to problems in edge cases where the user then manually sets the time, reboots the phone, etc, potentially during 28 days of not using the app with an internet connection.

I assume your concern is that the skew is cached in the ClockDeviationUseCase? There could be definitely be a mitigation for this by listening to the system broadcast when the time is adjusted if that turns out to be an issue, but note that the time is only cached whilst the process is alive, nothing is persisted on the phone for this. So even if the app ends up using the "wrong" offset, rebooting or just killing the app and restarting it would fix this. And apps that are only used once in a while like CoronaCheck will also be killed by the OS whilst in the background, so I'd expect that caching the time offset in the process wouldn't be a huge issue in this case. Additionally every time the user returns to the app the app config is retrieved which would then "resync" the clock skew too.

Additionally, the server time could be used as a primary method, with the phone time as backup mechanism, but there need to be lots of guards around that to detect manual intervention and prevent scenario's where users can't recover themselves.

This is effectively what is happening when you correct local time by the offset to make it sync to server time right?

Like I pointed out in the issue linked is that the frustrating bit about this (for me as a user) is that I get a warning and I have no way to actually correct it in a meaningful way. This PR helps in that regard, it will still tell the user that something is wrong, but won't "blame" them with a red screen in most cases with this mitigation in place. Even if their phone is way off, in case someone actually prefers to set their time manually for some reason.

confiks commented 3 years ago

Especially in NL it's likely that most interactions with the CoronaCheck app will be in a connected scenario IMHO

Indeed, in the majority of cases people are connected. But please keep in mind that there are a lot of people who only have an internet connection via WiFi, their mobile budget has been exceeded, or Vodafone is dead twice in the same week, among other reasons. CoronaCheck is an offline-by-default app; we cannot solve this problem by just assuming connectivity. Any solution must take offline-by-default as starting point.

Counting helpdesk reports about an unsolvable clock skew warnings versus connectivity issues overwhelmingly favors the latter. That's not to say this issue is in a non-issue.

I'm kind of confused that on the one hand the argument is that not a lot of phones are affected and therefore this fix is overkill

I'm not trying to make an 'overkill' argument at all. My argument is that the fix as in the PR is wrong for many people without constant connectivity. It can be improved to be sufficient though, as I said in the last paragraph of my previous response, and which you're now also getting at in the bit about "if the app ends up using the "wrong" offset, rebooting or just killing the app and restarting it would fix this". We must be sure it is able to fail gracefully while being offline for multiple days without the app being active, and changing the time.

I assume your concern is that the skew is cached in the ClockDeviationUseCase?

AFAIK the server time and skew isn't cached at the moment. hasDeviation guards against that and returns false by default, ~but getAdjustedClock does not, which is a solvable bug in this implementation~ I guess that's fine as you're offsetting the clock with zero milliseconds. But caching has it's own problems, as explained above.

If you're intention is not to cache, I misunderstood that part, and this fix would be a lot less aggressive. It would still need most of the guards that hasDeviation has around checking system uptime, and returning the raw clock if it detects any clock changes, or just the same calculation as with hasDeviation.

There could be definitely be a mitigation for this by listening to the system broadcast when the time is adjusted

Any system broadcasts (which is this specifically, by the way?) may not be picked up there as the may not be active at that time.

This is effectively what is happening when you correct local time by the offset to make it sync to server time right?

Well, in some way. But setting the time manually is very visible to the user. The (cached) server time is invisible, and in any case not something we would want to make visible.

Like I pointed out in the issue linked is that the frustrating bit about this (for me as a user) is that I get a warning and I have no way to actually correct it in a meaningful way.

Yes, I understand that frustration. I'm sorry we haven't yet fixed this issue for those phones. You're still able to make it go away by waiting for a whole minute, then setting the time, right? (and fixing potential red screens in the process) I do understand that's not what the message says at all. We're in any case going to solve it so you'll be able to correct it manually as outlined above (and changing the message to tell that).

confiks commented 3 years ago

(FYI if you're reading via mail: I made some edits to the previous comment)

jacobras commented 3 years ago

I understand that this PR solves the problem for your situation, where you have an online phone, and apparently the phone automatic time is skewed consistently for too many seconds. Unfortunately, Android prefers to use phone network time instead of NTP time, as iOS uses. At the moment, we're only seeing some Fairphones and Nokia's having the skewing issue consistently, but please let us know if other models are also affected. Users should see the clock deviation warning in that case (if they don't, it's probably not the same issue).

My mother is also experiencing this issue with her Nokia phone. She's been denied access to a venue twice because the QR code was invalid. At first I thought her mobile data was turned off (this was not the case/issue) but when testing the app I was then surprised to see the clock notice, indicating the app knows about the time difference but doesn't put that data to good use. (The time on the device was set to automatic 🕛).

CoronaCheck is an offline-by-default app; we cannot solve this problem by just assuming connectivity. Any solution must take offline-by-default as starting point.

We have now replaced CoronaCheck with printed QR codes, but this PR can still make the app more robust for a lot of other Nokia/Fairphone users. I understand that the app needs to be offline first and I applaud that approach. That, however, should not mean it cannot work a bit smarter and be more error resilient when a network connection is available.

In fact, we at Albert Heijn have dealt with the same issue for the Purchase Stamps redemption code. Without going into too much detail; we've resolved it the same way @hvisser suggests in this PR. Offline support: check ✅. More accurate generation when the device has connectivity: check ✅ (we even persist the offset).

hvisser commented 3 years ago

OK, I'm still confused that you are now bringing up the connectivity issue. As I see it users that are not connected will not see a regression, the app will work as it did before, failing on specific devices.

For users with affected devices that are connected however, this is an improvement. I hope you can agree on that. Given the small impact of this change I'm struggling to see why you'd not want to take this improvement?

There seems to be some confusion around the word "caching". In reality there's no caching of server time, it's more like holding the difference between server time and the local clock in memory only for lifetime of the app process or when it is updated again (which is I believe every time the app is brought to the foreground). I'd consider it a cache if the value would be persisted with the app config for example, this is not the case.

You are right that system broadcasts are not picked up when the app process is not alive, but in that case there would be no need to do anything at all because there wouldn't be any offset held anywhere since it is not persisted. ACTION_TIME_CHANGED could possibly be used for this. The question is if this is needed though. Worse case scenario that I can think of that the offset is totally wrong if the user or the system adjusts the clock, but "swiping away" the app and opening again would always fix this. In reality a user would probably open the app just before entering a venue ("refreshing" the offset) and it would be killed fairly soon again once the app is in the background again so I'd expect this risk to be low.

Well, in some way. But setting the time manually is very visible to the users. The (cached) server time is invisible, and in any case not something we would want to make visible.

That's why I opted to keep the warning message for now. It currently states that the app "might not work correctly" and to set the time to automatic, but on affected devices the time is set to automatic and even if you'd set it to manual a user would be unable to set it at the resolution of seconds precision, aside from the fact that any manual adjustment would need to have the correct time (which the app is not telling you either).

So re:

... You're still able to make it go away by waiting for a whole minute, then setting the time, right? (and fixing potential red screens in the process)

I don't think that cuts it because I can never set the time at such precision that the app requires currently.

Adjusting for clock skew automatically in this situation is very reasonable imho, in fact this is what most, if not all TOTP authentication apps do as well (as @jacobras also mentions), it just is an extra mitigation for a problem that can happen and which helps the user and in turn prevents awkward moments at venues when the code scans red.

So the way I see it, doing something here is better than doing nothing, and the impact of this change seems low to me.

In the end I'm not here to convince you, my goal is to improve the app for everyone and, owning a device that is affected, myself in the process 😄

VWS is encouraging / asking and engaging the community to contribute, so here I am :) I was also hoping for an reply earlier on my filed issues, but I can understand that this is a hectic project and priorities might lie elsewhere, which is in fact also why I decided to dive in and to see if I can help out that way. I'd be happy to invest a couple of hours more if that would result in a solution that is acceptable to the project, as I'm sure others would be too.

confiks commented 3 years ago

Thanks for thinking along, in any case. I was initially confused about your goal here, and thought you actually intended to cache the skew / server time. Introducing caching will definitely create issues, as outlined above. Without it, the fix you're proposing is much less aggressive.

Can at least the same calculations as in hasDeviation be used in getAdjustedClock? They are necessary to cancel out any manual time setting done by the user while the app is active. Then we can test and consider this a bit more, also with regard to how this will affect behavior around the clock skew message.

confiks commented 3 years ago

I don't think that cuts it because I can never set the time at such precision that the app requires currently.

I really don't understand that, and it would help me if you can clear this up. If you attempt to set the time to exactly the same minute (ignoring any seconds), then if your attempts are equally spaced over time, and you restart the app, I would expect that at least half of the time the clock deviation message would disappear (and thus no scanning issues occur), because you're within 30 seconds of the server time (which is the threshold of the message). Regardless of the phone's behavior of which amount of seconds to choose, and regardless of server latency.

I certainly hope that the phone sets the amount of seconds to zero when setting the time manually, but there probably are differences. The phones I'm able to test with do that. So setting the time on the minute mark works to set it manually. You'd need to know the exact time in the first place though.

hvisser commented 3 years ago

Can at least the same calculations as in hasDeviation be used in getAdjustedClock? They are necessary to cancel out any manual time setting done by the user while the app is active.

I've updated that just now so that both methods use the same offset + added a test for the adjusted clock method.

I really don't understand that, and it would help me if you can clear this up.

OK I think you're right that you can fix it manually, so in that sense my wording was too strong. The device does reset the seconds when setting the time, but as you point out already doing this by trial and error is not very practical, especially without a "known good" time source. I wouldn't consider it a viable workaround for the average user of the app at least, which is what I should have said instead of "setting the time manually doesn't really cut it".

jacobras commented 3 years ago

My mother is also experiencing this issue with her Nokia phone. [..]

Did a little test yesterday with some friends. One had a Samsung, couldn't tell the exact type but one of the more budget phones. QR code failed continuously. Time was set to automatic but according to https://time.is it deviated four minutes! Reboot didn't fix it but manually setting the time did.

hvisser commented 3 years ago

Any updates on this?

Now that invalid DCC QR codes are being blocked in the latest release, can we fix false negatives as much as possible too? As @jacobras also implies this issue might affect more devices than just Nokia or Fairphones.

confiks commented 3 years ago

It's a ticket in the next sprint. We'll keep you updated.

BartNijland91 commented 2 years ago

Code has been merged in our private repo and will be included in the 2.6.0 release