Open daimpi opened 4 years ago
Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)
No, the new multiplier 5 was applied during the day, so for more correct values you would have to use the hourly key packages. 2 or 3 of these still used 10.
It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)
> Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)
Good news! I checked this twice and also uploaded the hourly packages. The number of users seems to be correct:
sum of hourly packages: 3+2+5+8+7+4+3+5 = 37 users
daily package:
37 user(s) found.
They submitted these numbers of keys:
4 user(s): 1 Diagnosis Key(s)
3 user(s): 4 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 6 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
25 user(s): 13 Diagnosis Key(s)
80 keys not parsed (16 without padding).
37 / 4*1, 3*4, 1*5, 1*6, 1*7, 1*8, 1*9, 25*13
hourly package 6:
Length: 390 keys
Padding Multiplier detected: 10
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13
hourly package 7:
Length: 260 keys
Padding Multiplier detected: 10
2 user(s) found.
They submitted these numbers of keys:
2 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
2 / 2*13
hourly package 9:
Length: 220 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 4 Diagnosis Key(s)
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
5 / 1*1, 1*4, 3*13
hourly package 11:
Length: 200 keys
Padding Multiplier detected: 5
8 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
2 user(s): 1 Diagnosis Key(s)
1 user(s): 3 Diagnosis Key(s)
2 user(s): 4 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
30 keys not parsed (6 without padding).
8 / 2*1, 1*3, 2*4, 1*8, 1*13 (1 old Android app(s))
hourly package 14:
Length: 345 keys
Padding Multiplier detected: 5
7 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
4 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
7 / 1*1, 1*7, 1*9, 4*13
hourly package 15:
Length: 195 keys
Padding Multiplier detected: 5
4 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
2 user(s): 13 Diagnosis Key(s)
20 keys not parsed (4 without padding).
4 / 1*1, 1*8, 2*13
hourly package 16:
Length: 195 keys
Padding Multiplier detected: 5
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13
hourly package 19:
Length: 155 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
1 user(s): 1 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
25 keys not parsed (5 without padding).
5 / 1*1, 1*5, 1*7, 1*13 (1 old Android app(s))
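The hourly figures above can be cross-checked with a little arithmetic. The following is only a sketch, not part of the parser: the table is transcribed by hand from the output above, the function names are made up for illustration, and it assumes that keys of users with an "Invalid Transmission Risk Profile" are reported among the unparsed keys.

```python
# hour: (length in keys, detected multiplier, {keys per user: users}, unparsed keys without padding)
# Values transcribed by hand from the tool output above.
HOURLY = {
    6: (390, 10, {13: 3}, 0),
    7: (260, 10, {13: 2}, 0),
    9: (220, 5, {1: 1, 4: 1, 13: 3}, 0),
    11: (200, 5, {1: 2, 3: 1, 4: 2, 8: 1, 13: 1}, 6),
    14: (345, 5, {1: 1, 7: 1, 9: 1, 13: 4}, 0),
    15: (195, 5, {1: 1, 8: 1, 13: 2}, 4),
    16: (195, 5, {13: 3}, 0),
    19: (155, 5, {1: 1, 5: 1, 7: 1, 13: 1}, 5),
}


def unpadded_keys(length, multiplier):
    """Number of real keys once the padding multiplier is divided out."""
    if length % multiplier:
        raise ValueError("length is not a multiple of the padding multiplier")
    return length // multiplier


def accounted_keys(users_by_key_count, unparsed):
    """Keys attributed to users plus keys the parser could not assign."""
    return sum(keys * users for keys, users in users_by_key_count.items()) + unparsed


def check(packages):
    """Map each hour to True if both key counts agree."""
    return {
        hour: unpadded_keys(length, mult) == accounted_keys(users, unparsed)
        for hour, (length, mult, users, unparsed) in packages.items()
    }


if __name__ == "__main__":
    for hour, ok in check(HOURLY).items():
        print(f"package {hour}: {'consistent' if ok else 'MISMATCH'}")
```

A mismatch in any row would point to a wrongly detected multiplier or a transcription error rather than to a statement about the real number of users.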
I assume package 19 has only 3 users. Old Android apps should no longer be possible since server version 1.0.9 went online: https://github.com/corona-warn-app/cwa-server/issues/640
One user with 13 keys (1.7-19.6), one user with 12 keys (1.7-19.6; no key for 24.6), and one user with 6 keys (1.7-26.6);
or 4 users if no hole is allowed: one user with 13 keys (1.7-19.6), one user with 7 keys (1.7-25.6), one user with 6 keys (1.7-26.6), and one user with 5 keys (23.6-19.6).
> I assume the package at 19: has only 3 users. Old android should no longer be possible after pushing server version 1.0.9 online corona-warn-app/cwa-server#640
This also affects package 11: one user with an "Invalid Transmission Risk Profile". So it might be 2 users fewer for yesterday.
> It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)
I think you are right, but how do you know that? 2*5 = 10, so packages 06 and 07 could also have a multiplier of 5.
> how do you know that? 2*5 = 10, so packages 06 and 07 could also have a multiplier of 5.
You are absolutely right. My claim was just based on the inspection of the hourly packages. I don't see any way to improve the estimated numbers for yesterday. Hopefully, we do not see these multiplier changes too frequently.
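The ambiguity can be made concrete with the package lengths from the output above. This is a minimal sketch, assuming every user in a package submitted the full 13 keys (which is what the tool reported for packages 6, 7 and 16); the function name is hypothetical and this is not how the parser detects the multiplier.

```python
def user_count_candidates(length, keys_per_user=13, multipliers=(5, 10)):
    """Return {multiplier: users} for every multiplier that divides the length evenly.

    Assumes all users in the package submitted exactly `keys_per_user` keys.
    """
    candidates = {}
    for multiplier in multipliers:
        padded_keys_per_user = multiplier * keys_per_user
        if length % padded_keys_per_user == 0:
            candidates[multiplier] = length // padded_keys_per_user
    return candidates


if __name__ == "__main__":
    # Package 6 has 390 keys: 3 users at multiplier 10, or 6 users at multiplier 5.
    print(user_count_candidates(390))
    # Package 16 has 195 keys: only multiplier 5 fits.
    print(user_count_candidates(195))
```

From the length alone, a package whose padded length fits both multipliers cannot be attributed to one of them without extra information.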
> I assume the package at 19: has only 3 users. Old android should no longer be possible after pushing server version 1.0.9 online corona-warn-app/cwa-server#640
> This also affects package 11: one user with an "Invalid Transmission Risk Profile". So, it might be 2 users less for yesterday.
For package 11 I get 7 users if holes are allowed and 8 users if they are not.
I am not sure if this is the correct place, but you may have seen the Spiegel interview with Mr. Spahn (here, paywalled). He says:
SPIEGEL: How many infections have been entered in the app so far? Spahn: We assume around 300 infections have been reported via the app so far. That is the number of encryption codes issued by the hotline in order to warn others. For data protection reasons we do not know more.
Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?
My guess is that it's in the first day, since no one knows how the packet from "2020-06-23" is actually padded.
> Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?
@kai-truempler: Thanks for sharing this. I totally agree and I would rather expect people not to call the hotline in case of a positive test (stigma, time, effort, etc.).
> My guess is that it's in the first day, since no one knows how the packet from "2020-06-23" is actually padded.
@janpf: This might be an issue. However, I want to point out that every day a significant number of keys are not parsed (vide infra).
Thus, I would expect the estimates by `diagnosis-keys` from @mh- to be rather conservative (which I personally prefer). I may add a chart with these unparsed key numbers. In the end, it would be beneficial if these numbers were published daily by an official institution such as the RKI (in addition to the daily download counts).
2020-06-23.dat:89 keys not parsed (8 without padding).
2020-06-24.dat:30 keys not parsed (3 without padding).
2020-06-25.dat:50 keys not parsed (5 without padding).
2020-06-26.dat:150 keys not parsed (15 without padding).
2020-06-27.dat:250 keys not parsed (25 without padding).
2020-06-28.dat:40 keys not parsed (4 without padding).
2020-06-29.dat:100 keys not parsed (10 without padding).
2020-06-30.dat:160 keys not parsed (16 without padding).
2020-07-01.dat:290 keys not parsed (29 without padding).
2020-07-02.dat:80 keys not parsed (16 without padding).
Oh absolutely true, I forgot about those "keys not parsed"
What might be beneficial: on my dashboard I just changed to an hourly analysis, as suggested above by @mh-. This means I check every hourly package and calculate the padding, number of keys, number of users etc. individually and then sum things up.
This way I'm currently at a total of 218 users and thereby off by a factor of 1.37 @kai-truempler ;) And if we now consider "keys not parsed" and Mr. Spahn maybe rounding numbers a bit I think it's very hard to get closer to the real number.
> In the end, it would be beneficial if these numbers were published daily by an official institution such as the RKI (in addition to the daily download counts).
Absolutely.
By parsing all keys you can get a minimum number of infected persons.
Theoretically, each single key could belong to a different person (the maximum).
If I count the minimum number of users who submitted keys, I get roughly 250 (23.6. - 02.07.). So there may be 250 users => 300 = round(250;-2)
And I'm back down to 188 as the parser just got updated: https://github.com/mh-/diagnosis-keys/commit/104388c7785ef4870e04e34e4290422b756e1ead
Ok, maybe I could change the strategy, now that "old Android apps" cannot submit Diagnosis Keys anymore. For this, it would be nice to understand what information you need from the parsing.
For example, just counting the number of users is very simple now, it would just require counting all keys with TRL 6, because every user will submit exactly one key with that TRL. (And of course divide by the padding multiplier.)
The harder part is to count the number of keys per each user, something that I wanted to do in order to find out if keys can be linked together (violating the "non-linkability-across-multiple-day" promise).
So what exactly do you want from the parser?
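The TRL 6 counting idea described above can be sketched in a few lines. This is an illustration only, not the code in `parse_keys.py`: the function name and the plain list of TRL values are assumptions, and it relies on the statements above that every user submits exactly one key with TRL 6 and that padding repeats each key `padding_multiplier` times.

```python
from collections import Counter


def estimate_users(trl_values, padding_multiplier, marker_trl=6):
    """Estimate the number of users from the TRL values of all keys in a package.

    trl_values: one transmission risk level per key, padding included.
    """
    marker_count = Counter(trl_values)[marker_trl]
    users, remainder = divmod(marker_count, padding_multiplier)
    if remainder:
        # The count does not fit the multiplier, so the multiplier is probably wrong.
        raise ValueError("marker key count is not a multiple of the padding multiplier")
    return users


if __name__ == "__main__":
    # Toy package: 2 users, multiplier 5, so 10 keys with TRL 6 plus other keys.
    toy_package = [6] * 10 + [1] * 50 + [8] * 30
    print(estimate_users(toy_package, padding_multiplier=5))
```

As noted later in the thread, a key chain without a "6" would be missed, so the result reads as a lower bound.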
Great idea counting the "6"s! Gonna change to that later for the overall user count and most likely going to keep your "counting script" as is for the "number of keys published per user".
Update: did change it and now we're back up to ~200. So still pretty far from the announced 300, but since there are only ~200 "6"s in the database this should be pretty reliable.
I added the option `-n` / `--new-android-apps-only` to the parser script. If you use it, the number of unparsed keys should decrease.
However, in the near future it might become impossible to do correct counting, see the end of https://github.com/mh-/diagnosis-keys/blob/master/doc/algorithm.md for details.
Just looking at the example you provided there, you can still at least provide the minimum user count. You can still have the case that it is in fact more users transmitting only random unconnected days, but if you have too many „1“s or „6“s than one user can have, it’s still at least two users. You could collect the minimum user count per risk level (minimum number for the 1s, 6s etc) and then take the max() of them to come to the absolute minimum users generating these keys.
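The per-level minimum suggested above could look roughly like this. It is a sketch under loud assumptions: the 13-entry risk profile below is illustrative only and not taken from this thread or confirmed against the server configuration, the function name is made up, and the input is expected to be de-padded already.

```python
from collections import Counter
from math import ceil

# ASSUMPTION: illustrative per-user profile, one TRL per submitted key (13 keys).
# Replace it with the profile that actually applies.
ILLUSTRATIVE_PROFILE = (6, 8, 8, 8, 5, 3, 1, 1, 1, 1, 1, 1, 1)


def minimum_users(trl_values, profile=ILLUSTRATIVE_PROFILE):
    """Lower bound on users: max over TRLs of ceil(observed / max per user)."""
    max_per_user = Counter(profile)
    observed = Counter(trl_values)
    unknown = set(observed) - set(max_per_user)
    if unknown:
        raise ValueError(f"TRL values not in profile: {sorted(unknown)}")
    return max(
        (ceil(count / max_per_user[level]) for level, count in observed.items()),
        default=0,
    )


if __name__ == "__main__":
    # Two keys with TRL 6 cannot come from one user under this profile.
    print(minimum_users([6, 6, 1, 1, 1]))
    # Eight keys with TRL 1 exceed the seven a single user can have here.
    print(minimum_users([1] * 8))
```

The true number can be higher, since several users submitting only a few unconnected days are indistinguishable from one user.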
Yes, in the example with the 14 keys, there must have been between 2 and 14 users. This is a wide range, though.
Ok, sure, but in almost all cases it will be the minimum number or very close to it. Which is good enough for the kind of analytics most are looking for.
Note: If you download the hour/day packages you will notice that their content changes. It seems keys older than 14 days are deleted. Keys are also moved to other days: 60 keys of 24.06 have now moved to 23.06, and 44 keys were deleted for 23.06. For 23.06 the hour files have changed from 08, 13, 17 to 10, 15, 18. So to get the right keys you have to use the files downloaded on the day they were published.
I have made an Excel sheet and examined the keys manually. I took into account that a device could be switched off for 1 or more days. Nearly every key chain could be assigned. Only the 23.06 08:00 keys are not so clear. 01.07 17:00 is the only package that contains a chain with no "6". I think the 6 was not submitted or was deleted since it was too old (17.06). All in all I get 219 (minimum) users who submitted keys. The maximum should be 241.
https://github.com/Tho-Mat/corona-stuff/blob/master/%C3%BCberblick.xlsx
> Note: If you download the hour/day packages you will notice that their content changes. So to get the right keys you have to use the files downloaded on the day they were published.
Is there any information on why they would do this?
> All in all I get 219 (minimum) users who submitted keys.
Just by counting "6"s I get 231 with the "new" packages for 23./24. and 226 with the old ones. And this method is still more of a lower bound, since it misses some, as you correctly pointed out:
> 01.07 17:00 is the only package that contains a chain with no "6".
Update: I noticed you're doing a "per-key"-padding analysis, while I'm on a "per-package"-basis. That explains the differences. 👍
> Is there any information on why they would do this?
I think they want to reduce traffic, since it makes no sense to check keys that are older than 14 days.
> Note: If you download the hour/day packages you will notice that their content changes. It seems keys older than 14 days are deleted. Keys are also moved to other days.
@Tho-Mat: Thanks for your comment. At first, I was already a little bit confused last night, because the old hourly packages were changed. My wrong assumption was that the clean-up of the keys older than 14 days is based on a package level and not on the individual key level.
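The difference between package-level and key-level clean-up can be illustrated with a small filter. This is a sketch only: the function name, the use of plain dates per key, and the inclusive cutoff are assumptions, not the server's actual retention logic.

```python
from datetime import date, timedelta

RETENTION_DAYS = 14  # taken from the "older than 14 days" observation above


def keys_still_served(key_dates, today, retention_days=RETENTION_DAYS):
    """Keep only keys whose own date is within the retention window.

    The decision is made per key, not per package, so an old package can
    shrink over time instead of disappearing as a whole.
    ASSUMPTION: a key dated exactly `retention_days` ago is still kept.
    """
    cutoff = today - timedelta(days=retention_days)
    return [key_date for key_date in key_dates if key_date >= cutoff]


if __name__ == "__main__":
    package = [date(2020, 6, 17), date(2020, 6, 24), date(2020, 7, 1)]
    print(keys_still_served(package, today=date(2020, 7, 8)))
```

Under this model, a package downloaded again later can contain fewer keys than on its publication day, which matches the advice to keep the files as originally downloaded.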
Just as an update to my previous comment, from Phoenix:
Lothar Wieler: "...[around] 500 teleTANs have been issued."
That looks closer to the estimate than the 300 from Mr. Spahn 10 days ago.
Fortunately, the RKI is publishing these numbers on a weekly basis. Thus, I have added another diagram for the published teleTANs last night. However, it is a single PDF which gets overwritten every week.
Looking at the number of issued teleTANs:
In one week (06/07-13/07), 125 teleTANs were issued. In the same period, `parse_keys.py` counted 102 unique users based on the hourly package data, which gives a ratio (users counted vs. issued teleTANs) of about 82%. I'm very interested in where the larger error comes from (estimated users vs. people getting a teleTAN but not sharing their keys). Furthermore, these statistics suggest that the intended way of sharing your keys, a lab test combined with a QR code, is insignificant at the moment.
I think this issue can be closed, now that padding multiplier is set to one on the server. @micb25 do you agree?
The plots in "Verteilung Transmission Risk Level (TRL) in Diagnoseschlüsseln" currently use the number of keys transmitted including the padded fake keys afaiu. As long as the padding factor stays the same this shouldn't be a problem. But this factor will change from tomorrow on (the plan is to bring it down to 1 eventually). The changes in the padding multiplier will cause some distortion in those graphs as new data will receive less weight.
My suggestion would be to use the data which has been corrected for this multiplier like in the "Geteilte Diagnoseschlüssel von positiv getesteten Personen" section. @mh- has introduced an automatic detection for the multiplier used in the data set in his parsing tool: https://github.com/corona-warn-app/cwa-server/issues/620#issuecomment-652511087
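Correcting the TRL distribution for a changing multiplier amounts to weighting each package by the inverse of its own padding multiplier before summing. The sketch below only illustrates that idea; the function name and the `(multiplier, trl_values)` input shape are hypothetical and not taken from the dashboard or the parsing tool.

```python
from collections import Counter


def depadded_trl_histogram(packages):
    """Sum TRL counts over packages after dividing out each package's padding.

    packages: iterable of (padding_multiplier, trl_values) pairs, where
    trl_values holds one TRL per key including the padded fake keys.
    """
    totals = Counter()
    for multiplier, trl_values in packages:
        for level, count in Counter(trl_values).items():
            totals[level] += count / multiplier
    return dict(totals)


if __name__ == "__main__":
    older = (10, [6] * 10 + [1] * 20)  # multiplier 10: really 1 key at TRL 6, 2 at TRL 1
    newer = (5, [6] * 5 + [1] * 5)     # multiplier 5: really 1 key at TRL 6, 1 at TRL 1
    print(depadded_trl_histogram([older, newer]))
```

Without the division, the older package in this toy example would count twice as heavily per real key as the newer one.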