Closed garethbowen closed 6 months ago
We can retrieve all of this information except "Last synced time" from the telemetry data gathered from users' devices. To build this kind of bird's eye view of a deployment's devices, we need to get each device's last telemetry entry and compile that data to be consumed by the frontend. It's fair to assume that we will work with thousands of telemetry entries so depending on how that data will be consumed, we are probably going to need to process the data in advance so that queries stay fast.
With this in mind, I see 3 potential solutions right off the bat:
I briefly experimented with the first two solutions in this branch.
With a dataset of 5200 telemetry entries, the telemetry entries query would take anywhere between 1 and 2 seconds, despite trying out a few couch optimizations like indexing on the `type` property and making a view.
I didn't have the capacity to quickly throw something together for the third solution, so I figured we should discuss where we want to go next before investing too much time in any one solution.
No matter which solution we go for, I believe we should have some form of data preprocessing to keep the "give me the data" query fast and preserve couch resources for more time-sensitive stuff like syncing.
That's all I can think of right now, but we will have a clearer idea of the implementation as we dive into it and get more specific about what/how we want to display the data.
@m5r Great writeup, thanks!
I definitely prefer the first solution if we can pull it off. Essentially this offloads the complexity of having a schedule to CouchDB to manage.
I had a quick look at your experimental branch and I think with some more advanced map/reduce features it can be improved and made efficient.
Try this out...
We need to emit the username and the date so we can group by the username and reduce the results to the latest date:
```js
function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id
    });
  }
}
```
Now that we have that we need to define a reduce function which can go over all emissions for a user and find the latest...
```js
function (keys, values) {
  var latest = { date: '0000-00-00' };
  values.forEach(function (value) {
    if (value.date > latest.date) {
      latest = value;
    }
  });
  return latest;
}
```
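For anyone wanting to sanity-check this map/reduce outside CouchDB, here's a small Node harness over hypothetical telemetry docs (the sample data and the local `emit` collector are illustrative only, not part of the proposed design):

```javascript
// Minimal harness to exercise the proposed map/reduce outside CouchDB.
// The sample docs and the emit collector are illustrative only.
const emitted = [];
const emit = (key, value) => emitted.push({ key, value });

const map = function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id
    });
  }
};

const reduce = function (keys, values) {
  var latest = { date: '0000-00-00' };
  values.forEach(function (value) {
    if (value.date > latest.date) {
      latest = value;
    }
  });
  return latest;
};

const docs = [
  { _id: 'a', type: 'telemetry', metadata: { user: 'gareth', year: 2023, month: 11, day: 2 } },
  { _id: 'b', type: 'telemetry', metadata: { user: 'gareth', year: 2023, month: 12, day: 26 } },
  { _id: 'c', type: 'feedback', metadata: { user: 'gareth' } } // ignored by the map
];
docs.forEach(map);

const latest = reduce(emitted.map(e => [e.key, e.value.id]), emitted.map(e => e.value));
console.log(latest.id); // 'b' (the most recent entry wins)
```

One caveat worth noting: because month and day are concatenated without zero-padding, the lexicographic comparison can mis-order dates (e.g. `'2023-2-1'` sorts after `'2023-12-1'`); zero-padding the components would make the comparison safe.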
I tested these queries using the browser URL but you can do the same with PouchDB APIs for the actual code.
We can query using the `group` param to get the list of the latest docs for each user:
```
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true
```
Or we can query for the latest doc for one or more users:
```
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&key="gareth"
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&keys=["gareth","m5r"]
```
This is all untested so you'll need to make changes to get it to work the way you want. Let me know if I can help explain how I got to this conclusion and/or why I think it's the best solution.
That's fantastic @garethbowen! I had tried to articulate this into code that satisfies CouchDB views' constraints but struggled to come up with a reduce function that actually worked. It's super neat to get any user's last device info, and it's pretty fast for this scenario: the query takes ~60ms when querying a single user's data.
It's still kind of slow when we query the whole database: it takes ~4.1s, and that's without including the actual documents. Simply adding `?include_docs=true` doesn't work and returns this error: `{"error":"query_parse_error","reason":"include_docs is invalid for reduce"}`. I'm still learning how to work with couch views, but my understanding is that we either need to query the docs separately or add the document to the indexed view like so:
```js
emit(doc.metadata.user, {
  date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
  id: doc._id,
  doc: doc, // <==
});
```
This is not ideal because it makes the index much bigger than it should be and we might need to tweak couch config to make it work.
I love how easy it was to tweak the map function to make it include all devices used by a user:
```js
function (doc) {
  if (doc.type === 'telemetry') {
    emit([doc.metadata.user, doc.metadata.deviceId], {
      date: `${doc.metadata.year}-${doc.metadata.month}-${doc.metadata.day}`,
      id: doc._id,
    });
  }
}
```
NB: my dataset is roughly 5200 telemetry entries from 1100 users and 1700 devices.
> It's still kind of slow when we query the whole database, it takes ~4.1s
That's a little surprising to me. Is that every time or just the first time? CouchDB caches results so the second time should be very quick.
Also, how big is the response body?
I'm not sure how you're planning to show this to the admins. If we're putting it in the admin app > users page then we should think about paginating it (showing 100k users on one page won't work!). So possibly requesting only one page of telemetry docs at a time using `keys=[ ... ]` would be fine.
> This is not ideal because it makes the index much bigger
Instead of including the entire doc which will destroy the index, it would be kinder to CouchDB to just include the few fields we're interested in. This might be a good compromise between the two.
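A sketch of that compromise: emit a small summary object with just the fields we care about, so `?include_docs=true` is unnecessary and the index stays lean. The `device.*` field names below are assumptions for illustration, not the actual telemetry schema:

```javascript
// Sketch: emit only the handful of fields the dashboard needs, rather than
// the whole doc. The field names under `device` are assumptions here.
const rows = [];
const emit = (key, value) => rows.push({ key, value });

const map = function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id,
      // Just the summary fields, not `doc: doc`, which keeps the index small.
      deviceId: doc.metadata.deviceId,
      webview: doc.device && doc.device.webview,
      apk: doc.device && doc.device.apk,
    });
  }
};

map({
  _id: 'telemetry-1',
  type: 'telemetry',
  metadata: { user: 'm5r', deviceId: 'd5b945ab', year: 2023, month: 12, day: 26 },
  device: { webview: '107', apk: 'v1.0.4-4' },
});

console.log(rows[0].value.webview); // '107', without indexing the full doc
```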
> Is that every time or just the first time? CouchDB caches results so the second time should be very quick.
There is a slight improvement on subsequent requests, but it stays above the 4-second mark for a response body of 148KB. Here are the timings of 3 requests right after re-creating the view:
```
~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000024s
       time_connect: 0.000133s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000187s
      time_redirect: 0.000000s
 time_starttransfer: 0.089156s
          ----------
         time_total: 4.493183s

~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000024s
       time_connect: 0.000107s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000135s
      time_redirect: 0.000000s
 time_starttransfer: 0.083579s
          ----------
         time_total: 4.086619s

~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000023s
       time_connect: 0.000121s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000157s
      time_redirect: 0.000000s
 time_starttransfer: 0.087817s
          ----------
         time_total: 4.096396s
```
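For reference, the `curl_format` file passed to `-w` above isn't included in the thread; a write-out template producing that shape of output would look something like the following (all variables are standard curl `--write-out` variables):

```
    time_namelookup: %{time_namelookup}s\n
       time_connect: %{time_connect}s\n
    time_appconnect: %{time_appconnect}s\n
   time_pretransfer: %{time_pretransfer}s\n
      time_redirect: %{time_redirect}s\n
 time_starttransfer: %{time_starttransfer}s\n
          ----------\n
         time_total: %{time_total}s\n
```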
> I'm not sure how you're planning to show this to the admins.
This is one of the first things allies will be discussing once teammates come back from their time off.
EDIT: according to this CouchDB 3.0 post, we shouldn't see much difference in response time between the first and subsequent requests.
@jkuester @m5r - thanks for a good chat earlier today on which option to pursue! In the end, we decided that while #1 may be a slow call to respond to, we wanted to deliver an MVP of this in the form of a public API endpoint which requires authentication. It will return CSV. The CSV format is more readily usable by admins without having to parse/coerce JSON into something more readable: Google Docs/Excel will pop it right open.
This has some advantages:
cc @garethbowen
Here is a sample of the CSV output for users' device details: usersDevices-202401101731.csv
I've got a draft PR open with the changes so far. I've refactored the code to use the data export layer already in place, and I'm cleaning it up now.
That sample output is great! So other folks can follow along, can you reply back with an updated version of this table describing what each field is? I've intentionally populated them all with questions; you can respond with them as answers ;)
Column | Explanation | Sample |
---|---|---|
user | The offline user's username in the CHT? Do we capture online users too? | asc-test-1 |
deviceId | UUID of...the unique device the offline user is using? | d5b945ab-011c-4a3f-be01-550fa3eee147 |
date | Date telemetry was taken in YEAR-MONTH-DAY? Will there be one per day or multiple per day? | 2023-12-26 |
webview | Version of WebView that CHT Android is using? | 107 |
apk | CHT Android version on the device? What about if they're using a desktop browser? | v1.0.4-4 |
android | Android version the mobile handset is running? | 8.1.0 |
cht | Version of CHT the user was on at time of telemetry? How could these be different for 2 users on same instance? | 4.2.2 |
settings | ID of...something? Revision of App Settings JSON maybe? | 15-134ed3054068e36bbcf2ab440a994817 |
You got them all right :smile:
Column | Explanation | Sample |
---|---|---|
user | The user's username in the CHT. No distinction between roles/online/offline | asc-test-1 |
deviceId | UUID of the user's device | d5b945ab-011c-4a3f-be01-550fa3eee147 |
date | Date telemetry was taken in YYYY-MM-DD - at most one entry per day (relevant docs) | 2023-12-26 |
webview | Version of WebView that CHT Android is using OR version of Chrome | 107 |
apk | Version of CHT Android. `undefined` if not using CHT Android | v1.0.4-4 |
android | Version of the Android OS the mobile handset is running | 8.1.0 |
cht | Version of CHT the user was on at time of telemetry. These could be different for 2 users on the same instance if the telemetry entries are from a different time, or if a user stayed offline for a while before syncing and the instance had been upgraded in the meantime | 4.2.2 |
settings | Revision of the App Settings document | 15-134ed3054068e36bbcf2ab440a994817 |
Implementation details question: should this be gated behind a new permission, or can we piggyback off an already existing one?
For now it's behind the `can_export_feedback` permission that's also used for the export of user feedback (= detected errors and user feedback submitted via the “Report bug” feature).
Awesome - thanks for the updated table! That makes a lot of sense.
I think it should be gated behind a new permission - thanks for asking! As it may have a different set of PII (but not likely PHI), it'd be good to allow folks to assign it to whichever role they see fit, but still have the ability to split the permissions.
In the original request, there's not an explicit requirement of something like, "in the CHT Core admin app, show a dashboard that is updated every hour and loads in under 100ms". To that end, we're hoping that a well documented API that returns CSV is an acceptable MVP for this ticket.
This is meant to deliver on all the requested items, while intentionally not solving the issue that the API will likely be very slow to respond with the CSV, and thus not suitable for use directly within the CHT, as a caching mechanism would be required.
@garethbowen - Please let us know if a slow to respond API which returns CSV is acceptable here!
I think slow to respond is ok, depending on our definition of slow. I know it's been tested at 5 seconds on 5k users, but what happens with 50k users or 1m telemetry docs? As long as it still completes before timing out it's probably ok for now.
CSV is not ideal because it leaves most of the work up to the user. In particular, the issue description talks about typical questions admins might ask, eg: "which users are on an unsupported version of the browser". The CSV as it is right now doesn't actually answer that question - it just lists the current browser version with no indication of support. If the API were JSON it would be easy to provide a dashboard which shows only users with potential issues rather than a flat file of all users.
The JSON API is also useful for tooling for uses we haven't thought of, eg: app services building a CLI tool to consume it.
Thanks Gareth!
@m5r - I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format? I defer to you.
For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instance is currently ~10k users. I propose we ship this as is (but with JSON support) and optimize the code later if it doesn't scale sufficiently.
> For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instance is currently ~10k users.
Fair point. Looking at the monitoring API for a large recent instance (which shall go unnamed to protect the innocent) I can see the medic-users-meta db has ~400k docs from ~8k users. It looks like 225k of these are feedback docs, which leaves just under 200k telemetry docs. This will increase by up to 8k docs per day (one per active user).
How confident are we that the proposed solution will scale to 200k or up to approx 2m docs we expect by end of year?
Thanks for providing some real world numbers that we'll face now (or soon-ish). I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.
Regardless, I think that since we're doing a flavor of # 1 with our latest code, it's worth shipping in CHT 4.6 and then refining if we find it needs improvement. # 2 or # 3 (or other performance improvements) will very likely leverage the exact work we're doing now.
To be clear: I'm pushing to ship this as is (well +JSON). I think it's worth it! We can iterate! However, I'd very much like to hear if you disagree :ear:
> I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.
I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.
Awesome! I'll work to have some tests done before we release it and report back.
> I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format?
The data exporter service is architected in a way that doesn't give the requester any choice in the format returned: it's either CSV or JSON, and it's hardcoded for each exporter. Currently the only exporter that returns JSON is the DHIS exporter, and it's structured a bit differently than the other exporters that return CSV.
We can make it work by allowing users to pass a format parameter in the URL like `?format=json` and restructuring bits of the exporters to allow for different formats within the same exporter. If no format is specified, the exporter would default to its current default format (JSON for DHIS, CSV for the others).
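A minimal sketch of that format negotiation, assuming a hypothetical helper that each exporter would consult (these names are illustrative, not the actual data export layer's API):

```javascript
// Hypothetical helper: pick the response format from a ?format= query param,
// falling back to each exporter's own default (JSON for DHIS, CSV otherwise).
const SUPPORTED_FORMATS = ['json', 'csv'];

const resolveFormat = (query, exporterDefault) => {
  const requested = (query.format || '').toLowerCase();
  return SUPPORTED_FORMATS.includes(requested) ? requested : exporterDefault;
};

console.log(resolveFormat({ format: 'json' }, 'csv')); // 'json'
console.log(resolveFormat({}, 'csv'));                 // 'csv' (exporter default)
console.log(resolveFormat({ format: 'xml' }, 'json')); // 'json' (unsupported -> default)
```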
@m5r - thanks for the update on exporters. It is reasonable to ship only JSON if that ends up being a net smaller amount of work (sounds like it might). We can then provide easy-to-follow docs to leverage `jq` and friends to flatten the results to CSV as needed.
Please decide which path you want to pursue (just JSON or both JSON and CSV) and let us know - thanks!
(and sorry if we end up throwing away CSV efforts)
Quick update, I ended up ditching the CSV format to keep only JSON. It's a small effort to implement one or the other and like you mentioned there are plenty of tools available to easily convert JSON to CSV.
The opened PR is still using the Couch view to query and process telemetry entries but I pushed 2 WIP-ish branches to compare performances between the implementations discussed in Slack last week.
In branch `8462-raw-query`, the API processes all telemetry entries with a single couch `_allDocs` query. It's fast but not memory efficient.
In branch `8462-smart-query`, the API progressively processes telemetry entries in chunks of 1000. It should be much faster than the Couch view, a bit slower than the other implementation, but way more efficient with memory because it never keeps more than `n + 1000` (where n = number of users) docs in memory.
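The chunked approach can be sketched roughly like this, with the Couch paging stubbed out by an in-memory page fetcher; everything here (`fetchPage`, the dedup map) illustrates the idea rather than the actual branch code:

```javascript
// Sketch of the 8462-smart-query idea: page through telemetry docs 1000 at a
// time, keeping only the latest entry per user, so memory stays bounded at
// roughly n users + one page of docs.
const PAGE_SIZE = 1000;

// Stand-in for a paged _all_docs request; here we just slice an array.
const fetchPage = (allDocs, offset) => allDocs.slice(offset, offset + PAGE_SIZE);

const latestPerUser = (allDocs) => {
  const latest = new Map(); // user -> latest telemetry summary
  for (let offset = 0; ; offset += PAGE_SIZE) {
    const page = fetchPage(allDocs, offset);
    if (!page.length) {
      break; // no more pages
    }
    for (const doc of page) {
      const current = latest.get(doc.user);
      if (!current || doc.date > current.date) {
        latest.set(doc.user, doc);
      }
    }
  }
  return latest;
};

// 2500 docs across 3 users; only 3 summaries survive in memory.
const docs = [];
for (let i = 0; i < 2500; i++) {
  docs.push({
    user: 'user-' + (i % 3),
    date: '2023-12-' + String((i % 28) + 1).padStart(2, '0'),
  });
}
console.log(latestPerUser(docs).size); // 3
```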
With all this, we should be able to make a decision on which solution we'll go with for this feature to properly support large scale deployments.
Thanks for the update @m5r! I suspect we may be forced to go the `smart-query` route because if someone has 10k users we may exceed available memory, yes? I defer to you.
As well: @jkuester - On your test instance in EKS, I think you should be able to easily switch between @m5r's ECR images from the branch builds, as they're close enough that CHT won't care that you switched versions without re-loading all the data in, warming views, etc.
Adding this to 4.6 milestone in case it's ready really quickly - but @jkuester feel free to move it to 4.7 if it's not ready in time - no biggie!
> I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.

> I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.
Hey all! Noting that thanks to @jkuester and the TDG, we got some good data (see below). @garethbowen commented on Slack that this data validates that this ticket as-is is good to go in 4.6 or 4.7 when ready \o/
Users | Seconds | Minutes |
---|---|---|
10 | 1 | 0.02 |
100 | 5 | 0.08 |
1,000 | 44 | 0.73 |
5,000 | 213 | 3.55 |
10,000 | 511 | 8.52 |
20,000 | 1300* | 21.67* |
* 20k user data extrapolated.
For testing, each user got 200 telemetry docs, a 1:200 ratio for 2 million docs total. This ratio was derived from analyzing production CHT instances using Watchdog and noting that the highest ratio seen in the wild was 1:191.
Here's the ratio table for reference, with CHT URLs removed:
Version | Feedback Docs | _users Docs | medic-users-meta Docs | Meta to User Ratio | Telemetry entries | Telemetry to User ratio |
---|---|---|---|---|---|---|
4.5.0.6922454971 | 1464 | 108 | 2877 | 26.6 | 1413 | 13.1 |
4.2.2.5489334896 | 834 | 194 | 2793 | 14.4 | 1959 | 10.1 |
4.2.2.5489334896 | 37921 | 11365 | 64370 | 5.66 | 26449 | 2.33 |
4.2.2.5489334896 | 25284 | 12584 | 47733 | 3.79 | 22449 | 1.78 |
4.4.0.6255863600 | 82917 | 28750 | 162834 | 5.66 | 79917 | 2.78 |
3.14.2 | 26467 | 872 | 33525 | 38.4 | 7058 | 8.09 |
3.17.1 | 54 | 101 | 576 | 5.7 | 522 | 5.17 |
3.17.1 | 241 | 108 | 1268 | 11.7 | 1027 | 9.51 |
4.2.0.5075680573 | 383 | 1569 | 9216 | 5.87 | 8833 | 5.63 |
4.2.2.5489334896 | 13652 | 4161 | 72211 | 17.4 | 58559 | 14.1 |
4.5.0.6922454971 | 4341 | 612 | 8285 | 13.5 | 3944 | 6.44 |
4.5.0.6922454971 | 879 | 117 | 1809 | 15.5 | 930 | 7.95 |
4.2.2.5489334896 | 120 | 2887 | 5528 | 1.91 | 5408 | 1.87 |
4.2.0.5075680573 | 205 | 579 | 1248 | 2.16 | 1043 | 1.8 |
3.15.0-FR-bulk-user-upload | 273 | 331 | 1320 | 3.99 | 1047 | 3.16 |
4.2.4.6465855007 | 1159 | 3232 | 9703 | 3 | 8544 | 2.64 |
3.15.0 | 120843 | 2813 | 223868 | 79.6 | 103025 | 36.6 |
3.15.0 | 34175 | 2975 | 53244 | 17.9 | 19069 | 6.41 |
3.17.1 | 410 | 135 | 1970 | 14.6 | 1560 | 11.6 |
3.15.0 | 37 | 161 | 4947 | 30.7 | 4910 | 30.5 |
3.15.0 | 10716538 | 574 | 10828326 | 18865 | 111788 | 195 |
3.14.2 | 35383 | 828 | 135304 | 163 | 99921 | 121 |
3.15.0 | 286 | 167 | 12596 | 75.4 | 12310 | 73.7 |
3.15.0 | 10715318 | 583 | 10774344 | 18481 | 59026 | 101 |
@garethbowen it struck me today that currently our map/reduce logic is resolving the latest telemetry document for a user. However, if a user is regularly switching between multiple devices (e.g. a phone and the browser on a computer) then the data returned could be incomplete (you would only see the laptop or the phone, not both). Should we actually be reducing down to a key of `${user}-${deviceId}`? The downside of doing that is you would get historical results for every device that a user ever had (unless we added some date filtering...).
Including multiple devices would be ideal, but multiple device users aren't common enough that we need to hold up this workflow if it's not an easy improvement. Historical data might actually be useful here, for example, for confirming when a device was replaced, and that the old one was "decommissioned" (is no longer connecting).
I think it's worth a try but don't spend too long on it.
Also consider using an array rather than concatenating the key (ie: `[user, deviceId]` rather than `${user}-${deviceId}`). Then you can use `group_level` to choose whether you want to get the latest for the user or the user+device (docs).
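For illustration, `group_level` semantics over array keys can be mimicked in plain Node (this is a sketch of what CouchDB's grouping does with the latest-date reduce, not actual CouchDB code):

```javascript
// Mimic CouchDB's group_level over [user, deviceId] array keys:
// group_level=1 groups by user only; group_level=2 groups by user+device.
const rows = [
  { key: ['gareth', 'phone-1'], value: { date: '2023-12-01' } },
  { key: ['gareth', 'laptop-1'], value: { date: '2023-12-26' } },
  { key: ['m5r', 'phone-2'], value: { date: '2023-11-30' } },
];

const groupLatest = (rows, groupLevel) => {
  const groups = new Map();
  for (const row of rows) {
    // Truncate the array key to groupLevel elements, like CouchDB does.
    const groupKey = JSON.stringify(row.key.slice(0, groupLevel));
    const current = groups.get(groupKey);
    if (!current || row.value.date > current.date) {
      groups.set(groupKey, row.value);
    }
  }
  return groups;
};

console.log(groupLatest(rows, 1).size); // 2 -> one latest entry per user
console.log(groupLatest(rows, 2).size); // 3 -> one per user+device pair
```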
:+1: Good call on the array! I have updated the map/reduce to do this. I guess for now I will leave the `user-devices` endpoint to not set anything for the `group_level` and instead return records for each user,deviceId pair. But, in the future we could come back and support parameters here for that API to allow choosing the `group_level`...
**Is your feature request related to a problem? Please describe.** Technical administrators should be able to find devices which are unsupported or likely to be slow so that they can proactively replace or upgrade them and keep users engaged and productive.
**Describe the solution you'd like** Provide a feature to show administrators key information about their users and their devices. This could be done in the admin app, or a dashboard, a Mobile Device Management service, or through Watchdog (or some combination of the above).
Some ideas for information to show are...
The critical thing is for this to be actionable. Some things admins should be working on are:
Remember that users can have multiple devices and we should list them all.
**Describe alternatives you've considered** This could be managed outside of the CHT, but we already collect much of this information, so it makes sense to expose it somewhere.