medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
438 stars 209 forks source link

Expose user device details to admins #8462

Closed garethbowen closed 6 months ago

garethbowen commented 1 year ago

Is your feature request related to a problem? Please describe. Technical administrators should be able to find devices which are unsupported or likely to be slow so that they can proactively replace or upgrade them and keep users engaged and productive.

Describe the solution you'd like Provide a feature to show administrators key information about their users and their devices. This could be done in the admin app, or a dashboard, a Mobile Device Management service, or through Watchdog (or some combination of the above).

Some ideas for information to show are...

The critical thing is for this to be actionable. Some things admins should be working on are,

Remember that users can have multiple devices and we should list all.

Describe alternatives you've considered This could be managed outside of the CHT but we already collect much of this information, so it makes sense to expose it somewhere.

m5r commented 8 months ago

We can retrieve all of this information except "Last synced time" from the telemetry data gathered from users' devices. To build this kind of bird's eye view of a deployment's devices, we need to get each device's last telemetry entry and compile that data to be consumed by the frontend. It's fair to assume that we will work with thousands of telemetry entries so depending on how that data will be consumed, we are probably going to need to process the data in advance so that queries stay fast.

With this in mind, I see 3 potential solutions right off the bat:

I briefly experimented with the first two solutions in this branch. With a dataset of 5200 telemetry entries, the telemetry entries query would take anywhere between 1 and 2 seconds despite trying out a few couch optimizations like indexing on the type property and making a view. I didn't have the capacity to quickly throw something together for the third solution, I figured we should discuss where we want to go next before investing too much time in any solution. No matter which solution we go for, I believe we should have some form of data preprocessing to keep the "give me the data" query fast and preserve couch resources for more time-sensitive stuff like syncing.

That's all I can think of right now but we will have a clearer idea of the implementation as we dive into it and get more specific with what/how we want to display the data

garethbowen commented 8 months ago

@m5r Great writeup, thanks!

I definitely prefer the first solution if we can pull it off. Essentially this offloads the complexity of having a schedule to CouchDB to manage.

I had a quick look at your experimental branch and I think with some more advanced map/reduce features it can be improved and made efficient.

Try this out...

map function:

We need to emit the username and the date so we can group by the username and reduce the results to the latest date:

function(doc) {
 if (doc.type === 'telemetry') {
   emit(doc.metadata.user, {
    date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
    id: doc._id
  });
 } 
}

reduce function:

Now that we have that we need to define a reduce function which can go over all emissions for a user and find the latest...

function (keys, values) {
  var latest = { date: '0000-00-00'};
  values.forEach(function(value) {
    if (value.date > latest.date) {
      latest = value;
    }
  });
  return latest;
}

queries

I tested these queries using the browser URL but you can do the same with PouchDB APIs for the actual code.

We can query using the group param to get the list of the latest docs for each user:

http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true

Or we can query for the latest doc for one or more users:

http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&key="gareth"
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&keys=["gareth","m5r"]

This is all untested so you'll need to make changes to get it to work the way you want. Let me know if I can help explain how I got to this conclusion and/or why I think it's the best solution.

m5r commented 8 months ago

That's fantastic @garethbowen, I tried to articulate this into code that satisfies couchdb views' constraints but I struggled to come up with a reduce function that actually worked. It's super neat to get any user's last device infos and it's pretty fast for this scenario, the query takes ~60ms when querying a single user's data

It's still kind of slow when we query the whole database, it takes ~4.1s and that's without including the actual documents. Simply adding ?include_docs=true doesn't work and return this error {"error":"query_parse_error","reason":"include_docsis invalid for reduce"}. I'm still learning how to work with couch views but my understanding is that we either need to query the docs separately or add the document to the indexed view like so:

 emit(doc.metadata.user, {
  date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
  id: doc._id,
  doc: doc, // <==
});

This is not ideal because it makes the index much bigger than it should be and we might need to tweak couch config to make it work.

I love how easy it was to tweak the map function to make it include all devices used by a user:

function(doc) {
  if (doc.type === 'telemetry') {
    emit([doc.metadata.user, doc.metadata.deviceId], {
      date: `${doc.metadata.year}-${doc.metadata.month}-${doc.metadata.day}`,
      id: doc._id,
    });
  }
}

NB: my dataset is roughly 5200 telemetry entries from 1100 users and 1700 devices.

garethbowen commented 8 months ago

It's still kind of slow when we query the whole database, it takes ~4.1s

That's a little surprising to me. Is that every time or just the first time? CouchDB caches results so the second time should be very quick.

Also, how big is the response body?

I'm not sure how you're planning to show this to the admins. If we're putting it in the admin app > users page then we should think about paginating it (showing 100k users on one page won't work!). So possibly you'd only request one page of telemetry docs using keys=[ ... ] at a time would be fine.

This is not ideal because it makes the index much bigger

Instead of including the entire doc which will destroy the index, it would be kinder to CouchDB to just include the few fields we're interested in. This might be a good compromise between the two.

m5r commented 8 months ago

Is that every time or just the first time? CouchDB caches results so the second time should be very quick.

There is a slight improvement on subsequent requests but it stays above the 4 seconds mark for a response body of 148KB. Here are the timings of 3 requests right after re-creating the view:

~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
     time_namelookup:  0.000024s
        time_connect:  0.000133s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000187s
       time_redirect:  0.000000s
  time_starttransfer:  0.089156s
                     ----------
          time_total:  4.493183s
~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
     time_namelookup:  0.000024s
        time_connect:  0.000107s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000135s
       time_redirect:  0.000000s
  time_starttransfer:  0.083579s
                     ----------
          time_total:  4.086619s
~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
     time_namelookup:  0.000023s
        time_connect:  0.000121s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000157s
       time_redirect:  0.000000s
  time_starttransfer:  0.087817s
                     ----------
          time_total:  4.096396s

I'm not sure how you're planning to show this to the admins.

This is one of the first things allies will be discussing once teammates come back from their time off

EDIT: according to this couchdb 3.0 post, we shouldn't see much difference in response time between the first and subsequent requests

mrjones-plip commented 8 months ago

@jkuester @m5r - thanks for a good to chat on which option to pursue earlier today! In the end, we decided that while #1 may be a slow call to respond to, we wanted to deliver an MVP of this in the form of a public API endpoint which requires authentication. It will return CSV. The CSV format is more readily usable by admins without having to parse/coerce JSON into something more readable - google docs/excel will pop it right open.

This has some advantages:

cc @garethbowen

m5r commented 8 months ago

Here is a sample of the CSV output for users devices details: usersDevices-202401101731.csv

I've got a draft PR open with the changes so far, I've refactored the code to use the data export layer already in place and I'm cleaning it up now

mrjones-plip commented 8 months ago

That sample output is great! So other folks can follow along, can you replay back with an updated version of this table describing which each field is? I've intentionally populated them all with questions, you can respond with them as answers ;)

Column Explanation Sample
user The offline user's username in the CHT? Do we capture online users too? asc-test-1
deviceId UUID of...the unique device the offline user is using? d5b945ab-011c-4a3f-be01-550fa3eee147
date Date telemetry was taken in YEAR-MONTH-DAY? Will there be one per day multiple per day? 2023-12-26
webview Version of WebView that CHT Android is using? 107
apk CHT Android version on the device? What about if they're using a desktop browser? v1.0.4-4
android Android version the mobile handset is running? 8.1.0
cht Version of CHT the user was on at time of telemetry? How could these be different for 2 users on same instance? 4.2.2
settings ID of...something? Revision of App Settings JSON maybe? 15-134ed3054068e36bbcf2ab440a994817
m5r commented 8 months ago

You got them all right :smile:

Column Explanation Sample
user The user's username in the CHT. No distinction between roles/online/offline asc-test-1
deviceId UUID of the user's device d5b945ab-011c-4a3f-be01-550fa3eee147
date Date telemetry was taken in YYYY-MM-DD - at most one entry per day (relevant docs) 2023-12-26
webview Version of WebView that CHT Android is using OR version of Chrome 107
apk Version of CHT Android. undefined if not using CHT Android v1.0.4-4
android Version of the Android OS the mobile handset is running 8.1.0
cht Version of CHT the user was on at time of telemetry. These could be different for 2 users on same instance if the telemetry entries are from a different time, or if the user stayed offline for a while before syncing and the instance had been upgraded in the meantime 4.2.2
settings Revision of the App Settings document 15-134ed3054068e36bbcf2ab440a994817
m5r commented 8 months ago

Implementation details question, should this be gated behind a new permission or can we piggy back off an already existing one? For now it's behind the can_export_feedback permission that's also used for the export of user feedback (= detected errors and user feedback submitted via the “Report bug”)

mrjones-plip commented 8 months ago

Awesome - thanks for the updated table! That makes a lot of sense.

I think it should be gated behind a new permission - thanks for asking! As it may have a different set up PII (but not likely PHI), it'd be good to allow folks to assign to which ever role the see fit, but still have the ability to split them.

mrjones-plip commented 8 months ago

In the original request, there's not an explicit requirement of something like, "in the CHT Core admin app, show a dashboard that is updated every hour and loads in under 100ms". To that end, we're hoping that a well documented API that returns CSV is an acceptable MVP for this ticket.

This is meant to deliver on all the requested items, while intentionally not solving the issue that the API will likely be very slow to respond with the CSV, thus not suitable for use directly within the CHT as a caching mechanism would be required.

@garethbowen - Please let us know if a slow to respond API which returns CSV is acceptable here!

garethbowen commented 8 months ago

I think slow to respond is ok, depending on our definition of slow. I know it's been tested at 5 seconds on 5k users, but what happens with 50k users or 1m telemetry docs? As long as it still completes before timing out it's probably ok for now.

CSV is not ideal because it's leaving most of the work up to the user. In particular the issue description talks about typical questions admins might ask, eg: "which users are on an unsupported version of the browser". The CSV as it is right now doesn't actually answer that question - it just lists the current browser version with no indication on support. If the API were JSON it would be easy to provide a dashboard which shows only users with potential issues rather than flat file of all users.

The JSON API is also useful for tooling for uses we haven't thought of, eg: app services building a CLI tool to consume it.

mrjones-plip commented 8 months ago

Thanks Gareth!

@m5r - I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format? I defer to you.

For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instances is curretnly ~10k user. I propose we ship this as is (but with JSON support) and optimize the code later if it doesn't scale sufficiently.

garethbowen commented 8 months ago

For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instances is curretnly ~10k user.

Fair point. Looking at the monitoring api for a large recent instance (which shall go unnamed to protect the innocent) I can see the medic-users-meta db has ~400k docs from ~8k users. It looks like 225k of these are feedback docs, which leaves just under 200k telemetry docs. This will increase by up to 8k docs per day (one per active user).

How confident are we that the proposed solution will scale to 200k or up to approx 2m docs we expect by end of year?

mrjones-plip commented 8 months ago

Thanks for providing some real world numbers that we'll face now (or soon-ish). I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.

Regardless, I think that since we're doing a flavor of # 1 with our latest code, it's worth shipping in CHT 4.6 and the refining if we find it needs improvement. # 2 or # 3 (or other performance improvements) will very likely leverage the exact work we're doing now.

To be clear: I'm pushing to ship this as is (well +JSON). I think it's worth it! We can iterate! However, I'd very much like to hear if you disagree :ear:

garethbowen commented 8 months ago

I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.

I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.

mrjones-plip commented 8 months ago

Awesome! I'll work to have some tests done before we release it and report back.

m5r commented 7 months ago

I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format?

The data exporter service is architectured in a way that doesn't give the requester any choice in the format returned, it's either CSV or JSON and it's hardcoded for each exporter. Currently the only exporter that returns JSON is the DHIS exporter and it's structured a bit differently than the other exporters that return CSV.

We can make it work by allowing users to pass a format parameter in the url like ?format=json and restructure bits of the exporters to allow for different formats within the same exporter. If no format is specified, the exporter would default to its current default format (JSON for DHIS, CSV for theothers).

mrjones-plip commented 7 months ago

@m5r - thanks for the update on exporters. It is reasonable to ship only JSON if that ends up being a net less amount of work (sounds like it might). We can then provide easy to follow docs to leverage jq and friends to flatten the results to CSV as needed.

Please decide which path you want to pursue (just JSON or both JSON and CSV) and let us know - thanks!

(and sorry if we end up throwing away CSV efforts)

m5r commented 7 months ago

Quick update, I ended up ditching the CSV format to keep only JSON. It's a small effort to implement one or the other and like you mentioned there are plenty of tools available to easily convert JSON to CSV.

The opened PR is still using the Couch view to query and process telemetry entries but I pushed 2 WIP-ish branches to compare performances between the implementations discussed in Slack last week.
In branch 8462-raw-query, the API processes all telemetry entries with a single couch _allDocs query. It's fast but not memory efficient. In branch 8462-smart-query, the API progressively processes telemetry entries in chunks of 1000. It should be much faster than the Couch view, a bit slower than the other implementation but way more efficient with memory because it never keeps more than n + 1000 (where n = number of users) docs in memory.

With all this, we should be able to make a decision on which solution we'll go for this feature to properly support large scale deployments.

mrjones-plip commented 7 months ago

Thanks for the update @m5r ! I suspect we may be forced to go the smart-query route because if some one has 10k users we may exceed available memory yes? I defer to you.

As well: @jkuester - On your test instance in EKS, I think you should be able to easily switch between @m5r's ECR images from the branch builds as they're close enough that CHT won't care that you switched versions without re-loading all the data in, warming views etc.

mrjones-plip commented 7 months ago

Adding this to 4.6 milestone in case it's ready really quickly - but @jkuester feel free to move it to 4.7 if it's not ready in time - no biggie!

mrjones-plip commented 7 months ago

I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.

I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.

Hey all! Noting that thanks to @jkuester and the TDG, we got some good data (see below). @garethbowen commented on slack that this data validates this ticket as is is good to go in 4.6 or 4.7 when ready \o/


Time to run API against a CHT instance

Users Seconds Minutes
10 1 0.02
100 5 0.08
1,000 44 0.73
5,000 213 3.55
10,000 511 8.52
20,000 1300* 21.67*

* 20k user data extrapolated.

image

For testing, each user got 200 telemetry docs, a 1:200 ratio for 2 million docs total. This ratio was derived from analyzing production CHT instances using Watchdog and noting the highest ration seen in the wild was 1:191.

Here's the ratio table for reference, with CHT URls removed:

Version Feedback Docs _users Docs medic-users-meta Docs Meta to User Ratio Telemetry entries Telemetry to User ratio
4.5.0.6922454971 1464 108 2877 26.6 1413 13.1
4.2.2.5489334896 834 194 2793 14.4 1959 10.1
4.2.2.5489334896 37921 11365 64370 5.66 26449 2.33
4.2.2.5489334896 25284 12584 47733 3.79 22449 1.78
4.4.0.6255863600 82917 28750 162834 5.66 79917 2.78
3.14.2 26467 872 33525 38.4 7058 8.09
3.17.1 54 101 576 5.7 522 5.17
3.17.1 241 108 1268 11.7 1027 9.51
4.2.0.5075680573 383 1569 9216 5.87 8833 5.63
4.2.2.5489334896 13652 4161 72211 17.4 58559 14.1
4.5.0.6922454971 4341 612 8285 13.5 3944 6.44
4.5.0.6922454971 879 117 1809 15.5 930 7.95
4.2.2.5489334896 120 2887 5528 1.91 5408 1.87
4.2.0.5075680573 205 579 1248 2.16 1043 1.8
3.15.0-FR-bulk-user-upload 273 331 1320 3.99 1047 3.16
4.2.4.6465855007 1159 3232 9703 3 8544 2.64
3.15.0 120843 2813 223868 79.6 103025 36.6
3.15.0 34175 2975 53244 17.9 19069 6.41
3.17.1 410 135 1970 14.6 1560 11.6
3.15.0 37 161 4947 30.7 4910 30.5
3.15.0 10716538 574 10828326 18865 111788 195
3.14.2 35383 828 135304 163 99921 121
3.15.0 286 167 12596 75.4 12310 73.7
3.15.0 10715318 583 10774344 18481 59026 101
jkuester commented 7 months ago

@garethbowen it struck me today that currently our map-reduce logic is resolving the latest telemetry document for a user. However, if a user is regularly switching between multiple devices (e.g. a phone and the browser on a computer) then the data returned could be incomplete (you would only see the laptop or the phone, not both). Should we actually be reducing down to a key of ${user}-${deviceId}? The downside of doing that is you would get historical results for every device that a user ever had (unless we added some date filtering...(.

garethbowen commented 6 months ago

Including multiple devices would be ideal, but multiple device users aren't common enough that we need to hold up this workflow if it's not an easy improvement. Historical data might actually be useful here, for example, for confirming when a device was replaced, and that the old one was "decommissioned" (is no longer connecting).

I think it's worth a try but don't spend too long on it.

Also consider using an array rather than concatenating the key (ie: [user, deviceId] rather than ${user}-${deviceId}). Then you can use group_level to choose whether you want to get the latest for the user or the user+device (docs).

jkuester commented 6 months ago

:+1: Good call on the array! I have updated the map/reduce to do this. I guess for now I will leave the user-devices endpoint to not set anything for the group_level and instead return records for each user,deviceId pair. But, in the future we could come back and support parameters here for that API to allow choosing the group_level....