Closed garethbowen closed 6 months ago
We can retrieve all of this information except "Last synced time" from the telemetry data gathered from users' devices. To build this kind of bird's eye view of a deployment's devices, we need to get each device's last telemetry entry and compile that data to be consumed by the frontend. It's fair to assume that we will work with thousands of telemetry entries so depending on how that data will be consumed, we are probably going to need to process the data in advance so that queries stay fast.
With this in mind, I see 3 potential solutions right off the bat:
I briefly experimented with the first two solutions in this branch.
With a dataset of 5200 telemetry entries, the telemetry entries query would take anywhere between 1 and 2 seconds, despite trying out a few couch optimizations like indexing on the `type` property and making a view.
I didn't have the capacity to quickly throw something together for the third solution, so I figured we should discuss where we want to go next before investing too much time in any one solution.
No matter which solution we go for, I believe we should have some form of data preprocessing to keep the "give me the data" query fast and preserve couch resources for more time-sensitive stuff like syncing.
That's all I can think of right now, but we will have a clearer idea of the implementation as we dive into it and get more specific about what/how we want to display the data.
@m5r Great writeup, thanks!
I definitely prefer the first solution if we can pull it off. Essentially this offloads the complexity of having a schedule to CouchDB to manage.
I had a quick look at your experimental branch and I think with some more advanced map/reduce features it can be improved and made efficient.
Try this out...
We need to emit the username and the date so we can group by the username and reduce the results to the latest date:
```js
function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id
    });
  }
}
```
Now that we have that we need to define a reduce function which can go over all emissions for a user and find the latest...
```js
function (keys, values) {
  var latest = { date: '0000-00-00' };
  values.forEach(function (value) {
    if (value.date > latest.date) {
      latest = value;
    }
  });
  return latest;
}
```
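For anyone wanting to sanity-check this map/reduce outside CouchDB, here's a small Node harness over hypothetical telemetry docs (the sample data and the local `emit` collector are illustrative only, not part of the proposed design):

```javascript
// Minimal harness to exercise the proposed map/reduce outside CouchDB.
// The sample docs and the emit collector are illustrative only.
const emitted = [];
const emit = (key, value) => emitted.push({ key, value });

const map = function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id
    });
  }
};

const reduce = function (keys, values) {
  var latest = { date: '0000-00-00' };
  values.forEach(function (value) {
    if (value.date > latest.date) {
      latest = value;
    }
  });
  return latest;
};

const docs = [
  { _id: 'a', type: 'telemetry', metadata: { user: 'gareth', year: 2023, month: 11, day: 2 } },
  { _id: 'b', type: 'telemetry', metadata: { user: 'gareth', year: 2023, month: 12, day: 26 } },
  { _id: 'c', type: 'feedback', metadata: { user: 'gareth' } } // ignored by the map
];
docs.forEach(map);

const latest = reduce(emitted.map(e => [e.key, e.value.id]), emitted.map(e => e.value));
console.log(latest.id); // 'b' (the most recent entry wins)
```

One caveat worth noting: because month and day are concatenated without zero-padding, the lexicographic comparison can mis-order dates (e.g. `'2023-2-1'` sorts after `'2023-12-1'`); zero-padding the components would make the comparison safe.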
I tested these queries using the browser URL but you can do the same with PouchDB APIs for the actual code.
We can query using the `group` param to get the list of the latest docs for each user:
```
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true
```
Or we can query for the latest doc for one or more users:
```
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&key="gareth"
http://localhost:5988/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true&keys=["gareth","m5r"]
```
This is all untested so you'll need to make changes to get it to work the way you want. Let me know if I can help explain how I got to this conclusion and/or why I think it's the best solution.
That's fantastic @garethbowen! I had tried to articulate this into code that satisfies CouchDB views' constraints but struggled to come up with a reduce function that actually worked. It's super neat to get any user's last device info, and it's pretty fast for this scenario: the query takes ~60ms when querying a single user's data.
It's still kind of slow when we query the whole database: it takes ~4.1s, and that's without including the actual documents. Simply adding `?include_docs=true` doesn't work and returns this error: `{"error":"query_parse_error","reason":"include_docs is invalid for reduce"}`. I'm still learning how to work with couch views, but my understanding is that we either need to query the docs separately or add the document to the indexed view like so:
```js
emit(doc.metadata.user, {
  date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
  id: doc._id,
  doc: doc, // <==
});
```
This is not ideal because it makes the index much bigger than it should be and we might need to tweak couch config to make it work.
I love how easy it was to tweak the map function to make it include all devices used by a user:
```js
function (doc) {
  if (doc.type === 'telemetry') {
    emit([doc.metadata.user, doc.metadata.deviceId], {
      date: `${doc.metadata.year}-${doc.metadata.month}-${doc.metadata.day}`,
      id: doc._id,
    });
  }
}
```
NB: my dataset is roughly 5200 telemetry entries from 1100 users and 1700 devices.
> It's still kind of slow when we query the whole database, it takes ~4.1s
That's a little surprising to me. Is that every time or just the first time? CouchDB caches results so the second time should be very quick.
Also, how big is the response body?
I'm not sure how you're planning to show this to the admins. If we're putting it in the admin app > users page then we should think about paginating it (showing 100k users on one page won't work!). So possibly requesting only one page of telemetry docs at a time using `keys=[ ... ]` would be fine.
> This is not ideal because it makes the index much bigger
Instead of including the entire doc which will destroy the index, it would be kinder to CouchDB to just include the few fields we're interested in. This might be a good compromise between the two.
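A sketch of that compromise: emit a small summary object with just the fields we care about, so `?include_docs=true` is unnecessary and the index stays lean. The `device.*` field names below are assumptions for illustration, not the actual telemetry schema:

```javascript
// Sketch: emit only the handful of fields the dashboard needs, rather than
// the whole doc. The field names under `device` are assumptions here.
const rows = [];
const emit = (key, value) => rows.push({ key, value });

const map = function (doc) {
  if (doc.type === 'telemetry') {
    emit(doc.metadata.user, {
      date: doc.metadata.year + '-' + doc.metadata.month + '-' + doc.metadata.day,
      id: doc._id,
      // Just the summary fields, not `doc: doc`, which keeps the index small.
      deviceId: doc.metadata.deviceId,
      webview: doc.device && doc.device.webview,
      apk: doc.device && doc.device.apk,
    });
  }
};

map({
  _id: 'telemetry-1',
  type: 'telemetry',
  metadata: { user: 'm5r', deviceId: 'd5b945ab', year: 2023, month: 12, day: 26 },
  device: { webview: '107', apk: 'v1.0.4-4' },
});

console.log(rows[0].value.webview); // '107', without indexing the full doc
```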
> Is that every time or just the first time? CouchDB caches results so the second time should be very quick.
There is a slight improvement on subsequent requests, but it stays above the 4-second mark for a response body of 148KB. Here are the timings of 3 requests right after re-creating the view:
```
~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000024s
       time_connect: 0.000133s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000187s
      time_redirect: 0.000000s
 time_starttransfer: 0.089156s
          ----------
         time_total: 4.493183s

~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000024s
       time_connect: 0.000107s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000135s
      time_redirect: 0.000000s
 time_starttransfer: 0.083579s
          ----------
         time_total: 4.086619s

~ curl -w "@$HOME/curl_format" -o /dev/null -s "http://admin:groot@localhost:5984/medic-users-meta/_design/users-meta/_view/device_telemetry_entries?group=true"
    time_namelookup: 0.000023s
       time_connect: 0.000121s
    time_appconnect: 0.000000s
   time_pretransfer: 0.000157s
      time_redirect: 0.000000s
 time_starttransfer: 0.087817s
          ----------
         time_total: 4.096396s
```
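For reference, the `curl_format` file passed to `-w` above isn't included in the thread; a write-out template producing that shape of output would look something like the following (all variables are standard curl `--write-out` variables):

```
    time_namelookup: %{time_namelookup}s\n
       time_connect: %{time_connect}s\n
    time_appconnect: %{time_appconnect}s\n
   time_pretransfer: %{time_pretransfer}s\n
      time_redirect: %{time_redirect}s\n
 time_starttransfer: %{time_starttransfer}s\n
          ----------\n
         time_total: %{time_total}s\n
```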
> I'm not sure how you're planning to show this to the admins.
This is one of the first things allies will be discussing once teammates come back from their time off.
EDIT: according to this CouchDB 3.0 post, we shouldn't see much difference in response time between the first and subsequent requests.
@jkuester @m5r - thanks for a good chat earlier today on which option to pursue! In the end, we decided that while #1 may be a slow call to respond to, we wanted to deliver an MVP of this in the form of a public API endpoint which requires authentication. It will return CSV. The CSV format is more readily usable by admins without having to parse/coerce JSON into something more readable: Google Docs/Excel will pop it right open.
This has some advantages:
cc @garethbowen
Here is a sample of the CSV output for users' device details: usersDevices-202401101731.csv
I've got a draft PR open with the changes so far. I've refactored the code to use the data export layer already in place, and I'm cleaning it up now.
That sample output is great! So other folks can follow along, can you reply back with an updated version of this table describing what each field is? I've intentionally populated them all with questions; you can respond with them as answers ;)
Column | Explanation | Sample |
---|---|---|
user | The offline user's username in the CHT? Do we capture online users too? | asc-test-1 |
deviceId | UUID of...the unique device the offline user is using? | d5b945ab-011c-4a3f-be01-550fa3eee147 |
date | Date telemetry was taken in YEAR-MONTH-DAY? Will there be one per day or multiple per day? | 2023-12-26 |
webview | Version of WebView that CHT Android is using? | 107 |
apk | CHT Android version on the device? What about if they're using a desktop browser? | v1.0.4-4 |
android | Android version the mobile handset is running? | 8.1.0 |
cht | Version of CHT the user was on at time of telemetry? How could these be different for 2 users on same instance? | 4.2.2 |
settings | ID of...something? Revision of App Settings JSON maybe? | 15-134ed3054068e36bbcf2ab440a994817 |
You got them all right :smile:
Column | Explanation | Sample |
---|---|---|
user | The user's username in the CHT. No distinction between roles/online/offline | asc-test-1 |
deviceId | UUID of the user's device | d5b945ab-011c-4a3f-be01-550fa3eee147 |
date | Date telemetry was taken in YYYY-MM-DD - at most one entry per day (relevant docs) | 2023-12-26 |
webview | Version of WebView that CHT Android is using OR version of Chrome | 107 |
apk | Version of CHT Android. `undefined` if not using CHT Android | v1.0.4-4 |
android | Version of the Android OS the mobile handset is running | 8.1.0 |
cht | Version of CHT the user was on at time of telemetry. These could be different for 2 users on the same instance if the telemetry entries are from a different time, or if a user stayed offline for a while before syncing and the instance had been upgraded in the meantime | 4.2.2 |
settings | Revision of the App Settings document | 15-134ed3054068e36bbcf2ab440a994817 |
Implementation details question: should this be gated behind a new permission, or can we piggyback off an already existing one?
For now it's behind the `can_export_feedback` permission that's also used for the export of user feedback (= detected errors and user feedback submitted via the “Report bug” feature).
Awesome - thanks for the updated table! That makes a lot of sense.
I think it should be gated behind a new permission - thanks for asking! As it may have a different set of PII (but not likely PHI), it'd be good to allow folks to assign it to whichever role they see fit, but still have the ability to split the permissions.
In the original request, there's not an explicit requirement of something like, "in the CHT Core admin app, show a dashboard that is updated every hour and loads in under 100ms". To that end, we're hoping that a well documented API that returns CSV is an acceptable MVP for this ticket.
This is meant to deliver on all the requested items, while intentionally not solving the issue that the API will likely be very slow to respond with the CSV, and thus not suitable for use directly within the CHT, as a caching mechanism would be required.
@garethbowen - Please let us know if a slow to respond API which returns CSV is acceptable here!
I think slow to respond is ok, depending on our definition of slow. I know it's been tested at 5 seconds on 5k users, but what happens with 50k users or 1m telemetry docs? As long as it still completes before timing out it's probably ok for now.
CSV is not ideal because it leaves most of the work up to the user. In particular, the issue description talks about typical questions admins might ask, eg: "which users are on an unsupported version of the browser". The CSV as it is right now doesn't actually answer that question - it just lists the current browser version with no indication of support. If the API were JSON it would be easy to provide a dashboard which shows only users with potential issues rather than a flat file of all users.
The JSON API is also useful for tooling for uses we haven't thought of, eg: app services building a CLI tool to consume it.
Thanks Gareth!
@m5r - I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format? I defer to you.
For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instance is currently ~10k users. I propose we ship this as is (but with JSON support) and optimize the code later if it doesn't scale sufficiently.
> For the 5k vs 50k users - I'm going to push back on this. The largest known CHT instance is currently ~10k users.
Fair point. Looking at the monitoring API for a large recent instance (which shall go unnamed to protect the innocent) I can see the medic-users-meta db has ~400k docs from ~8k users. It looks like 225k of these are feedback docs, which leaves just under 200k telemetry docs. This will increase by up to 8k docs per day (one per active user).
How confident are we that the proposed solution will scale to 200k or up to approx 2m docs we expect by end of year?
Thanks for providing some real world numbers that we'll face now (or soon-ish). I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.
Regardless, I think that since we're doing a flavor of # 1 with our latest code, it's worth shipping in CHT 4.6 and then refining if we find it needs improvement. # 2 or # 3 (or other performance improvements) will very likely leverage the exact work we're doing now.
To be clear: I'm pushing to ship this as is (well +JSON). I think it's worth it! We can iterate! However, I'd very much like to hear if you disagree :ear:
> I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.
I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.
Awesome! I'll work to have some tests done before we release it and report back.
> I suspect it's easy enough to return JSON instead of CSV - can you confirm? Possibly we could let the user specify the desired format?
The data exporter service is architected in a way that doesn't give the requester any choice in the format returned: it's either CSV or JSON, and it's hardcoded for each exporter. Currently the only exporter that returns JSON is the DHIS exporter, and it's structured a bit differently than the other exporters that return CSV.
We can make it work by allowing users to pass a format parameter in the URL like `?format=json` and restructuring bits of the exporters to allow for different formats within the same exporter. If no format is specified, the exporter would default to its current default format (JSON for DHIS, CSV for the others).
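A minimal sketch of that format negotiation, assuming a hypothetical helper that each exporter would consult (these names are illustrative, not the actual data export layer's API):

```javascript
// Hypothetical helper: pick the response format from a ?format= query param,
// falling back to each exporter's own default (JSON for DHIS, CSV otherwise).
const SUPPORTED_FORMATS = ['json', 'csv'];

const resolveFormat = (query, exporterDefault) => {
  const requested = (query.format || '').toLowerCase();
  return SUPPORTED_FORMATS.includes(requested) ? requested : exporterDefault;
};

console.log(resolveFormat({ format: 'json' }, 'csv')); // 'json'
console.log(resolveFormat({}, 'csv'));                 // 'csv' (exporter default)
console.log(resolveFormat({ format: 'xml' }, 'json')); // 'json' (unsupported -> default)
```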
@m5r - thanks for the update on exporters. It is reasonable to ship only JSON if that ends up being a net smaller amount of work (sounds like it might). We can then provide easy-to-follow docs to leverage `jq` and friends to flatten the results to CSV as needed.
Please decide which path you want to pursue (just JSON or both JSON and CSV) and let us know - thanks!
(and sorry if we end up throwing away CSV efforts)
Quick update, I ended up ditching the CSV format to keep only JSON. It's a small effort to implement one or the other and like you mentioned there are plenty of tools available to easily convert JSON to CSV.
The opened PR is still using the Couch view to query and process telemetry entries but I pushed 2 WIP-ish branches to compare performances between the implementations discussed in Slack last week.
In branch `8462-raw-query`, the API processes all telemetry entries with a single couch `_allDocs` query. It's fast but not memory efficient.
In branch `8462-smart-query`, the API progressively processes telemetry entries in chunks of 1000. It should be much faster than the Couch view, a bit slower than the other implementation, but way more efficient with memory because it never keeps more than `n + 1000` (where n = number of users) docs in memory.
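The chunked approach can be sketched roughly like this, with the Couch paging stubbed out by an in-memory page fetcher; everything here (`fetchPage`, the dedup map) illustrates the idea rather than the actual branch code:

```javascript
// Sketch of the 8462-smart-query idea: page through telemetry docs 1000 at a
// time, keeping only the latest entry per user, so memory stays bounded at
// roughly n users + one page of docs.
const PAGE_SIZE = 1000;

// Stand-in for a paged _all_docs request; here we just slice an array.
const fetchPage = (allDocs, offset) => allDocs.slice(offset, offset + PAGE_SIZE);

const latestPerUser = (allDocs) => {
  const latest = new Map(); // user -> latest telemetry summary
  for (let offset = 0; ; offset += PAGE_SIZE) {
    const page = fetchPage(allDocs, offset);
    if (!page.length) {
      break; // no more pages
    }
    for (const doc of page) {
      const current = latest.get(doc.user);
      if (!current || doc.date > current.date) {
        latest.set(doc.user, doc);
      }
    }
  }
  return latest;
};

// 2500 docs across 3 users; only 3 summaries survive in memory.
const docs = [];
for (let i = 0; i < 2500; i++) {
  docs.push({
    user: 'user-' + (i % 3),
    date: '2023-12-' + String((i % 28) + 1).padStart(2, '0'),
  });
}
console.log(latestPerUser(docs).size); // 3
```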
With all this, we should be able to make a decision on which solution we'll go with for this feature to properly support large scale deployments.
Thanks for the update @m5r! I suspect we may be forced to go the `smart-query` route because if someone has 10k users we may exceed available memory, yes? I defer to you.
As well: @jkuester - On your test instance in EKS, I think you should be able to easily switch between @m5r's ECR images from the branch builds, as they're close enough that CHT won't care that you switched versions without re-loading all the data in, warming views, etc.
Adding this to 4.6 milestone in case it's ready really quickly - but @jkuester feel free to move it to 4.7 if it's not ready in time - no biggie!
> I think we should test that this scales at least linearly so we'll know the answer of what to expect at scale.

> I agree - that will help decide how to prioritise the performance improvements. If it's DOA we need to know before 4.6.0 is released.
Hey all! Noting that thanks to @jkuester and the TDG, we got some good data (see below). @garethbowen commented on Slack that this data validates that this ticket as-is is good to go in 4.6 or 4.7 when ready \o/
Users | Seconds | Minutes |
---|---|---|
10 | 1 | 0.02 |
100 | 5 | 0.08 |
1,000 | 44 | 0.73 |
5,000 | 213 | 3.55 |
10,000 | 511 | 8.52 |
20,000 | 1300* | 21.67* |
* 20k user data extrapolated.
For testing, each user got 200 telemetry docs, a 1:200 ratio for 2 million docs total. This ratio was derived from analyzing production CHT instances using Watchdog and noting that the highest ratio seen in the wild was 1:191.
Here's the ratio table for reference, with CHT URLs removed:
Version | Feedback Docs | _users Docs | medic-users-meta Docs | Meta to User Ratio | Telemetry entries | Telemetry to User ratio |
---|---|---|---|---|---|---|
4.5.0.6922454971 | 1464 | 108 | 2877 | 26.6 | 1413 | 13.1 |
4.2.2.5489334896 | 834 | 194 | 2793 | 14.4 | 1959 | 10.1 |
4.2.2.5489334896 | 37921 | 11365 | 64370 | 5.66 | 26449 | 2.33 |
4.2.2.5489334896 | 25284 | 12584 | 47733 | 3.79 | 22449 | 1.78 |
4.4.0.6255863600 | 82917 | 28750 | 162834 | 5.66 | 79917 | 2.78 |
3.14.2 | 26467 | 872 | 33525 | 38.4 | 7058 | 8.09 |
3.17.1 | 54 | 101 | 576 | 5.7 | 522 | 5.17 |
3.17.1 | 241 | 108 | 1268 | 11.7 | 1027 | 9.51 |
4.2.0.5075680573 | 383 | 1569 | 9216 | 5.87 | 8833 | 5.63 |
4.2.2.5489334896 | 13652 | 4161 | 72211 | 17.4 | 58559 | 14.1 |
4.5.0.6922454971 | 4341 | 612 | 8285 | 13.5 | 3944 | 6.44 |
4.5.0.6922454971 | 879 | 117 | 1809 | 15.5 | 930 | 7.95 |
4.2.2.5489334896 | 120 | 2887 | 5528 | 1.91 | 5408 | 1.87 |
4.2.0.5075680573 | 205 | 579 | 1248 | 2.16 | 1043 | 1.8 |
3.15.0-FR-bulk-user-upload | 273 | 331 | 1320 | 3.99 | 1047 | 3.16 |
4.2.4.6465855007 | 1159 | 3232 | 9703 | 3 | 8544 | 2.64 |
3.15.0 | 120843 | 2813 | 223868 | 79.6 | 103025 | 36.6 |
3.15.0 | 34175 | 2975 | 53244 | 17.9 | 19069 | 6.41 |
3.17.1 | 410 | 135 | 1970 | 14.6 | 1560 | 11.6 |
3.15.0 | 37 | 161 | 4947 | 30.7 | 4910 | 30.5 |
3.15.0 | 10716538 | 574 | 10828326 | 18865 | 111788 | 195 |
3.14.2 | 35383 | 828 | 135304 | 163 | 99921 | 121 |
3.15.0 | 286 | 167 | 12596 | 75.4 | 12310 | 73.7 |
3.15.0 | 10715318 | 583 | 10774344 | 18481 | 59026 | 101 |
@garethbowen it struck me today that currently our map/reduce logic is resolving the latest telemetry document for a user. However, if a user is regularly switching between multiple devices (e.g. a phone and the browser on a computer) then the data returned could be incomplete (you would only see the laptop or the phone, not both). Should we actually be reducing down to a key of `${user}-${deviceId}`? The downside of doing that is you would get historical results for every device that a user ever had (unless we added some date filtering...).
Including multiple devices would be ideal, but multiple device users aren't common enough that we need to hold up this workflow if it's not an easy improvement. Historical data might actually be useful here, for example, for confirming when a device was replaced, and that the old one was "decommissioned" (is no longer connecting).
I think it's worth a try but don't spend too long on it.
Also consider using an array rather than concatenating the key (ie: `[user, deviceId]` rather than `${user}-${deviceId}`). Then you can use `group_level` to choose whether you want to get the latest for the user or the user+device (docs).
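For illustration, `group_level` semantics over array keys can be mimicked in plain Node (this is a sketch of what CouchDB's grouping does with the latest-date reduce, not actual CouchDB code):

```javascript
// Mimic CouchDB's group_level over [user, deviceId] array keys:
// group_level=1 groups by user only; group_level=2 groups by user+device.
const rows = [
  { key: ['gareth', 'phone-1'], value: { date: '2023-12-01' } },
  { key: ['gareth', 'laptop-1'], value: { date: '2023-12-26' } },
  { key: ['m5r', 'phone-2'], value: { date: '2023-11-30' } },
];

const groupLatest = (rows, groupLevel) => {
  const groups = new Map();
  for (const row of rows) {
    // Truncate the array key to groupLevel elements, like CouchDB does.
    const groupKey = JSON.stringify(row.key.slice(0, groupLevel));
    const current = groups.get(groupKey);
    if (!current || row.value.date > current.date) {
      groups.set(groupKey, row.value);
    }
  }
  return groups;
};

console.log(groupLatest(rows, 1).size); // 2 -> one latest entry per user
console.log(groupLatest(rows, 2).size); // 3 -> one per user+device pair
```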
:+1: Good call on the array! I have updated the map/reduce to do this. I guess for now I will leave the `user-devices` endpoint to not set anything for the `group_level` and instead return records for each user,deviceId pair. But, in the future we could come back and support parameters here for that API to allow choosing the `group_level`...
**Is your feature request related to a problem? Please describe.** Technical administrators should be able to find devices which are unsupported or likely to be slow so that they can proactively replace or upgrade them and keep users engaged and productive.
**Describe the solution you'd like** Provide a feature to show administrators key information about their users and their devices. This could be done in the admin app, or a dashboard, a Mobile Device Management service, or through Watchdog (or some combination of the above).
Some ideas for information to show are...
The critical thing is for this to be actionable. Some things admins should be working on are:
Remember that users can have multiple devices and we should list them all.
**Describe alternatives you've considered** This could be managed outside of the CHT, but we already collect much of this information, so it makes sense to expose it somewhere.