Log analysis for self-hosted instances

derickl commented 2 years ago

We have a likely upcoming request from MoH UG, on summarising access to the system. They are keen to audit user activity. Given we haven't hooked up logtrail to it, what options or recommendations would you have towards this end?

kennsippell commented 2 years ago

Assigning to @craig-landry

I'll follow up with Gareth and Hareet. Regarding timing, everyone is on CHT 4.0 / Archv3 right now. If this is going to need a bunch of effort it'll come at the expense of that effort. If we can keep it small and simple though I don't think it'll be too tough.

https://medic.slack.com/archives/C01LMSDHB9D/p1648580281100539?thread_ts=1648540827.074429&cid=C01LMSDHB9D

mrjones-plip commented 2 years ago

@derickl - can you provide more information about exactly what data points the MoH wants to know?

Is this truly log analysis, or is there a specific question about user activity that MoH wants answered, regardless of where the data comes from?

Finally, how will the data be consumed? Will the output of a grep call be satisfactory, or does this need to be viewed in a Klipfolio or Superset dashboard?

henokgetachew commented 2 years ago

Something to note here is that user activity auditing (in the traditional sense of the word) is tricky in an offline-first application such as ours. I think we'll probably have to educate them on this, because most of the activity happens on the client side without sending any requests to the back end.

I think there are two parts to this ticket.

What this project needs to analyze right now
A more general solution that we could package or recommend for self hosting partners.

I think the first one falls into the support dashboard. I think it could be achieved via grep within the log folder if they know what they are looking for.

The second one is more of an engineering task that would require listing candidates, evaluating them, and choosing a winner based on that evaluation.

mrjones-plip commented 2 years ago

Excellent points, thanks @henokgetachew !

Once we hear back from @derickl on what's needed, we can see what logical next steps are.

To kick things off, assuming the output of grep is a valid solution to this ticket, I explored how to capture the POST calls which represent logins. While the rest of the app usage can be hidden from the server because of the offline-first architecture, logins must happen online:

First, figure out the name of your HAProxy container:

$ docker ps --format="{{.Names}}" | grep -i haprox
helper_test_haproxy_1

You can find successful logins with this docker logs call and grep looking for 200s:

docker logs helper_test_haproxy_1 |grep "200,POST,/_session,-"
Mar 29 22:09:06 ce16e6d1f508 haproxy[25]: 172.18.0.3,200,POST,/_session,-,medic,'{"name":"medic","password":"***"}',403,1,46,'-'
Mar 29 22:09:46 ce16e6d1f508 haproxy[25]: 172.18.0.3,200,POST,/_session,-,medic,'{"name":"medic","password":"***"}',403,1,46,'-'
Mar 29 22:09:49 ce16e6d1f508 haproxy[25]: 172.18.0.3,200,POST,/_session,-,medic,'{"name":"medic","password":"***"}',403,1,46,'-'
Mar 29 22:10:38 ce16e6d1f508 haproxy[25]: 172.18.0.3,200,POST,/_session,-,foobar,'{"name":"foobar","password":"***"}',402,1,44,'-'
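
As a rough sketch, those matches can be rolled up into a per-user count. This assumes the comma-separated layout seen in the sample lines above (field 2 is the status, field 3 the method, field 4 the path, field 6 the user name); count_logins is a hypothetical helper name, not something the CHT ships:

```shell
# Count successful logins per user from HAProxy log lines read on stdin.
# Assumption: comma-separated fields as in the sample lines
# (2 = status, 3 = method, 4 = path, 6 = user name).
count_logins() {
  awk -F',' '
    $2 == "200" && $3 == "POST" && $4 == "/_session" { n[$6]++ }
    END { for (u in n) print n[u], u }
  ' | sort -k1,1nr -k2,2   # busiest users first, ties broken by name
}

# Usage (container name taken from the docker ps call above):
#   docker logs helper_test_haproxy_1 | count_logins
```

Each output line is a count followed by a user name, which is probably easier to share with a non-technical audience than the raw log lines.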

To find failed login attempts, you would look for a 401 instead of a 200:

docker logs helper_test_haproxy_1 |grep "401,POST,/_session,-"
Mar 30 14:28:09 ce16e6d1f508 haproxy[25]: 172.18.0.3,401,POST,/_session,-,medic,'{"name":"jane","password":"***"}',390,1,67,'-'
Mar 30 14:28:21 ce16e6d1f508 haproxy[25]: 172.18.0.3,401,POST,/_session,-,medicd,'{"name":"jane","password":"***"}',390,1,67,'-'
Mar 30 14:28:23 ce16e6d1f508 haproxy[25]: 172.18.0.3,401,POST,/_session,-,medicd,'{"name":"jane","password":"***"}',390,1,67,'-'
Mar 30 14:28:35 ce16e6d1f508 haproxy[25]: 172.18.0.3,401,POST,/_session,-,medicd,'{"name":"jane","password":"***"}',390,1,67,'-'
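
A minimal sketch for turning those failed attempts into something more readable, pulling out the time, the source address and the name that was submitted in the request body. The field positions are an assumption based on the sample lines above, and failed_logins is a hypothetical helper name:

```shell
# List failed login attempts as: month day time source-address attempted-name
# Assumption: comma-separated fields as in the sample lines, with the
# submitted JSON body starting in field 7.
failed_logins() {
  awk -F',' '
    $2 == "401" && $3 == "POST" && $4 == "/_session" {
      n = split($1, head, " ")       # "Mar 30 14:28:09 host haproxy[25]: ADDRESS"
      name = $7                      # looks like: {"name":"jane"  (with a leading quote)
      sub(/^.*"name":"/, "", name)   # drop everything up to the name value
      sub(/".*$/, "", name)          # drop the closing quote and anything after it
      print head[1], head[2], head[3], head[n], name
    }
  '
}

# Usage:
#   docker logs helper_test_haproxy_1 | failed_logins
```

The name is taken from the request body rather than from field 6, since the body is what the person actually typed into the login form.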

If you want to query the CHT about a specific user, say jane, who couldn't log in, you can use curl plus jq to get their username, document revision, phone number and role via the API:

curl -s https://medic:password@192-168-68-17.my.local-ip.co:8443/api/v1/users | jq '.[] |select(.username=="jane")  | .username, .rev, .contact.phone, .contact.role'
"jane"
"1-695a58c8fd902f4eae06aae63edbe0b8"
"+254712345678"
"chw"

NB - if you want to check all POSTs, since the docker logs call may not return all of them, you can use an exec call to grep the log file directly:

docker exec -it helper_test_haproxy_1 grep POST /srv/storage/audit/haproxy.log|grep "200,POST,/_session,-"

derickl commented 2 years ago

@mrjones-plip regarding https://github.com/medic/cht-infrastructure/issues/18#issuecomment-1082375785 (I really apologise for the delay in getting back to you)

MOH wants to feel confident that they can tell what actions users took in the system. From the last call, they would need usernames, IP addresses and the actions that said users took.

Looking at https://github.com/medic/cht-infrastructure/issues/18#issuecomment-1083320237, this is too high level and doesn't quite capture what happened.

@henokgetachew from your comment here, CHT is being used to run a health system and government needs to be confident they can track changes to the data in the system. We need to have a slightly better solution than 'grep within the log folder if they know what they are looking for' - they are not CHT experts

mrjones-plip commented 2 years ago

Thanks for the feedback @derickl! Can you clarify what you mean by "actions that said users took"? Login, logout, every form submitted, edited or deleted? Or more abstracted, like "how many household visits"? Let's assume it is any time a document is synced to couch and then we'd use the name of the document as the "activity" (but let me know if I'm wrong!)

Because we're offline-first, a trio of "ip/username/action" may not be possible. If a CHW logs in once on WiFi, goes offline, then does 30 household visits and creates 60 docs and then syncs via cellular data, which IP should we use for those actions? I think it would be helpful to educate MoH on how CHT works offline to set expectations.

I suggest a dashboard showing a high level table of users and their aggregated activity. This data is a mashup of HAProxy logs as well as couch2pg data in postgres:

User	Logins	Actions	Last Seen	Last IP
Lisa	22	343	2 Feb 2022	192.168.1.1
Ann	1	642	3 Feb 2022	10.0.1.1
Christina	2	343	1 Feb 2022	2345:0425:2CA1::0567:5673:23b5

Clicking a row would give you a chronological list of activities based on document names as seen in postgres:

Lisa Detail

Item	Date	Detail
Login	2 Feb 2022	192.168.1.1
Document	3 Feb 2022	Register Pregnancy
Document	3 Feb 2022	Death Report
Document	4 Feb 2022	U5 Checkup
Login	5 Feb 2022	110.0.1.1
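
As a very rough sketch, the User / Logins / Last Seen / Last IP columns could be derived from the HAProxy log alone (the Actions column would still need the couch2pg data, which is not covered here). The field positions are an assumption based on the HAProxy sample lines earlier in this ticket, and login_summary is a hypothetical helper name:

```shell
# Print one tab-separated row per user: user, logins, last seen, last address.
# Assumptions: comma-separated fields as in the HAProxy sample lines
# (2 = status, 3 = method, 4 = path, 6 = user name), and the log is in
# chronological order so the last matching line per user is the most recent.
login_summary() {
  awk -F',' '
    $2 == "200" && $3 == "POST" && $4 == "/_session" {
      n = split($1, head, " ")       # "Mar 29 22:09:06 host haproxy[25]: ADDRESS"
      logins[$6]++
      last_seen[$6] = head[1] " " head[2] " " head[3]
      last_ip[$6] = head[n]
    }
    END {
      for (u in logins)
        printf "%s\t%d\t%s\t%s\n", u, logins[u], last_seen[u], last_ip[u]
    }
  ' | sort
}

# Usage:
#   docker logs helper_test_haproxy_1 | login_summary
```

One caveat: in the sample log lines the address is in a private range (172.18.0.3), so depending on how the deployment is set up it may be another container or proxy rather than the end user's address.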

derickl commented 2 years ago

Thanks for the follow up @mrjones-plip

Login, logout, every form submitted, edited or deleted?

This is closer to what would be needed. Assuming you wanted to review what we log and be able to tell what happened, what would you be looking for? We need to approach this with some empathy and try to understand where they are coming from. At this point, having the ability to audit is what they have as a need. We haven't gotten deeper on that ask but as I mentioned in the previous thread, they wanted to tie actions to users (and IP if possible). Web servers / proxies do log this. Right?

If a CHW logs in once on WiFi, goes offline, then does 30 household visits and creates 60 docs and then syncs via cellular data, which IP should we use for those actions?

Interesting question. What do we currently log?

I think it would be helpful to educate MoH on how CHT works offline to set expectations.

Are you able to summarise this in a way that can be shared to MoH and also highlight our gaps in auditing and how it ties back to offline first? It would be great to highlight what we can and can't do and also call it out in our docs.

I suggest a dashboard showing a high level table of users and their aggregated activity. This data is a mashup of HAProxy logs as well as couch2pg data in postgres:

Would you be open to helping build out a proof of concept for this?

mrjones-plip commented 2 years ago

NB - this ticket is in a public repo, so all of this ticket is public

Web servers / proxies do log this. Right?

Yup! They're very literal though: when you connect, they log a GET or a POST and the IP. They don't know who you are; that's the job of the app behind the proxy (CHT). We'd have to do some more work to join the actions together with usernames.

Interesting question. What do we currently log?

What it would log is the IP from when you did the bulk upload, which covers the many forms you created in the field over time and space. Here's a log entry of my offline user logging in, going offline for ~20 min, and then bulk uploading the docs created while offline:

Apr  6 22:02:05 3c9e67d5b8be haproxy[26]: 172.18.0.3,200,POST,/_session,-,mrjones,'{"name":"mrjones","password":"***"}',405,2,45,'-'
Apr  6 22:24:04 3c9e67d5b8be haproxy[26]: 172.18.0.3,201,POST,/medic/_bulk_docs,-,mrjones,'{"docs":[{"form":"pregnancy_facility_visit_reminder","type":"data_record","content_type":"xml", [DATA-TRUNCATED]
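
As a sketch of how the "action" could be pulled out of entries like the second one, the form names can be extracted from the logged request body and paired with the user and upload time. The field positions are assumptions based on the two sample lines above, bulk_doc_forms is a hypothetical helper name, and since the logged body may be truncated this will only see forms that made it into the log line:

```shell
# For each _bulk_docs upload, print: month day time user form-name
# Assumptions: comma-separated fields as in the sample lines
# (3 = method, 4 = path, 6 = user name), and each synced form shows up
# in the logged body as "form":"<name>".
bulk_doc_forms() {
  awk -F',' '
    $3 == "POST" && $4 == "/medic/_bulk_docs" {
      split($1, head, " ")           # "Apr  6 22:24:04 host haproxy[26]: ADDRESS"
      line = $0
      # Walk through every "form":"..." occurrence on the line.
      while (match(line, /"form":"[^"]*"/)) {
        form = substr(line, RSTART + 8, RLENGTH - 9)   # value between the quotes
        print head[1], head[2], head[3], $6, form
        line = substr(line, RSTART + RLENGTH)
      }
    }
  '
}

# Usage:
#   docker logs helper_test_haproxy_1 | bulk_doc_forms
```

Note the timestamp is when the docs were synced, not when the forms were filled in, which is exactly the offline-first caveat discussed above.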

From what I can see, the HAProxy log is meant to record EVERYTHING, down to the raw data of each form submitted. It is accordingly verbose and non-trivial to parse.

Are you able to summarise this in a way that can be shared to MoH and also highlight our gaps in auditing and how it ties back to offline first? It would be great to highlight what we can and can't do and also call it out in our docs. Would you be open to helping build out a proof of concept for this?

Yes, I'd be open to helping and to summarizing offline functionality as needed. Some fun discussion on this! I think it'd be helpful to go over this in real time - I'll try and schedule some time with you!

mrjones-plip commented 2 years ago

After chatting with @garethbowen about this, we think the best approach for an MVP is to add 3 key pieces of information to the docs site in response to this ticket:

A new section specifying all the logs that a CHT instance generates including their default location in a self-hosted deployment.
A detailed description of the parts of the HAProxy log, which is explicitly put in place so you can go back in time and audit what happened, when it happened, and who did it in a given CHT instance.
A suggestion to use Kibana and then maybe Logtrail. This will allow for ad hoc, in depth and non-grep based solutions. Ideally we can chain this up to a tabular visualization per my prior comment.

After this is completed, we can work with any interested deployments and MoHs to see if this is sufficient to meet their audit needs in a self hosted scenario.

I'm out of the office until Apr 18th at which point I'll resume working on this!

derickl commented 2 years ago

mrjones-plip commented 2 years ago

Awesome - thanks @derickl - I'll respond to that post you cited today with a similarly high level response: use HAProxy, don't use mutable couchdb, look to use a web friendly tool like kibana, be very careful with PII/PIH. Then for this ticket and the forum post, I'll come back with some best practices after testing in our Kibana deployment to give some actionable steps.

There still may yet be some additional docs to publish around this!

mrjones-plip commented 2 years ago

Closing this with a link to the forum post outlining a solution for this. Feel free to re-open as needed!

