Database access options for complex queries

pacharanero commented 11 months ago

Noting some thoughts following meeting today discussing backend DB access for E12 analysts in order that they can perform complex arbitrary queries on the database. Just starting this as an Issue in order to get views from the team. These options are not mutually exclusive, some may suit particular queries or reports better than others.

Options for access include:

1. Allow direct access to a read-only PostGIS role (user)

This option would require us to 'punch a hole' in our firewall to allow access from an external computer into the VPS instance which houses the Live server. We would connect to the Database Data Volume and allow an external tool like pgAdmin to connect to PostGIS on port 5432. This would almost certainly require opening inbound ports (security risk) or tunnelling the connection over a VPN (possibly less risk but adds user complexity) Advantages: Gives essentially complete DB access, enabling any query to be executed. No developer time needs to be spent on Disadvantages: Potentially adds security risk; Non-TRE[^1] access creates a larger data breach risk, and is against general direction of travel on NHS research data access, which should all be via TRE[^2].

2. Build Django queries into the E12 application itself

This is particularly suitable for repetitive and unchanging, regular queries, for example compliance reporting or other reporting obligations which are not going to change much over time. This lends itself to creating a Reporting page which is accessible only to E12 team members, and then allows these pre-made reports to be downloaded in a suitable format. Advantages: can potentially automate away a lot of boring work; Security is similar to existing arrangement in E12 app (2FA); E12 analysts can 'join the dev team' and their Django/Pandas queries can be dropped straight into our codebase; Data breach risk is less because only the fixed, agreed queries can be executed - not any arbitrary query. Disadvantages: requires E12 development work to add each one; queries cannot be bespoke or change rapidly.

3. Build a 'SQL Explorer' tool inside E12

https://pypi.org/project/django-sql-explorer/ is a django plugin which would add SQL execution functionality into the Django Admin interface, which we already have in E12, and which is subject to tight security controls already. Advantages: Built into the application we are already using; would not really require a massive amount of development to implement using a library like this; We can use existing E12 2FA security; Would also work for NPDA and other audit tools. Disadvantages: Would need to ensure ONLYT read-only access is granted; Requires an additional library dependency.

4. Add a SQL Explorer tool as a 'bolt-on'

https://github.com/ankane/blazer is an example of a separate tool which would run on the server in its own Docker container, with access to the same Data Volume as our database, and the credentials of a read-only user role. It would be served at a URL something like https://e12.rcpch.ac.uk/blazer, or something like that. Advantages: Relatively straightforward to implement; Fulfils the definition of a TRE; Less security risk than direct DB access. Disadvantages: Would likely have its own security arrangements (eg username/password) which would be different to E12 login. However this would only be required for a small number of highly trusted users.

5. REST API database access

We could make it possible to obtain certain data through the Django REST API, which potentially makes it available to other applications such as Excel (using REST API calls / Python). Security would likely be via issuing an API key to the consuming application. Advantages: Opens up potential to integrate with external tools; Obviates need to build UI elements. Disadvantages: Would require each model in the DB to be exposed as REST; Multiplicity of models means that potentially a number of separate, linked queries would be required in order to extract the data needed. Python in Excel is relatively new and not widely used.

All of the above is for discussion, happy to build and add to these Advantages/Disadvantages and develop a thorough understanding of the user/business need VS technical implementation difficulty VS security/data breach risk.

[^1]: Trusted Research Environment [^2]: Data access policy update: proposed draft https://www.gov.uk/government/consultations/data-access-policy-update-proposed-draft/data-access-policy-update-proposed-draft

nikyraja commented 11 months ago

Thanks Marcus, this is comprehensive and incredibly useful.

My initial preference would be to either 'Build a 'SQL Explorer' tool inside E12' or 'Add a SQL Explorer tool as a 'bolt-on'. These would allow the E12 team to run bespoke queries - it is important that the project team have the ability to do this so we can carry out added value projects, data access requests etc. It would also be great to have our analysts more involved in the 'back-end' of the platform, and this seems like a fitting opportunity.

We can however, also specify some results that are standard & repeated, such as the KPIs, so these queries can be built in and report automatically generated with these results. This will be beneficial as it will minimise the work needed to be done manually by the team, however this will not work for all of our outputs/processes.

We can discuss this further in a meeting - Thanks!

pacharanero commented 10 months ago

Update following Data Extraction meeting 2024.01.10 with @nikyraja @AmaniKrayemRCPCH @pacharanero @dc2007git Hannah @sairapons

Notes from Teams chat: [11:37] Niky Raja Option 2; build export function for KPI data (THB, regional, national etc); only available for E12 team at first, but could be used to send quarterly reports to individual services. Can also do this for any other repetitive analysis that could be automated.

Build SQL explorer within E12 (not bolt on) to allow E12 team to run bespoke queries. [11:41] Niky Raja Priroity: build SQL first, and then build django for KPIs

We feel that Option 3 (SQL Explorer built into the Admin interface) above is the most important priority for development. (separate Issue will be created)

We will also need to use a little of Option 2 (Bespoke Django queries in code which result in KPI or other regular analysis dashboards) (each Query required will need an Issue to be created specifying the details of the query, a little about the purpose to help us write it better, and any access controls or other automation required.)

Option 5 (API Access) will get very messy for data extraction, although API remains a useful mechanism for inputting new Case data by trusts, for example.

Option 1 (Direct access to a SQL DB) was re-discussed and apparently this has worked well for some other data projects (Badgernet) by 'mirroring' the live SQL DB into a local SQL DB, where it can then be directly interrogated by the StatR tooling. So this could be an option for queries which don't fit well in Option 3 or 2 approaches and we will keep it 'on the table'. There is a slight downside which is that it is not what I would consider 'TRE' access in that the entire DB is copied locally for analysis.

eatyourpeas commented 10 months ago

Sorry late to the conversation. My preference would be for Option 1. I don't see it as a big vulnerability from a security point of view so long as all the credentials are tightly controlled and locked to a few key users. It would give the team direct access to the database to run queries as they see fit using PgAdmin. It would avoid using another dependency too, which as you know I am not keen on. The team would have view only access. It would be quickest to implement and my guess is the E12 team are very good at SQL queries and will find the PGAdmin environment (which is really good) more powerful and intuitive than a plugin for django (though I don't know much about Sql Explorer so may be wrong here). Also, I really would prefer for E12 team not to need django admin at all and remove superuser access from them because this gives read/write/drop access to all tables.

My favourite option remains otherwise option 2: the more we build queries into the platform, the better it will be in the longterm and also it means clinicians would benefit as well as E12 team from the work, and the queries could be run against their organisation/trust/health board as well as nationally so they could potentially act on findings in clinical care. And in real time, not just as a one off. This of course is more work, and as you say, will not meet all use cases. So for sure it is not the only solution, but for common things definitely a preference.

Option 5 also I would argue is good - I remind everyone that the API is actually already built - I spent much of the summer on it - and switching it on would not be much work. Again though, it is more powerful as a mechanism to write rather than to read, though both are possible. The API has django-filter built in, so is capable actually already of quite complex queries within tables, though we need to think about bespoke endpoints for inner joins and so forth.

pacharanero commented 9 months ago

OK perhaps we are going to need another meeting then, to discuss this further. Can we arrange one that Simon can get to?

I have concerns about Option 1 due to the complete access it provides to the database, through the firewall. It can be done safely, but it is likely to be an area of risk. Option 1 also completely precludes Database Encryption, which we have been asked to implement by the Data Protection advisory team that were hired. If we are going to do this then we should probably have a secure tunnel which users would use to connect to the Live VPS, and then from there they would authenticate to PostGIS with a read-only Role which is purely for E12 staff to read the DB with. I think the additional security would be a headache to set up and manage, and would cause frequent end-user difficulties.

Option 5 I think might work but probably represents the hardest technical option for E12 users, regardless of whether the API is already built or not. Simply to access a complete E12 record they would have to hit the Case, Registration and Site endpoints, not to mention any other information such as EpilepsyContext, Investigation - essentially one hit for each model we have, unless we break every rule of REST and offer a combined 'everything' endpoint, which would be essentially a RPC, not REST. And still there would be the problem of what we connect this to on the E12 side? Does StatR do REST? How easy is it to build a suitable client?

I agree about Option 2: Building queries into the platform is slightly more development work for us, but after that the analyst work drops to minimal, for repeated queries anyway. The downside is the lead-time required for building a query. This need not be a dealbreaker though.

The meeting yesterday was useful in understanding the way that the analysts will use the data, and we had agreed a way forward. I think before we reverse that decision entirely, for example to use Option 5, we would need to ensure the E12 team fully understood how exactly they would be required to access the API. In fact I'd want to see a demo of it, to ensure they would be able to use it in the way envisaged.

If we remain unsure of the way forward then we might have to do something as an interim measure and then let the new Lead Developer make the final decision.

RichardBurleyRCPCH commented 9 months ago

Probably sensible Marcus.

Appreciate the pace you're all working at here, but we don't want to rush it, obviously.

I feel like we're close to arriving at an agreed position here tho. May only need half an hour?

Richard Burley

Executive Director of Digital

Royal College of Paediatrics and Child Health

Web: http://www.rcpch.ac.uk http://www.rcpch.ac.uk/

From: Marcus Baw @.> Sent: Thursday, January 11, 2024 13:46 To: rcpch/rcpch-audit-engine @.> Cc: Richard Burley @.>; Assign @.> Subject: Re: [rcpch/rcpch-audit-engine] Database access options for complex queries (Issue #683)

OK perhaps we are going to need another meeting then, to discuss this further. Can we arrange one that Simon can get to?

I have concerns about Option 1 due to the complete access it provides to the database, through the firewall. It can be done safely, but it is likely to be an area of risk. Option 1 also completely precludes Database Encryption, which we have been asked to implement by the Data Protection advisory team that were hired. If we are going to do this then we should probably have a secure tunnel which users would use to connect to the Live VPS, and then from there they would authenticate to PostGIS with a read-only Role which is purely for E12 staff to read the DB with. I think the additional security would be a headache to set up and manage, and would cause frequent end-user difficulties.

Option 5 I think might work but probably represents the hardest technical option for E12 users, regardless of whether the API is already built or not. Simply to access a complete E12 record they would have to hit the Case, Registration and Site endpoints, not to mention any other information such as EpilepsyContext, Investigation - essentially one hit for each model we have, unless we break every rule of REST and offer a combined 'everything' endpoint, which would be essentially a RPC, not REST. And still there would be the problem of what we connect this to on the E12 side? Does StatR do REST? How easy is it to build a suitable client?

I agree about Option 2: Building queries into the platform is slightly more development work for us, but after that the analyst work drops to minimal, for repeated queries anyway. The downside is the lead-time required for building a query. This need not be a dealbreaker though.

The meeting yesterday was useful in understanding the way that the analysts will use the data, and we had agreed a way forward. I think before we reverse that decision entirely, for example to use Option 5, we would need to ensure the E12 team fully understood how exactly they would be required to access the API. In fact I'd want to see a demo of it, to ensure they would be able to use it in the way envisaged.

If we remain unsure of the way forward then we might have to do something as an interim measure and then let the new Lead Developer make the final decision.

— Reply to this email directly, view it on GitHubhttps://github.com/rcpch/rcpch-audit-engine/issues/683#issuecomment-1887194364, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A7KFVSRLLL7VNM3527WIRQLYN7UJTAVCNFSM6AAAAABAF3RQSWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBXGE4TIMZWGQ. You are receiving this because you were assigned.Message ID: @.***>

This email has been scanned for spam & viruses. If you believe this email should have been stopped by our filters, click herehttps://portal.mailanyone.net/index.html#/outer/reportspam?token=dXNlcj1yaWNoYXJkLmJ1cmxleUByY3BjaC5hYy51azt0cz0xNzA0OTgwNzY3O3V1aWQ9NjU5RkYxMUYzMjMxMzEzREM0RTRENjVFOTMwQjI2QTU7dG9rZW49ZjdhZTBjMGE5NTg1ODg4NjRjYmYyZDliOTYyZjU4ZjY4MGYxZjIzZDs%3D to report it.

eatyourpeas commented 9 months ago

Update to this - meeting took place and we have settled on Option 3

pacharanero commented 8 months ago

@mbarton I thought this issue might be of interest to you since it is obviously a fairly important new feature and has implications across security, technical feasibility, time to implement, etc so I feel you should get an opportunity to have a think about these options, comment and review the decision. This issue was closed following a decision at the second meeting to opt for Option 3, but if you felt that another option was better we would definitely want to reconsider. A new issue was created to slate the work for future development here https://github.com/rcpch/rcpch-audit-engine/issues/739

mbarton commented 8 months ago

Thanks @pacharanero! I definitely think allowing free-form queries will be vital and we can probably write some postgres roles that restrict access down to avoid accidents.

It might have been noted down elsewhere but it would be good to record what type of queries we anticipate people running. My main security concern isn't necessarily a vector to access to the entire database but that by default read access would also include the PII of our cases.

If queries are mostly aggregate then maybe we could consider access to a periodic anonymised dump rather than live itself?

nikyraja commented 8 months ago

Most of the queries would be aggregated to Trust level, and higher, however we do need access to the patient level data to handle queries.

For example, we say that 2/10 CYP in a trust weren't seen by an ESN. We need to be able to identify those 2 CYP and communicate their details and why they haven't met the KPI. I'm assuming that a fully anonymised dataset wouldn't allow this?

The other example is when we get data-access requests that require us to share identifiable or pseudonymised data, eg. including NHS numbers

mbarton commented 8 months ago

Thanks @nikyraja! Makes sense, let's see what we end up doing regularly and can automate and improve security as we go

rcpch / rcpch-audit-engine