nhsx / open-source-policy

Open Source Policy development for the NHS
Other
47 stars 11 forks source link

Database schema #6

Open ChrisBeeley opened 2 years ago

ChrisBeeley commented 2 years ago

When sharing analytic code it's important to establish the extent to which it is safe or desirable to share database schema (raw SQL statements, column names, DB and table names...)

I have been told that we cannot share material of this sort because a) sharing DB and table names poses a security risk b) database vendors hold copyright of schema and names, although these objections appear to be contradicted by other government guidance

I can't find this addressed in the policy or any of the linked documents. It would be very useful to have this explicitly addressed in this guidance

otlah commented 2 years ago

Absolutely, it's not addressed here (yet), and it would be useful to include that kind of detail. I fully agree with the CDDO guidance but I'll leave this issue open while we investigate the copyright concern further.

connor1q commented 2 years ago

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

ChrisBeeley commented 2 years ago

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

Absolutely. A definitive statement on security seems relatively straightforward and I would hope that it is added to this guidance in due course.

As for the copyright issue, somebody needs to confront the vendors about this and get them to make a statement on it one way or the other. I tried and was met with a wall of silence. It's a simple question with a simple answer and the vendors need to play fair and answer it

otlah commented 2 years ago

Hi folks, if you'd like to work together to propose a security statement on this topic I can check it through and find the right space for it next week?

connor1q commented 2 years ago

Hi @otlah, @ChrisBeeley

I'll try to get hold of someone from security who can give a steer.

While table names should never be risky, I could imagine a situation where the CNAME address for a database might be risky. Security tends to take the onion philosophy - so the more protection the better. We just need to be able to identify a sensible balance of risk versus value in sharing code.

For that type of environment configuration though (database address), our open code guidance should explain how to avoid storing those things in open code. E.g. using a secrets manager or using environment variables. That type of thing would go along other code security practices like running githooks to avoid credentials being pushed to git and using .gitignore files.

The value of open code comes from seeing the business logic - not the database details. So it is no loss to mask that info.

ChrisBeeley commented 2 years ago

Indeed, yes, I have no problem with running odbcConnect(Sys.getenv("DB_NAME")) or whatever it is it's column names that are problematic. Short of renaming the entire table they can pop up throughout the DB and analytic code. It's hard to believe that there would be any problems caused by my publishing time_from_referral = df$date_seen - df$date_referred (assuming they were real column names) but we have mountains of unpublished code that looks just like that

connor1q commented 2 years ago

Hi @ChrisBeeley @otlah I've not been able to get hold of a security person for this unfortunately.

I have seen that the Goldacre report is out and sets out a very strong expectation that code should be open: https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis/better-broader-safer-using-health-data-for-research-and-analysis

ChrisBeeley commented 2 years ago

But is itself frustratingly short on detail

connor1q commented 2 years ago

That's true - but probably expected for a report like this. The report does waggle its eyebrows suggestively at the right organisations to get involved in setting out the specifics.

E.g. governance people image

and the ICO image

Perhaps we could convene a wider group to focus on this draft policy and try to elaborate on some of the missing pieces?

ChrisBeeley commented 2 years ago

Indeed, yes. Not a criticism of the report, which I think is great. Just saying we need a different approach here. A technocratic, this is the answer, tablets of stone, job done type thing.

100% up for convening a "detail" group, you know where to find me

wbryant commented 2 years ago

I'd also be more than happy to contribute to a group like this.

otlah commented 2 years ago

Hi folks, we've still got a couple of major groups to get feedback from (including in UK research funding), but once that feedback's in we're intending to gather a list of issues that have to be solved before we can publish a version 1.0, and a list that we can take longer over to unstick. I'm stoked to have a bunch of volunteers to help populate those lists!

sebbacon commented 2 years ago

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

susheel commented 2 years ago

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

ChrisBeeley commented 2 years ago

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

My Trust does use a TPP database, so that might be an angle for me to try @sebbacon . Don't suppose the permission is published anywhere, or generalisable, or anything I can use to try to push things on where I am?

ChrisBeeley commented 2 years ago

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

This is amazing. I had no idea this existed. I need to go and read this carefully. Thanks!