mingward commented 5 months ago

Describe the bug The HAPI FHIR server AuthorizationInterceptor only limits the query results, not the resources that can be searched. Consequently, a user can craft queries to deduce the values in resources they should not have access to. This presents data leak threats to protected data.

To Reproduce Here is an example demonstrating the "data leak". Assume that on a FHIR server, we have two resources: a) Resource "Patient" has the patient ID twice de-identified. For demonstration purposes, we make this resource public so you can try it. The example URL is: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-small/x1/Patient?&_count=1

b) Resource "Observation" requires authorized access. So, https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-small/x1/Observation shows a "denied access" message.

c) Next, we will show that even without authorization to the "Observation" resource, you can still use a set of queries to find out a patient's values in the "Observation" resource for a measurement called "phv00493247.v1.p1".

Step 1: Run the following query: https://dbgap-api.ncbi.nlm.nih.gov/fhir-jpa-small/x1/Patient?_has:Observation:patient?_has:Observation:patient:code-value-quantity=phv00493247.v1.p1$gt3

You will get 4 patients below. Using binary search to eventually deduce the measure of a patient in the series of query below.	Query	Patients
phv00493247.v1.p1$gt3	["3778609", "3778569", "3778606", "3778571"]
phv00493247.v1.p1$gt4	0	All 4 patients' values are between (3, 4)
phv00493247.v1.p1$gt3.5	["3778571"]	Patient 3778571 has value between (3.5, 4).
		Other 3 patients have value in (3, 3.5].
phv00493247.v1.p1$gt3.75	0	Patient 3778571 value in (3.5, 3.75]
phv00493247.v1.p1$gt3.625	0	Patient 3778571 value in (3.5, 3.625]
phv00493247.v1.p1$gt3.5625	["3778571"]	Patient 3778571 value in (3.562

Expected behavior For users with no authorization to Observation, the endpoint should give an error message.

Environment (please complete the following information):

HAPI FHIR 5.7.2 REST Server (FHIR Server; FHIR 4.0.1/R4)
linux Alma8

Additional context In above demonstration, we assumed patient is public. But the same data leak exists if a user has authorization to a set of patients and a set of Observation belong to StudyA. This user has no authorization to Observation entries belong to StudyB. But with current AuthorizationInterceptor, the user can deduce the values belong to StudyB.

To use FHIR for research data including "controlled access data", it is of utter importance to prevent such data leak. Thank you in advance for your attention to this.

jamesagnew commented 5 months ago

Interesting. Do you have any suggestions on what a good fix would be for this issue?

I suppose the simplest solution would be to forbid _has searches when the authorization interceptor is in use at all. A slightly more refined option might be for it to ban them when any kind of compartment based rule is in place..

jamesagnew commented 5 months ago

Thinking this through a bit more... I guess what we should really be doing is blocking _has if there isn't a rule which allows "global" access to all resources of the source resource type for the _has expression.

mingward commented 5 months ago

Thanks. But reverse chaining searching is a great feature. Our tech lead has a few suggested fix.

one is automatically adding search parameter variants that access the security label for every search parameter (e.g., if there is a search parameter code, then there is automatically a security-code parameter, and code-value-quantity implies security-code-value-quantity) and then a SearchNarrowingInterceptor that automatically replaces normal search parameters with their security- variants using a specified list of security codes and replaces the security- variants with one whose security parameter target value is the intersection of that security code list and the ones the user specified.
He also described another fix: (more flexible but more challenging for the user to debug) is to have a SearchRewritingInterceptor that takes a parsed search and returns a different parsed search. The returned search contains the original search so that it can be displayed to the user (thus, the rewriting remains invisible). But the actual query executed will be the rewritten one. (This might even be a QueryRewritingInterceptor that allows rewriting other FHIR queries like PUT and DELETE.) A user could then use this to write the above SearchNarrowingInterceptor or something less general that met their specific security needs.

Of the two fixes mentioned above, the favor is the security search narrowing interceptor. It will solve most of the problems faced by most users. It will also share code that can be polished along with the rest of the server and improve over time.

jamesagnew commented 5 months ago

Using SearchNarrowingInterceptor would certainly be an avenue.. I haven't thought it through completely but at first blush I suspect this would mitigate the attack you're describing. I'm not sure I follow the QueryRewritingInterceptor part - You should be able to achieve what you're describing using the stock SearchNarrowingInterceptor today, assuming your security rules are restricting the user to accessing a single patient compartment.

mingward commented 4 months ago

Thanks James for the comment of "SearchNarrowingInterceptor would certainly be an avenue". About "not sure I follow the QueryRewritingInterceptor part", just like to let you know that our tech Lead Eric will be commenting here soon.

RadixSeven commented 4 months ago

You should be able to achieve what you're describing using the stock SearchNarrowingInterceptor today, assuming your security rules are restricting the user to accessing a single patient compartment.

Our security rules do not restrict access to a single or even multiple Patient compartments. We needed more than compartments could provide. (I explain the problems in Appendix A.)

Instead, each resource has a meta.security tag with a set of codes, and each request has a list of codes for the resources it can access. If the intersection of the request's and resource's codes is non-empty, the request can see the resource; otherwise, it cannot. A typical request can see a handful of codes.

This security is equivalent to adding an "and" operation to each term in the original query. If a request can access two security groups, Group1 and Group2, then Observation?code=asthma becomes Observation?code=asthma&_security=Group1,Group2. Because one resource can refer to multiple resources, using a separate parameter in chaining is not sufficient. One must use an expression like Patient?_has:Observation:subject:code-security=asthma$Group1,asthma$Group2 not Patient?_has:Observation:subject:code=asthma&_has:Observation:subject:_security=Group1,Group2. The form with the separate security parameter can return a Patient with an Observation in Group3 with code "asthma" and a non-matching Observation in Group1. The security system should have hidden this Patient resource. Fixing this requires creating a new code-security search parameter.

I'm not sure I follow the QueryRewritingInterceptor part

The QueryRewritingInterceptor, while a potential solution, comes with its own set of risks. It provides a way to rewrite the queries, as I have above, without restricting them to narrowing. However, it provides less control over the product for the Smile CDR Team and may be easier for you to implement. It's crucial to be aware that using it requires caution to avoid making the system insecure. While it's a solution I thought of, I would recommend considering other options. I think the SearchNarrowingInterceptor with the extra automatic security SearchParameter resources is the best.

Appendices

A: Inadequacy of Patient Compartments

Patient compartments are simultaneously too restrictive and too broad. They are too restrictive by omitting resources we want to include and too broad by including other resources we want to exclude. Informed consent documents signed by the study participants whose data we archive dictate our inclusion and exclusion rules.

As an example on the restrictive side, we have Observation resources based on Specimen and Encounter resources, as well as Patient resources. Any Specimen resource for a Patient is in its compartment, but the compartment excludes an Observation referencing that Specimen unless it has its subject set.

As an example of being too broad, we can consider the case of resources from an individual who participated in two studies. We need to handle four types of access: • None of the data. • The data from Study 1. • The data from Study 2. • The data from both studies.

However, the Patient compartment only allows us to handle two types of access: see none of the data or the data from both studies.

We can work around some of the issues. For example, we can pretend that the subject in two studies is two different people (and handle all the tech support from users when this messes up their analyses). However, a fundamental semantic mismatch exists between what we are doing and the compartment model.

Even if we enact sufficient workarounds to allow us to use Patient compartments, we'll need to use many for each request. Users often have access to hundreds of thousands of research subjects. Though I haven't tried it (because of the earlier issues), I am wary of adding hundreds of thousands of terms to each query.

B: If we could change the FHIR standard

Create a ResourceGroup resource. Every resource can reference one or more ResourceGroup resources. Define a ResourceGroup compartment that contains all resources that reference that ResourceGroup. Allow searches to limit to one or more compartments.

C: Frequently Suggested Solutions

When I discuss this with other developers, I frequently get the following suggestions. We are still investigating one, but the others do not meet our constraints.

C.1: ConsentInterceptor

ConsentInterceptor cannot solve the data leak problem. It allows you to edit the resources being returned after the query has finished. However, the list of resources already contains whatever information was revealed by the query. There is nothing in the list that tells the system that Patient 1 is in the list because a measurement the user may not see was above 0.5. So, the additional power to edit the list does not help. The ConsentInterceptor is applicable to removing resources or removing private fields when a user is not allowed to see the whole resource.

C.2: Sharding/Multitenancy

I address these together because they are identical from a user perspective. They differ only in implementation.

Sharding divides up the data onto different FHIR servers. Multitenancy makes it look like you have different FHIR servers, but there is only one underlying FHIR server. Each tenant or shard gets its own base URL. There are also multitenant variants where tenants have a hierarchical relationship. example.org/NIH would appear to be a server that has everything in example.org/NIH/NLM and example.org/NIH/NHGRI, but example.org/NIH/NLM and example.org/NIH/NHGRI would be unable to see or affect one another.

For brevity, I’ll call all these solutions “sharding” going forward. There are four variants of sharding.

Shard by security code. a. Without a façade b. With a façade
Shard by user/access group. a. Without a façade. b. With a façade.

The façade makes the multiple shards appear as one FHIR server from a user perspective.

Sharding works around the data leak problem by only allowing users to access servers on which they are permitted to see every resource. If you have permission to see all resources, then there is no leak.

All the sharding solutions share the problem of requiring lots of servers and databases. They also duplicate at least the public data. This large amount of data can lead to expensive updates.

C.2.1 Shard by security code

Most shards contain exactly one security code plus all public data. There is also a shard with no controlled data and only unrestricted data.

Without a façade

• Many queries: many users are approved for more than one study. So, to answer their queries they will need to send queries to each shard. • Exposing codes: This also exposes the codes to the user. Learning about the different security codes provides another barrier to entry. • No cross-code queries: No query can look at items in two different codes. • Extra burden: In the ideal case, the researcher can query all the data they’ve been given access to. Here, the minutia of access control must always be part of their considerations and their queries. • Duplicates the public data: There are about 4000 codes, so the public data will be duplicated about 4000 times. It’s small compared to the controlled data but 4000 times makes it significant. I wouldn’t be surprised if it doubles or triples the amount of storage required.

With a façade

• Longer query times: Though the user only sends one query, it still must be executed on multiple shards, so it takes longer. Depending on implementation this can scale linearly with the number of shards. • Complicated façade: To keep pagination consistent and make it appear as if there is one FHIR server, the façade will need to do complicated query rewriting, forwarding and caching and track that between user requests. • Unintuitive failure of cross-code queries: in the façade-less model, queries asking for data across shards were clearly going to fail since the study ID is part of the shard name. However, now a query asking for a Patient resource that is connected to a ResearchSubject in Study 1 and one in Study 2 is a reasonable query that will always be empty – even if the user has permission to see all data from both studies. • Extra auth checks: The shards would need to check the authorization token. And the façade would need to check it to figure out which shard to send it to. With some review from our security committee, we might be able to get this down to one check.

C.2.2 Shard by user/access group

Let an access group be a group of codes that some user can access. One access group is the group that can access only public data. Each shard contains one access group’s data plus public data.

Without a façade

• Extra burden: The user needs to know whatever arcane coding system we come up with for the access group (because of limitations in HTTP header length, it won’t be readable). And if they change products, they will need that access group instead. • Exposes access group ids: This is a detail that should be hidden from users. They should be able to focus on their problem not our sharding. • Duplicates data: Now, any time data from two studies appears together, and they also appear apart, there will be a new access group. However, an examination of access from last year indicates that there are about 4000 active access groups. And that will duplicate all the data plus the public data. As a wild guess, I expect that the controlled data will be duplicated 5x-20x. And the public data will be duplicated 4000x.

With a façade

We are still considering this solution. The main problems are the scale of the number of servers and databases needed as well as updating many objects when we change the presentation. But the façade would be simple, and the users would see exactly what they’d expect to see in all queries and the savings in development time would be substantial. • Duplicates data: Same as without a façade. • Extra auth checks: One of the slowest parts of our system is the authorization checks. The shards would need to check the authorization token. And the façade would need to check it to figure out which shard to send it to. With some review from our security committee, we might be able to get this down to one check.

mingward commented 3 months ago

@jamesagnew Hi James, want to make sure you have had a chance to look at Eric's reply. And if you have any questions.

jamesagnew commented 3 months ago

Unfortunately, I have to be honest - I don't know that I have the time to absorb that much information as a part of community HAPI FHIR support. We would absolutely consider any concrete proposals / PRs provided that they included useful test cases, but I think we'd probably need to engage on a professional services contract if we wanted to co-design a solution of the scope being proposed above.

mingward commented 3 months ago

Thank you @jamesagnew . We understand. I have another question - there is also a paid version of Smile CDR HAPI JPA server, right? Does that version have this same issue? Thank you!

jamesagnew commented 3 months ago

Hi @mingward - Yup, we do have a commercial version and it uses the same underlying security infrastructure so I would expect it to work the same way currently.

mingward commented 3 months ago

Thanks @jamesagnew for the quick response. "commercial version and it uses the same underlying security infrastructure so I would expect it to work the same way currently." Do you have a way to test what you suspected? thank you!

jamesagnew commented 3 months ago

It does indeed work the same way.

hapifhir / hapi-fhir

Potential data security issue in HAPI JPA server AuthorizationInterceptor? #6065

Appendices

A: Inadequacy of Patient Compartments

B: If we could change the FHIR standard

C: Frequently Suggested Solutions

C.1: ConsentInterceptor

C.2: Sharding/Multitenancy

C.2.1 Shard by security code

Without a façade

With a façade

C.2.2 Shard by user/access group

Without a façade

With a façade