hapifhir / hapi-fhir

🔥 HAPI FHIR - Java API for HL7 FHIR Clients and Servers
http://hapifhir.io
Apache License 2.0

Pagination model not scalable #536

Closed RJFoster closed 4 years ago

RJFoster commented 7 years ago

The pagination model used by HAPI does not seem likely to scale well in real-life situations. The paging model appears to require that information about all matching resources be loaded and stored somehow when the search is first run, with subsequent paging requests returning resources from that original stored set. This places unnecessary load on the server when the number of search results is large, especially when patient privacy/consent/etc. restrictions are involved.

In the simple high-volume case, a search over something like Observation could easily match many thousands of resources; it seems an unnecessary load on the server to retrieve and store details (IDs) of all those resources on the off chance that the client may want to access further pages.

In the privacy-restrictions case, where some resources in a search result set should not be returned to the requesting user, the privacy restrictions need to be checked against all results (expensive!) at search time in order for IBundleProvider.size() to return an accurate value. If size() returns an inaccurate value, the bundle total reported to the client will be wrong and the next links may be missing. It may also leak to the client that privacy filtering has occurred, since the total and the actual number of resources retrieved will not match.

Could future versions of HAPI provide a more customizable pagination model than the current one?

RJFoster commented 7 years ago

Also, the FHIR spec says that caching of previous pages should not be required, yet HAPI's model does require it; see https://www.hl7.org/fhir/search.html#2.1.1.5.6:

2.1.1.5.6 Paging
When returning paged results for a search with _include resources, all _include resources that are related to the primary resources returned for the page SHOULD also be returned as part of that same page, even if some of those resource instances have previously been returned on previous pages. This approach allows both sender and receiver to avoid caching results of other pages.

storni commented 6 years ago

Hi @RJFoster, did you find a workaround/improvement for this?

BTW, I've found a way to improve the search result intermediate storage (#866).

jamesagnew commented 6 years ago

I'm not sure I understand the issue.

The IBundleProvider interface does not mandate that implementations store the entire search result. Certainly the default implementation does, but technically all an implementation needs to do is provide a UUID that uniquely identifies the search and can later be used to fetch more results via the IPagingProvider.

If you wanted to achieve something more scalable than the FifoMemoryPagingProvider you would need to design a system that remembers the search criteria for a specific search and can recall them by UUID. This is exactly how the DatabaseBackedPagingProvider in the JPA server works.
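To make the "remember the criteria, not the results" approach concrete, here is a minimal, self-contained sketch of the idea: the search stores only its criteria under a UUID, and each page is produced by re-running the query. The class and method names are illustrative only, not HAPI's actual IBundleProvider/IPagingProvider API.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

/**
 * Illustrative sketch (not HAPI's API): store search criteria by UUID
 * and re-execute the query per page, so storage is O(1) per search
 * regardless of how many resources match.
 */
public class CriteriaPagingRegistry<C, R> {
    private final Map<String, C> criteriaById = new ConcurrentHashMap<>();
    // (criteria, [fromIndex, toIndex]) -> one page of results
    private final BiFunction<C, int[], List<R>> pageQuery;

    public CriteriaPagingRegistry(BiFunction<C, int[], List<R>> pageQuery) {
        this.pageQuery = pageQuery;
    }

    /** Register a search; only the criteria are retained, never the result set. */
    public String register(C criteria) {
        String uuid = UUID.randomUUID().toString();
        criteriaById.put(uuid, criteria);
        return uuid;
    }

    /** Fetch one page by re-executing the stored criteria. */
    public List<R> getPage(String uuid, int from, int to) {
        C criteria = criteriaById.get(uuid);
        if (criteria == null) {
            throw new IllegalArgumentException("Unknown or expired search: " + uuid);
        }
        return pageQuery.apply(criteria, new int[]{from, to});
    }
}
```

In a real server the pageQuery would typically be a database query with LIMIT/OFFSET, which is essentially the design the DatabaseBackedPagingProvider comment above describes.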

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sqshq commented 4 years ago

all the implementation needs to do is be able to provide a UUID that uniquely identifies the search

@jamesagnew that makes requests stateful, which generally means you need additional infrastructure (e.g. a distributed cache) to store the request-id -> request-params mapping in a multi-node environment, or have to use sticky sessions, etc.

It complicates the implementation, because for simple cases it is fine to fetch each page from the database directly. That is mentioned in the FHIR paging documentation.

Simply carrying the initial search parameters in the prev/next links would allow that, and a UUID becomes unnecessary in that case.
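The stateless alternative being suggested can be sketched as follows: the server encodes the original search parameters plus an offset into the next link itself, so any node can serve the next page without shared state. The parameter names (_offset, _count) follow FHIR's offset-paging convention; everything else here is illustrative (and URL-encoding of values is omitted for brevity).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Sketch of a stateless "next" link: the original search parameters
 * are carried in the URL along with an advanced offset, so no
 * server-side search registry or sticky session is needed.
 */
public class StatelessNextLink {
    public static String nextLink(String base, Map<String, String> searchParams,
                                  int offset, int count) {
        Map<String, String> params = new LinkedHashMap<>(searchParams);
        params.put("_offset", Integer.toString(offset + count));
        params.put("_count", Integer.toString(count));
        return base + "?" + params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
    }
}
```

The trade-off, as noted elsewhere in this thread, is that offset-based pages are not stable if the underlying data changes between requests.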

jamesagnew commented 2 years ago

Note to anyone reading this ticket: if 100% stable search results are not a requirement in your specific deployment, you do not need to rely on the query cache. Thanks to #2059 and related tickets, the ForceOffsetSearchModeInterceptor can be used to force stateless searches everywhere. This greatly improves search performance, at the cost of losing the default behaviour's guarantee that results remain stable even over multiple hours of paging activity.
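For readers looking for the concrete switch: a minimal configuration sketch of registering that interceptor on a server. The surrounding server setup is an assumption for illustration; only ForceOffsetSearchModeInterceptor itself comes from the comment above.

```java
// Configuration sketch (illustrative setup): force stateless,
// offset-based searches server-wide so no query-cache entry is
// created per search.
import ca.uhn.fhir.jpa.interceptor.ForceOffsetSearchModeInterceptor;
import ca.uhn.fhir.rest.server.RestfulServer;

public class PagingConfig {
    void configure(RestfulServer server) {
        // Trade-off: pages are fetched directly from the database by
        // offset, so results may shift if data changes mid-paging.
        server.registerInterceptor(new ForceOffsetSearchModeInterceptor());
    }
}
```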