patrick-austin commented 2 years ago

Changes

Added getReadableIds function. The logic here is the same as in getReadable, however we accept and return lists of IDs rather than entities. When dealing with the free text search results, we only have IDs. Before we were finding the bean for each ID in turn in order to run authorisation on it, then proceeding to return the ID. But as we already have the klass of the entitity and the ID from Lucene we can simply operate on these instead.

Instead of hardcoding the number of results to get from Lucene at 1000, instead use the maxIdsInQuery setting. This is the limit we would send in one batch to the DB in getReadable/getReadableIds anyway, and the default value of 500 and max for Oracle of 1000 is the same order of magnitude as what we were currently using.

Impact

I ran a number of searches against my local development components, using the lils.yml as my dummy data. This contains:	Investigations	Datasets	Datafiles
100	300	29967

I searched for * (returning all entities before authorising) as both root and the non-root icatuser, representing best and worst case scenarios (all data visible, no data visible and needing to check DB for rules) for the old and new setup with maxIdsInQuery set to 500 and 1000. All times are in ms, and is measured between the first EntityBeanManager - Got X results from Lucene and EntityBeanManager - Returning Y results logs made by icat.server.

root

Setup	Investigations	Datasets	Datafiles
Old	91	205	212
New (500)	15	11	12
New (1000)	13	12	11

Old method is comparable for DS and DF since we only loop until we reach 300 accepted results, then return. New only submits one batch of entities to check, and returns in consistently ~12ms.

non-root

Setup	Investigations	Datasets	Datafiles
Old	215	337	29261
New (500)	21	14	1502
New (1000)	15	14	690

Old method takes around 1 to 2 ms per entity checked as we need to check every ID individually before being sure none are authorised. New method takes roughly the same amount of time for Investigations and DS as these both require a single batch of IDs (less than the limit in settings. Datafiles takes around half the time when the limit is doubled to 1000, as we only need to send half as many batches (averaging 24ms per batch).

Also searched for a specific single result using `id:...` in both old and new setup as root:	Setup	Investigations	Datasets	Datafiles
Old	17	14	16
New	11	15	14

Times are comparable with old and new approaches. For the new method, the time taken to authorise 1 result and hundreds (as in the * example) was also comparable.

It's worth noting that in these tests I didn't actually have any rules in the DB to evaluate.

patrick-austin commented 2 years ago

Following discussion with @kevinphippsstfc, have:

Added explicit config option lucene.searchBlockSize instead of piggy backing maxIdsInQuery
Refactored duplicated code in GateKeeper
Added HasEntityId interface so that the getReadableIds function can run on either EntityBaseBean or ScoredEntityBaseBean without needing to loop to get their id
Changed logic in filterReadAccess so we add all newResults if we have no restrictions, and only iterate and check against the id set if restrictions were present. Should also now throw if we're over maxEntities.

This shouldn't change the performance of existing usage of getReadable too much, as we do not introduce any additional loops, only refactoring some code and adding the if (restrictions == null) check to enable early returns of all beans. Further improvements or changes for clarity are welcome.

patrick-austin commented 2 years ago

Overall, this looks great, thanks! I've just requested a few additions to the comments to hopefully help future developers.

I'm wondering how best to test this given that I don't think the part of the code using "restrictions" will get tested much (if at all?) by the unit and integration tests. We could do with testing it on an ICAT containing some of the usual InvestigationUser and InstrumentScientist rules/restrictions.

Have added/expanded the documentation of the new/modified functions, if anything still needs clarifying just let me know.

patrick-austin commented 1 year ago

Have expanded TestRS to ensure it covers these changes. These changes will impact:

All Lucene searches
Get and Search requests
MD exports

Note that the last two are only affected if an INCLUDE query for a collection is used. For example, Facility INCLUDE InvestigationType will run the changed code as there are multiple types at a facility but Investigation INCLUDE InvestigationType will not, as there is only one per investigation.

Some implicit testing was taking place, as if the authorization were overly restrictive (i.e. if getReadableIds is hacked to return no ids) then the following tests would already fail:

TestRS.testGet
TestRS.testLuceneDatafiles
TestRS.testLuceneDatasets
TestRS.testLuceneInvestigations
TestRS.testSearch

along with some tests in TestWS (as the functionality in question is in Gatekeeper, I only changed TestRS as part of this PR).

I have expanded these tests to also assert that the tests are not too permissive, by performing searches as the piOne user who cannot see anything unless new rules are set.

The existing exportMetaDataQuery is marked with @Ignore, so I created a new test which performs both "permissive" and "restrictive" checks on the inclusion of related entities in the export dump.

Also refactored code to set up root and piOne sessions, and some automatic formatting of the TestRS file. Note that this branch does not have fixes for the unlrelated test failures: #275 #268

icatproject / icat.server

Add getReadableIds and use in filterReadAccess #277 #278

Changes

Impact

root

non-root