silverstripe / silverstripe-fulltextsearch

Adds external full text search engine support to Silverstripe
BSD 3-Clause "New" or "Revised" License
44 stars 83 forks source link

Don't index records where canView=false for anonymous users #252

Closed chillu closed 4 years ago

chillu commented 5 years ago

The module denies access to a search result if you can’t view the underlying record: https://github.com/silverstripe/silverstripe-fulltextsearch/blob/f89818909a0d0c08f7d8f05217eb62e32988aa20/src/Solr/Forms/SearchForm.php#L89. It does however index all records, incl. ones with canView=false. The canView check is implemented in SearchForm->getResults(), which is a commonly subclassed method - so still a residual risk that devs will skip that check.

I think we should default to not indexing records where canView() returns false for anonymous users. You could overwrite this behaviour, e.g. to include silverstripe/secureassets in 3.x, or protected/private files in 4.x. But that would be a conscious decision, with the module staying secure by default.

We'd need to do this in a backwards compatible way (not breaking existing searches which rely on the absence of these checks). Although you could argue that those implementations rely on a bug?

When releasing this, we should strongly recommend to search users that they run a reindex to prune any records with canView=false in their current search state.

ACs

Removed ACs

Subtasks

PRs

robbieaverill commented 5 years ago

Yeah I think it'd make sense to treat this as a bug and fix it in a patch release to do as you've suggested.

chillu commented 5 years ago

Looking at https://github.com/silverstripe/silverstripe-fulltextsearch/blob/master/tests/SearchVariantVersionedTest.php#L89 and https://github.com/silverstripe/silverstripe-fulltextsearch/blob/master/tests/SolrIndexVersionedTest.php#L167, it already deletes the "live" variant on unpublish, and the "draft" variant on a "delete/archive".

And I've confirmed that an index operation is triggered on publish (https://github.com/silverstripe/silverstripe-fulltextsearch/blob/master/docs/en/03_configuration.md#publish-a-page-in-the-cms):

Every change, addition or removal of an indexed class instance triggers an index update through a "processor" object. The update is transparently handled through inspecting every executed database query and checking which database tables are involved in it.

Publication is only one way to change the return value of canView() though. Other common ways are to limit permissions through InheritedPermissions, which by default works on files and pages in 4.x. Since this is based on inheritance, the records affected can be nested, and don't receive a direct write event. Then there's custom canView() implementations, where any line of PHP code changed technically invalidates the existing index state - it's not deterministic.

I think we have to be careful with claiming that records protected through canView aren't indexed - it's only part of a layered security model, with actual canView checks enforced on search operations as the main defence. For draft content, we can cover most use cases though.

In terms of transitioning existing projects, we could change this default, but it wouldn't fix existing indexes until every indexed record is written again, giving the indexer an opportunity to remove the draft stage. We could recommend a reindex, which auto-cleans existing records. But my understanding from https://github.com/silverstripe/silverstripe-fulltextsearch/blob/master/src/Solr/Reindex/Handlers/SolrReindexBase.php is that it'll clear out all records before reindexing. Since a reindex can sometimes take days, that'll impact search availability on production sites. While that's an unrelated limitation of the module, it makes a reindex recommendation not entirely realistic. As an alternative, we could implement a task to clear out all records across certain variant states. dev/tasks/Solr_Reindex variantstate=["Draft"] might work for this, but it's unclear how different variant states interact, e.g. subsites.

Firesphere commented 5 years ago

This is not an "easy" task.

As search goes, it needs to be aware of the current member permissions to view/edit/create/delete. For the frontend, this would only include view.

This means, a new stored multivalued field needs to be added, adding a canView for every member and group. The granular access that SilverStripe provides can not be taken to such a broad coverage.

The solution to this would be either store in the index, or keep the current validation of canView after searching.

This poses a problem. A Member with full access, would see all results, even if the results are limited to 1000 records from the database. However, considering the case where there are 1 million rows, but the current user can only see the last 2 of the million rows, a search at the moment returns endless empty pages, or no results at all (due to the artificial limit of 100). Thus, a user that would be allowed to see two search results, will see none due to the limit.

There is no clear cut solution here, but there is definitely no easy path either.

canView is not a reason to exclude items from being indexed, especially since an indexing runs as root/admin and therefore, canView would return true.

The current check does work, but is not ideal. A more ideal solution is to include canView in the index, as an array of items (multivalued), where a filtered query can tell Solr to only include the results that have either currentMember:canView-True or currentMemberGroup:canView-True set.

On large databases, this is such a heavy operation, that it is not an option.

Alternatively, the module can check each returned item, and if the amount of items is less than the amount of viewable items for the current user, execute the query again with a new offset and append the items the user can view. This is a memory heavy operation however, and also not ideal either.

robbieaverill commented 5 years ago

The effort/easy label was added according to this statement, which I agree with:

I think we should default to not indexing records where canView() returns false for anonymous users.

Without synchronising ACLs between the application and search engine itself there's no easy way to correctly limit by permissions from the backend, so we need to rely on the frontend to do it instead. This will mean that pagination could easily be inconsistent, but something we need to accept for now.

Edit: to be fair, synchronising ACLs isn't really a feasible solution anyway, since SilverStripe's permission checking system is designed to be dynamic. I could write canView() { return rand(1, 10) > 5; } if I wanted to.

chillu commented 5 years ago

@Firesphere I think you've got the wrong end of the stick here. I'm not proposing that we're implementing permission control inside the search index, just that we don't index records where canView=false for anonymous users. If we implement this, you still might not have access to all records in the search index, depending on your permissions. That's fine, those checks should be done through canView() in the search results presentation (current behaviour).

The big assumption here is that canView() with an anonymous user is deterministic, or at least based on data directly on the record. Since a reindex check is performed on every save operation, any change to record data influencing canView() will be accounted for. If your canView() is hooked up to something external, it would usually check a specific member record (e.g. LDAP group membership), and otherwise return false. Can you think of a case where canView=false for anonymous users is not deterministic? Even if that's the case, the canView() check in search results presentation is the second (and main) line of defense.

The aim of this card is to have less data in the index that likely shouldn't be there, and reduce the impact of other defenses failing (e.g. a custom search results presentation missing a canView() check). The aim is not to create a perfect representation of access permissions within the search index data itself.

If you want permission controlled data in the index (where canView=false for anonymous users), then you can opt in to that, and consider the risks and mitigations you need to put in place for your specific project. For example, that might be common for intranets on public networks (domain is accessible, but all content is permission checked).

chillu commented 4 years ago

Just to be clear, this should apply to every user of the fulltextsearch module, but in particular on the CWP customisations on top of the baseline.

chillu commented 4 years ago

Suggested ACs:

chillu commented 4 years ago

Notes on priorities for this implementation in https://github.com/silverstripe/silverstripe-fulltextsearch/issues/257#issuecomment-605717864

emteknetnz commented 4 years ago

Notes about this AC "InheritedPermissions is leveraged to reduce performance impact"

I'm assuming it's referring to this https://github.com/silverstripe/silverstripe-framework/blob/4/src/Security/InheritedPermissions.php#L227

My understanding of this functionality is that it does large efficient SQL queries instead of multiple smaller ones, and importantly it bypasses any custom code subclass canView() checks e.g. function canView() { return $this->ID % 2; }

The motivation for wanting to use this function is performance at the expense of accuracy

I wouldn't feel comfortable using InheritedPermissions at both inlet (solr_reindex) AND outlet (search-results) filtering, since you'd need a 'real' canView() check in at least one of those places, so we'd need to choose which one:

Here's a couple of examples on relying on db-only InheritedPermissions "getting it wrong"

UX issue - page is not index when it should be: Page has some group permissions on it MyPageType .. canView() { return true; }

Security issue - page is indexed when it should not be: Page has no permissions (inherit) MyPageType .. canView() { return $member->inGroup(‘some-group’) || $member->isAnAdmin() }

^ I've seen plenty of instances of websites with canView() .. $member->inGroup() type of checks before

Here's the implications of putting InheritedPermissions on each filtering stage

Inlet filtering

Outlet filtering

@chillu

Seems as though InheritedPermissions on outlet filtering isn't advisable. This leaves us with a couple of options: a) implement it on inlet filtering, get a faster solr_reindex, accept that some search results that should not be indexed may still be indexed b) not implement it at all, get a slower solr_reindex, fully meet original goal of this issue 'don't index records where canView=false for anonymous users"

My vote is on b) ... let me know what your thoughts are

Firesphere commented 4 years ago

Are you... still... discussing... this?....?

/me, off

chillu commented 4 years ago

Yep I think b) is the way go to, thanks for working through this Steve!

emteknetnz commented 4 years ago

Compare links for recent squash pushes for branches: silverstripe/fulltextsearch https://github.com/silverstripe/silverstripe-fulltextsearch/compare/ffbb13f6808fea402dea978d3c11198a3b278e7f..cfe937fbd1c4103ed934051ab20d56f7467445ee cwp/cwp https://github.com/silverstripe/cwp/compare/6942aea32825c7a1daa8b64b2702ee4788a97326..ebcad8924d42bb825c73dffe68ab2a663309004c

dnsl48 commented 4 years ago

silverstripe/fulltextsearch https://github.com/silverstripe/silverstripe-fulltextsearch/compare/ffbb13f6808fea402dea978d3c11198a3b278e7f..cfe937fbd1c4103ed934051ab20d56f7467445ee

Codewise the patch looks good to me. I didn't test it yet though.

brynwhyman commented 4 years ago

Received a message from @dnsl48 internally. We're happy with the code changes and unit test coverage, but are yet to do a final-final test. We'll look to tag an RC for this module, to be included in the CWP release candidate while we close out the testing. Any critical issues found during testing will be resolved prior to the stable tag being released for this module.

Cheddam commented 4 years ago

Final internal testing has been completed against 3.7.0-rc1 with the CWP Solr index.