SearchableService canView()

emteknetnz commented 4 years ago

Issue: https://github.com/silverstripe/silverstripe-fulltextsearch/issues/252

Adds a canView() check on everything going in to a solr index and on most implementations will also check it on the way out too

This is PR no.2 for this issue, PR no.1 (merged) added a ShowInSearch check https://github.com/silverstripe/silverstripe-fulltextsearch/pull/275

This PR in a continuation of the work previously done in this PR https://github.com/silverstripe/silverstripe-fulltextsearch/pull/276

chillu commented 4 years ago

A final note: Shipping this change is going to produce manual work for every project that updates to fulltextsearch 3.7. The goal this PR strives for, of putting out any existing underground fires in sites, is likely worthy of this effort, but we need to be clear that this is an opt-out behaviour change in a minor release, and that those have burned projects before.

How do we ensure this important reminder finds its way into the highlevel CWP release announcements? So that's both the changelog and the blog post.

emteknetnz commented 4 years ago

A final note: Shipping this change is going to produce manual work for every project that updates to fulltextsearch 3.7. The goal this PR strives for, of putting out any existing underground fires in sites, is likely worthy of this effort, but we need to be clear that this is an opt-out behaviour change in a minor release, and that those have burned projects before.

How do we ensure this important reminder finds its way into the highlevel CWP release announcements? So that's both the changelog and the blog post.

@chillu - we've added a todo item to our future selves on the CWP 2.6. release issue "Work out an effective way to tell cwp users to 'run solr_reindex after composer update'"

emteknetnz commented 4 years ago

@chillu

Lookin' good, left a few comments.

Cheers!

Where did we land on with avoiding duplicate canView() checks, either because they're built into a different layer in CWP, or because they're in custom code? We should advise devs on how to avoid those.

SearchableService::cache is a memory-only cache specifically for that. The duplicate canView() don't happen when indexing (SearchableService->isIndexable($obj)), though they can easily happen when retrieving search results (SearchableService->isViewable($obj)) due to multiple search classes and flexibility in implementing those

I think the lowlevel functions (SolrService->addDocument()` etc) should have a warning that they won't check if a record is indexable (not using the service)

I don't think we need to do this. There are two indexing paths which cover all the indexing scenarios:

cms-save/$obj->write() - goes through SearchUpdateProcessor.php
solr_reindex - goes through - goes through SolrReindexBase.php

We do a SearchableService->isIndexable($obj) call in both those places

We could put a comment above ->addDocument saying that indexable happens in the SearchableService, though it's kind of out of context, by the time you're at the ->addDocument bit you're out of DataObject land and into XML? land I think, or whatever is used to communicate with solr

emteknetnz commented 4 years ago

I think we need to be clear what classes are low level (some assembly required) vs. high level (secure by default). We point out e.g. on DataList where you're expected to pass raw SQL and where it's sanitised. This might be a class-level rather than method-level comment, but in general the more clarity the better when it comes to security comms towards devs.

@chillu

I think we're now good here, most developers shouldn't need to change anything in their implementation after upgrading to 3.7 as it's pretty much secure by default now. Probably the main thing is to encourage people to run SolrReindex to purge old records, though sites are no worse of then they are now if they don't.

Inlet filtering is considered the primary layer of defense, and now anything that goes through SolrReindex or DataObject->write() will now be filtered.

Outlet filtering is considered the secondary layer of defense, and I think the majority of implementations will be filtered here out of the box. For those that are not, we have put a note in the readme "If your implementation is very custom and does not subclass nor make use of either SolrIndex or SearchForm, then it's recommended you update your implementation to call SearchableService::isViewable()."

emteknetnz commented 4 years ago

@chillu do we want to release this work as a minor or a new major? Rational for a new major would be that search results functionality may change (less results) after running composer update. Rational for keeping it minor is "more security for more sites"

chillu commented 4 years ago

I think we're now good here

OK, I don't understand the resistance to adding a sentence worth of phpdoc to a class, but I'm not going to labour that point, your call :)

do we want to release this work as a minor or a new major?

Ah that old chestnut. We absolutely need to get this into a minor release of the CWP recipe (via https://github.com/silverstripe/cwp-recipe-search). Otherwise the entire effort is worthless, since the point is to make current users of CWP 2.x recipe more secure by default.

On a module level, I'm in favour of a minor release. From what I can see, the "worst case scenario" for a site upgrading is that authorised users can't preview draft content in search results any more? I think that's very rare, and an acceptable change which we can explicitly call out in the upgrade guide. Developers can opt back into this behaviour if they feel it's the right risk/benefit tradeoff in their particular context.

I guess there might be sites which are filtering out draft content already, purposely or accidentally don't check canView on search results in their custom implementation, and expect those results to continue being included. Which seems like an extremely odd and dangerous way of implementing this business logic (compared to respecting canView and adjusting your implementation of that accordingly).

Unless somebody can come up with a worse case of potential regressions, I'd say this is a minor for security reasons. We've worked hard on this PR to ensure it's as backwards compatible and non-intrusive as me can make it, and provide extensive and clear upgrade instructions.

brynwhyman commented 4 years ago

@chillu I think another scenario worth considering here is existing sites who purposefully want search results to filter based on a permission check of the logged-in user, e.g. an intranet project has a 'Human Resources' parent page locked down to specific users who want to see those pages turn up in a search result.

I'd hope that we can make the possible regressions clear in the documentation and agree that this should go out as a minor. It's worth noting that this work was originally prioritised and planned to be developed as a 'type/bug' and security flaw in the module, not an enhancement.

emteknetnz commented 4 years ago

@brynwhyman there is two options to override the default behaviour of filtering on an anonymous member canView():

updateIsSearchable extension hook to add custom logic. You can do basically whatever you like here.
indexing_canview_exclude_classes configurable list of DataObjects to exclude from the canView() check on index. canView() will still be called when viewing results. If you add DataObject to this list then the search behaviour will be similar to how fulltextsearch worked previously

brynwhyman commented 4 years ago

I'd hope that we can make the possible regressions clear in the documentation

I just had another look at the PR... You can't miss this in the readme!

emteknetnz commented 4 years ago

Just smoke tested this to regain context since it's been a while (not really proper testing). It's mostly good, though seems there's one odd behaviour that we may want to correct

Good:

Draft-only (unpublished) content is not indexed
Pages with permissions (logged-in-users or above) is not indexed
Subsites seems to work as expected

Weird:

(unexpected) when you publish a Page, there are 2 records in the index, on for the draft variant, another for the live variant. They have identical content. Adding $this->excludeVariantState([SearchVariantVersioned::class => Versioned::DRAFT]); to the custom code MySolrIndex::init() resolved
(expected) when you modify this Page and click save (but not publish), the draft-variant is deleted, so there is only a live variant, which correctly has the content on the live site

It's a bit weird, seems like when you publish a record there should be no draft-variant by default

chillu commented 4 years ago

when you publish a Page, there are 2 records in the index, on for the draft variant, another for the live variant. They have identical content. Adding $this->excludeVariantState([SearchVariantVersioned::class => Versioned::DRAFT]); to the custom code MySolrIndex::init() resolved

So we're effectively doing twice the indexing work and using twice the index data due to this, unless devs remember to exclude this variant which we tell them is already excluded? That's not great. Also, the index entries basically "lie" about their contents now - it claims to be draft content, but actually contains published content. I think we need to resolve this without relying on customisations in a specific project. Good catch!

brynwhyman commented 4 years ago

It's a bit weird, seems like when you publish a record there should be no draft-variant by default

So the unexpected behaviour is due to a core issue?

Adding $this->excludeVariantState([SearchVariantVersioned::class => Versioned::DRAFT]); to the custom code MySolrIndex::init() resolved

Is this the clear approach to take for now? Ideally that unexpected behaviour would be resolved as part of this open PR so we can get this in the upcoming release in a good state.

chillu commented 4 years ago

So the unexpected behaviour is due to a core issue?

No, because the module still has the concept of "variants" (versioned, subsites), and I think the that variant state hasn't been excluded. So it's very much part of resolving the issue here. Adding custom code to the project Steve is testing on is no solution, it's a workaround that won't apply for everyone else. A potential "fix" could be to document this step as a customisation that everyone needs to apply to their own indexes (through changelogs), but I'd prefer for us to find a way that does this without requiring changes in every project.

silverstripe / silverstripe-fulltextsearch

SearchableService canView() #281