silverstripe / silverstripe-fulltextsearch

Adds external full text search engine support to Silverstripe
BSD 3-Clause "New" or "Revised" License
44 stars 83 forks source link

Don't index records with ShowInSearch=false #257

Closed chillu closed 4 years ago

chillu commented 5 years ago

The module makes no assumption about what criteria might be applied for records to be added to the index. I'm proposing elsewhere that we filter out records with anonymous canView=false.

In addition, I think we should add behavior to prevent indexing of a record based on custom, model specific flags like ShowInSearch. Currently, the docs mention this usage pattern (e.g. https://www.cwp.govt.nz/developer-docs/en/2/features/solr_search/default_search_index/), and recommend adding this as a filter field. You're still in charge of filtering on that field by default. The CWPSearchEngine performs this filter out of the box, but only on SiteTree and File (see pull request). If you're defining your own index on other objects, or include other objects in that index, you need to create your own search engine - the filter field is indexed, but it's ignored during searches. The docs for searching dataobjects fail to mention that as well. This illustrates why we shouldn't have those records in the index in the first place.

This has some of the same rollout issues as filter out records with anonymous canView=false, we'd need to advise devs on how to get those existing index entries with ShowInSearch=false out of their index.

When releasing this, we should strongly recommend to search users that they run a reindex to prune any records with ShowInSearch=0 in their current search state.

Another thing to check: when we change ShowInSearch from 1 to 0 and click save/publish, does it get removed from the index straight away?

ACs

Note:

We still have to index the ShowInSearch property, otherwise search request filtering on existing implementations will fail

PRs:

https://github.com/silverstripe/silverstripe-fulltextsearch/pull/275 https://github.com/silverstripe/cwp/pull/260 https://github.com/silverstripe/cwp-search/pull/31

Subtasks

Links

https://silverstripeltd.slack.com/archives/CLXKD9X51/p1584670614347800

emteknetnz commented 4 years ago

Ensuring that ShowInSearch is a filter field that is widely available

Update:


(note: this is not the same as ensuring that ShowInSearch=0 record never get into the solr index)

Inheritance is:

class CwpSolrIndex extends CwpSearchIndex
abstract class CwpSearchIndex extends SolrIndex
abstract class SolrIndex extends SearchIndex
absract class SearchIndex extends ViewableData

CwpSolrIndex->init() { ... $this->addFilterField('ShowInSearch'); }

We could write the follow code to ensure this behaviour happens on across the board

SearchIndex->__construct() { ... $this->addFilterField('ShowInSearch'); }
emteknetnz commented 4 years ago

Creating a new page is the CMS and saving it will commit it to solr index immediately. You don't even get given the chance to go to the Settings tab and untick Show In Search?" as you'll get a browser warning that you're navigating away from a form without saving it.

In the :8983/solr index it will have

        "SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": true,
        "_versionedstage": "Stage",

When go to Settings and untick "Show in Search?' i get

        "SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
        "_versionedstage": "Stage",

When I then click publish I get now get a 2nd entry

1.
        "SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
        "_versionedstage": "Stage",
2. (new entry)
        "SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
        "_versionedstage": "Live",
emteknetnz commented 4 years ago

Excluding DataObject with ShowInSearch = false from every getting into the index:

SearchUpdateProcessor:86 -

// Ensure that indexes for all new / updated objects are included
$objs = DataObject::get($base)->byIDs(array_keys($ids));
foreach ($objs as $obj) {
    foreach ($ids[$obj->ID] as $index) {
        if (!$indexes[$index]->variantStateExcluded($state)) {
// new-code-start
            if (isset($obj->ShowInSearch) && !$obj->ShowInSearch && $obj->ID) {
                $indexes[$index]->delete($base, $obj->ID, $state);
                $indexes[$index]->commit();
                continue;
            }
// new-code-end
            $indexes[$index]->add($obj);
            $dirtyIndexes[$index] = $indexes[$index];
        }
    }
    unset($ids[$obj->ID]);
}

Stacktrace for updating solr index after click 'Save' in the CMS

Solr4Service_Core.php:35, SilverStripe\FullTextSearch\Solr\Services\Solr4Service_Core->addDocument()
SolrIndex.php:591, CWP\Search\Solr\CwpSolrIndex->_addAs()
SolrIndex.php:608, CWP\Search\Solr\CwpSolrIndex->add()
SearchUpdateProcessor.php:90, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->prepareIndexes()
SearchUpdateProcessor.php:142, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->process()
SearchUpdateImmediateProcessor.php:9, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->triggerProcessing()
SearchUpdater.php:188, SilverStripe\FullTextSearch\Search\Updaters\SearchUpdater::flush_dirty_indexes()
chillu commented 4 years ago

Yes, I'm not interested in solutions which index records that have ShowInSearch=false and then filter them out somehow. That's a bug, we need to fix it.

chillu commented 4 years ago

Excluding DataObject with ShowInSearch = false from every getting into the index:

The code you added there seems to be aimed at deleting records with ShowInSearch=0 after they've been added already. I think there's two cases to cover: 1. A reindex for this object type should remove those records automatically (clean up after bad defaults), 2. new objects with ShowInSearch=0 shouldn't get into the index in the first place.

Also, careful with doing commit() in a loop. The only reason that doesn't wreck search servers is that we actually ignore those commands in CWP, but on SC that's a bad idea usually (read up on autoSoftCommit, it's a bit of a rabbit hole)

chillu commented 4 years ago

Proposed ACs:

Note:

emteknetnz commented 4 years ago

@chillu I'm believe we do actually need to delete the individual record from the index whenever it is updated from ShowInSearch=1 to ShowInSearch=0. We cannot rely solely on developers running a reindex

If we do not delete it when it is updated, the following scenario happens:

So we haven't really solved much unless we delete records from the solrindex whenever ShowInSearch is set to 0

Does this make sense?

chillu commented 4 years ago

The ShowInSearch value is inspected every time a record is saved (non-versioned) or published (versioned), and the index entry is updated accordingly

Sorry, what I meant by this is "that's reflected by adding to or removing from the index"

emteknetnz commented 4 years ago

getShowInSearch() on inlet works as expected, the magic was well crafted back in the day

app/src/Page.php

public function getShowInSearch()
{
  // behaves as expected, "overrides" the SiteTree.ShowInSearch DB column
  // we can safely assume this will work in the absense of a <dataobject>.ShowInSearch column
}

public function getABC()
{
  return false;
}

Debugging within SearchUpdateProcessor:

isset($obj->ABC); // true
$obj->ABC; // false
isset($obj->ABCD); // false
$obj->ABCD; // null
emteknetnz commented 4 years ago

Seems like "outlet" filtering on the solr query isn't easily viable. Reason being that the filtering on the 'ShowInSearch' column requires passing in the full namespace of the column

In the solr index, it stored as

"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false

To filter this, I must add an 'fq' value of 'SilverStripe\CMS\Model\SiteTree_ShowInSearch:false' -- I cannot simply filter using a partial field name e.g. 'ShowInSearch:false' or '_ShowInSearch:false'

Using the search box on the simple theme, the queries passed to solr are pretty generic:

params={
    spellcheck=true&
    start=0&
    q=%2Bmysearchterm&
    spellcheck.dictionary=_spellcheck&
    json.nl=map&wt=json&
    spellcheck.collate=true&
    fq=%2B(_versionedstage:"Live"+(*:*+-_versionedstage:[*+TO+*]))&
    rows=10
}

To filter ShowInSearch on the solr query, we'd need to know all the relevant DataObjects ahead of time. This just does not seems like a good approach so I'll look to do this elsewhere.

emteknetnz commented 4 years ago

This seems like a more viable place to do outlet filtering:

SolrIndex.php:769

        $res = $service->search(
            $q ? implode(' ', $q) : '*:*',
            $offset,
            $limit,
            $params,
            \Apache_Solr_Service::METHOD_POST
        );

        $results = new ArrayList();
        if ($res->getHttpStatus() >= 200 && $res->getHttpStatus() < 300) {
            foreach ($res->response->docs as $doc) {
                $result = DataObject::get_by_id($doc->ClassName, $doc->ID);
                if ($result) {
// new-code-start
                    if (isset($result->ShowInSearch) && !$result->ShowInSearch) {
                        continue;
                    }
// new-code-end
                    $results->push($result);

this also has the added benefit of calling ->getShowInSearch()

The only drawback with this approach is that it's on SolrIndex.php instead its parent class SearchIndex.php, though as previously discussed, we're assuming fulltextsearch implementations to be close to 100% solr4. I don't feel as though there's a justification to make the effort to get this on to SearchIndex.php

Just to clarify what's being called and to show that we can't get any "lower"

SolrIndex.php:697

public function search(...)
{
  // $service == Solr4Service_Core which is a fulltextsearch class
  $service = $this->getService(); 
  ...

  // this calls Apache_Solr_Service->search(), which is 3rd party module class and is
  // the parent class of Solr4Service_Core. 
  // At this point i'll make a request to the solr server
  $service->search(...);  
}
emteknetnz commented 4 years ago

Add a warning label about creating non-deterministic getShowInSearch() implementations

emteknetnz commented 4 years ago

Outlet filtering for Solr_Reindex (different path from cms->save())

The following go through the same code path:

SolrReindexBase.php:266

protected function getRecordsInGroup(...
...
        $item = ... get some dataobjects
...
// new-code-start
        // ShowInSearch filter
        // we cannot use $items->remove($item), as that deletes the record from the db
        $ids = $items->sort('ID')->column('ID');
        $ids = array_fill_keys($ids, true);
        foreach ($items as $item) {
            if (isset($item->ShowInSearch) && !$item->ShowInSearch) {
                unset($ids[$item->ID]);
            }
        }
        $items = DataList::create($class)->filter(['ID' => array_keys($ids)]);
// new-code-end
emteknetnz commented 4 years ago

Using pages created using frameworktest with there's no negative impact of code in the pull-request on index time for doing a solr_reindex on my local (vagrant). It actually runs a bit faster.

fulltextsearch 3.6 2m 40s

fulltextsearch 3.6 branch 2m 25s

chillu commented 4 years ago

Yeah I wouldn't have expected a performance impact, since it already has that field in memory. The performance impact will be more pronounced with canView

chillu commented 4 years ago

Had a good call with Steve, here's some notes.

Principles (descending order of priority):

  1. Minimise the possibility of new search implementations with security issues
  2. Minimise additional complexity for devs to understand on top of already complex search API
  3. Focus documentation on safe highlevel APIs, make it clear when "advanced" low-level APIs are referenced (and clarify dev responsibilities)
  4. Clearly communicate changes and sanity checks to existing search implementations
  5. Minimise performance impact on indexing
  6. Minimise performance impact on search

Observations: Steve start:

Steve end

Ingo's opinion:

emteknetnz commented 4 years ago

Have added two cucumber studio tests for ShowInSearch filtering https://studio.cucumber.io/projects/161118/test-plan/folders/1193287