Closed chillu closed 4 years ago
Ensuring that ShowInSearch is a filter field that is widely available
Update:
$query->addExclude(SiteTree::class . '_ShowInSearch', 0);
on all relevant DataObjects(note: this is not the same as ensuring that ShowInSearch=0 record never get into the solr index)
Inheritance is:
class CwpSolrIndex extends CwpSearchIndex
abstract class CwpSearchIndex extends SolrIndex
abstract class SolrIndex extends SearchIndex
absract class SearchIndex extends ViewableData
CwpSolrIndex->init() { ... $this->addFilterField('ShowInSearch'); }
We could write the follow code to ensure this behaviour happens on across the board
SearchIndex->__construct() { ... $this->addFilterField('ShowInSearch'); }
Creating a new page is the CMS and saving it will commit it to solr index immediately. You don't even get given the chance to go to the Settings tab and untick Show In Search?" as you'll get a browser warning that you're navigating away from a form without saving it.
In the :8983/solr index it will have
"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": true,
"_versionedstage": "Stage",
When go to Settings and untick "Show in Search?' i get
"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
"_versionedstage": "Stage",
When I then click publish I get now get a 2nd entry
1.
"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
"_versionedstage": "Stage",
2. (new entry)
"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false,
"_versionedstage": "Live",
Excluding DataObject with ShowInSearch = false from every getting into the index:
SearchUpdateProcessor:86 -
// Ensure that indexes for all new / updated objects are included
$objs = DataObject::get($base)->byIDs(array_keys($ids));
foreach ($objs as $obj) {
foreach ($ids[$obj->ID] as $index) {
if (!$indexes[$index]->variantStateExcluded($state)) {
// new-code-start
if (isset($obj->ShowInSearch) && !$obj->ShowInSearch && $obj->ID) {
$indexes[$index]->delete($base, $obj->ID, $state);
$indexes[$index]->commit();
continue;
}
// new-code-end
$indexes[$index]->add($obj);
$dirtyIndexes[$index] = $indexes[$index];
}
}
unset($ids[$obj->ID]);
}
Stacktrace for updating solr index after click 'Save' in the CMS
Solr4Service_Core.php:35, SilverStripe\FullTextSearch\Solr\Services\Solr4Service_Core->addDocument()
SolrIndex.php:591, CWP\Search\Solr\CwpSolrIndex->_addAs()
SolrIndex.php:608, CWP\Search\Solr\CwpSolrIndex->add()
SearchUpdateProcessor.php:90, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->prepareIndexes()
SearchUpdateProcessor.php:142, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->process()
SearchUpdateImmediateProcessor.php:9, SilverStripe\FullTextSearch\Search\Processors\SearchUpdateImmediateProcessor->triggerProcessing()
SearchUpdater.php:188, SilverStripe\FullTextSearch\Search\Updaters\SearchUpdater::flush_dirty_indexes()
Yes, I'm not interested in solutions which index records that have ShowInSearch=false and then filter them out somehow. That's a bug, we need to fix it.
Excluding DataObject with ShowInSearch = false from every getting into the index:
The code you added there seems to be aimed at deleting records with ShowInSearch=0 after they've been added already. I think there's two cases to cover: 1. A reindex for this object type should remove those records automatically (clean up after bad defaults), 2. new objects with ShowInSearch=0 shouldn't get into the index in the first place.
Also, careful with doing commit()
in a loop. The only reason that doesn't wreck search servers is that we actually ignore those commands in CWP, but on SC that's a bad idea usually (read up on autoSoftCommit
, it's a bit of a rabbit hole)
Proposed ACs:
Note:
@chillu I'm believe we do actually need to delete the individual record from the index whenever it is updated from ShowInSearch=1 to ShowInSearch=0. We cannot rely solely on developers running a reindex
If we do not delete it when it is updated, the following scenario happens:
So we haven't really solved much unless we delete records from the solrindex whenever ShowInSearch is set to 0
Does this make sense?
The ShowInSearch value is inspected every time a record is saved (non-versioned) or published (versioned), and the index entry is updated accordingly
Sorry, what I meant by this is "that's reflected by adding to or removing from the index"
getShowInSearch() on inlet works as expected, the magic was well crafted back in the day
app/src/Page.php
public function getShowInSearch()
{
// behaves as expected, "overrides" the SiteTree.ShowInSearch DB column
// we can safely assume this will work in the absense of a <dataobject>.ShowInSearch column
}
public function getABC()
{
return false;
}
Debugging within SearchUpdateProcessor:
isset($obj->ABC); // true
$obj->ABC; // false
isset($obj->ABCD); // false
$obj->ABCD; // null
Seems like "outlet" filtering on the solr query isn't easily viable. Reason being that the filtering on the 'ShowInSearch' column requires passing in the full namespace of the column
In the solr index, it stored as
"SilverStripe\\CMS\\Model\\SiteTree_ShowInSearch": false
To filter this, I must add an 'fq' value of 'SilverStripe\CMS\Model\SiteTree_ShowInSearch:false' -- I cannot simply filter using a partial field name e.g. 'ShowInSearch:false' or '_ShowInSearch:false'
Using the search box on the simple theme, the queries passed to solr are pretty generic:
params={
spellcheck=true&
start=0&
q=%2Bmysearchterm&
spellcheck.dictionary=_spellcheck&
json.nl=map&wt=json&
spellcheck.collate=true&
fq=%2B(_versionedstage:"Live"+(*:*+-_versionedstage:[*+TO+*]))&
rows=10
}
To filter ShowInSearch on the solr query, we'd need to know all the relevant DataObjects ahead of time. This just does not seems like a good approach so I'll look to do this elsewhere.
This seems like a more viable place to do outlet filtering:
SolrIndex.php:769
$res = $service->search(
$q ? implode(' ', $q) : '*:*',
$offset,
$limit,
$params,
\Apache_Solr_Service::METHOD_POST
);
$results = new ArrayList();
if ($res->getHttpStatus() >= 200 && $res->getHttpStatus() < 300) {
foreach ($res->response->docs as $doc) {
$result = DataObject::get_by_id($doc->ClassName, $doc->ID);
if ($result) {
// new-code-start
if (isset($result->ShowInSearch) && !$result->ShowInSearch) {
continue;
}
// new-code-end
$results->push($result);
this also has the added benefit of calling ->getShowInSearch()
The only drawback with this approach is that it's on SolrIndex.php instead its parent class SearchIndex.php, though as previously discussed, we're assuming fulltextsearch implementations to be close to 100% solr4. I don't feel as though there's a justification to make the effort to get this on to SearchIndex.php
Just to clarify what's being called and to show that we can't get any "lower"
SolrIndex.php:697
public function search(...)
{
// $service == Solr4Service_Core which is a fulltextsearch class
$service = $this->getService();
...
// this calls Apache_Solr_Service->search(), which is 3rd party module class and is
// the parent class of Solr4Service_Core.
// At this point i'll make a request to the solr server
$service->search(...);
}
Add a warning label about creating non-deterministic getShowInSearch() implementations
Outlet filtering for Solr_Reindex (different path from cms->save())
The following go through the same code path:
SolrReindexBase.php:266
protected function getRecordsInGroup(...
...
$item = ... get some dataobjects
...
// new-code-start
// ShowInSearch filter
// we cannot use $items->remove($item), as that deletes the record from the db
$ids = $items->sort('ID')->column('ID');
$ids = array_fill_keys($ids, true);
foreach ($items as $item) {
if (isset($item->ShowInSearch) && !$item->ShowInSearch) {
unset($ids[$item->ID]);
}
}
$items = DataList::create($class)->filter(['ID' => array_keys($ids)]);
// new-code-end
Using pages created using frameworktest with there's no negative impact of code in the pull-request on index time for doing a solr_reindex on my local (vagrant). It actually runs a bit faster.
fulltextsearch 3.6 2m 40s
fulltextsearch 3.6 branch 2m 25s
Yeah I wouldn't have expected a performance impact, since it already has that field in memory. The performance impact will be more pronounced with canView
Had a good call with Steve, here's some notes.
Principles (descending order of priority):
Observations: Steve start:
Core canView() outlet filtering check done in base fulltextsearch module
SearchForm
- however this is not used by default in CWPCWP canView() outlet filtering checks:
CwpSearchResult
constructorCwpSearchEngine
uses CwpSearchResultSearchControllerExtension
uses CwpSearchEngineBasePageController
has SearchControllerExtension applied to itcanView() checks aren't mentioned in CWP or base module docs (only on CWP in context of "Searching DataObjects")
CWP docs reference a lowlevel API (SolrIndex->search()) in the basic docs (https://www.cwp.govt.nz/developer-docs/en/2/features/solr_search/basic_usage/)
Steve end
Ingo's opinion:
Have added two cucumber studio tests for ShowInSearch filtering https://studio.cucumber.io/projects/161118/test-plan/folders/1193287
The module makes no assumption about what criteria might be applied for records to be added to the index. I'm proposing elsewhere that we filter out records with anonymous canView=false.
In addition, I think we should add behavior to prevent indexing of a record based on custom, model specific flags like
ShowInSearch
. Currently, the docs mention this usage pattern (e.g. https://www.cwp.govt.nz/developer-docs/en/2/features/solr_search/default_search_index/), and recommend adding this as a filter field. You're still in charge of filtering on that field by default. The CWPSearchEngine performs this filter out of the box, but only onSiteTree
andFile
(see pull request). If you're defining your own index on other objects, or include other objects in that index, you need to create your own search engine - the filter field is indexed, but it's ignored during searches. The docs for searching dataobjects fail to mention that as well. This illustrates why we shouldn't have those records in the index in the first place.This has some of the same rollout issues as filter out records with anonymous canView=false, we'd need to advise devs on how to get those existing index entries with ShowInSearch=false out of their index.
When releasing this, we should strongly recommend to search users that they run a reindex to prune any records with ShowInSearch=0 in their current search state.
Another thing to check: when we change ShowInSearch from 1 to 0 and click save/publish, does it get removed from the index straight away?
ACs
Note:
We still have to index the ShowInSearch property, otherwise search request filtering on existing implementations will fail
PRs:
https://github.com/silverstripe/silverstripe-fulltextsearch/pull/275 https://github.com/silverstripe/cwp/pull/260 https://github.com/silverstripe/cwp-search/pull/31
Subtasks
Links
https://silverstripeltd.slack.com/archives/CLXKD9X51/p1584670614347800