robbieaverill commented 6 years ago

Original issue: #163

The method proposed in #188 enables support for this, but using DataList::filterByCallback after returning a full set of pages is expensive and will become a problem in larger data sets.

We should improve this for a future minor release after this module is 2.0 stable.

In a private message, @tractorcow suggested replicating what DataQuery::getFinalisedQuery() does in terms of searching all base element classes for all of their text/searchable fields, then left joining your way back up to the page that owns them.

Limitation would be that we set a default/fixed depth so it won't be arbitrary and won't let you search in scalable depth. We could however make this configurable or extensible for devs to customise it to suit their own projects.

Some example pseudo code:

$blocks = ClassInfo::dataClassesFor(BaseElement::class);
$fields = [];
foreach ($blocks as $blockClass) {
   $fields = array_unique(array_merge($fields, $blockClass::config()->get('search_fields')));
}
$conditions = [];
foreach ($fields as $field) {
   $conditions[$field] = $text;
}
$pages = BaseElement::get()->filterAny($conditions)->column('PageID'); // needs to join via ElementArea to get the page ID
$pageIds = SiteTree::get()->byIDs($pages);
// use $pageIds to add extra filter to the parent DataList after other filters have been applied to it

Example code would need to join to the ElementArea that owns the block to be able to get the page ID. We could add an extension point here so as mentioned earlier devs could add more specific joins for deeper nesting into their own project code.

robbieaverill commented 6 years ago

Discussed with @tractorcow - suggested approach:

Add a many_many through relationship from SiteTree (not whatever the elemental extension is on) to BaseElement
When a page that has elemental applied is written, find all nested element IDs and collate into a list
Set the many many relations e.g. $this->owner->FlattenedBlocks()->setByIDs($collatedListOfIds);

Then when searching, we collate a list of all searchable fields from all content blocks, add them all to a BaseElement query like this:

$conditions = [];
foreach ($classes as $class) {
    $classFields = $class::create()->searchableFields();
    $tableName = DataObject::getSchema()->tableName($class);

    foreach ($classFields as $classField => $specs) {
        $conditions['"' . $tableName . '"."' . $classField . '" LIKE ?'] = '%' . $searchTerm . '%';
    }
}

$matchingElementIds = BaseElement::get()
    ->whereAny($conditions)
    ->column('ID');

Note that multiple conditions will exist where search fields are used across multiple element types (notable examples are ID, Title and LastEdited which are defined on BaseElement so will be used for all elements that extend it).

From here we can use the many_many through denormalised relationship mentioned earlier to filter by these IDs *:

return SiteTree::get()->filter(['FlattenedBlocks.ID' => $matchingElementIds]);

If we can apply these filters on top of the default search filters from SiteTreeFilter_Search then regular CMS search should still work as well (note: it works in the current implementation too).

*: in theory

chillu commented 6 years ago

The solution is the shortest path to a baseline implementation, but I think it's a missed opportunity. We should at least evaluate all possible options.

The CMS search was always quite clunky - here's one example from silverstripe.org where a search for "security" brings up the "security releases" page on tenth place

Option 1: Joins

Pro: Combines partial match (LIKE) search with faceted search (e.g. date ranges on particular columns)
Pro: Closest to status quo (using CMSSiteTreeFilter and PartialMatchFilter which produces LIKE queries)
Pro: Doesn't require denormalisation, uses original data
Con: No relevancy ranking across columns, a page/block that contains a term in one column will be ranked just as highly as another page/block that contains it in every column. There's no way to rank matches in "title" field higher than in "content" field.

Note: Either option could switch to using FULLTEXT search for fuzzy matches (e.g. "car port" vs. "carport") and intelligence around multi-term searches

Option 1a: Nested Joins

Con: Could result in joining dozens of tables for larger sites (every content block type), plus dozens more for additional types beyond blocks (e.g. "GalleryBlockItem"). If we make this deep searching opt-in for devs, very few authors will actually see the benefits
Con: Unclear how you'd combine type-based joins with object hierarchy joins (temp tables?)

Option 1b: Flattened Joins

Con: Two-stage process - search blocks, then include matching blocks in page search. No meaningful relevancy ranking between the two result sets. Could be resolved via nested joins (same cons from Option 1a apply)
Con: Denormalisation - need to keep relationships in sync on any ORM updates

Option 2: Denormalise

Con: Roughly doubles uncompressed database storage, which has follow on effects like increasing backup runtimes and hosting costs
Con: More complex search engine behaviour debugging
Con: Requires migration task for existing websites (and potentially time intensive creation of index on deployment)

Option 2a: Denormalise by "all content" column across tables

Store all content in an object hierarchy on the "root" object (e.g. blocks, nested blocks, other related records, all would be stored on page). Create a separate SearchIndex table. Note that MySQL doesn't support FULLTEXT indexes across tables, so this would need to happen on an app level.

There's two ways of populating this index table:

Add one row per object graph for the "root" object, and aggregate content across objects and columns there.
Add one row per object in object graph, and search across records based on flattened relation data
- Pro: Only requires expensive hierarchy traversal on write time (could be done via background queue)
- Con: No per-column relevancy matching (since data is aggregated)
- Con: No combination with faceted search, unless we specifically copy column data over. This will make it harder to customise ModelAdmin and GridField searches while also retaining fulltext search abilities.
- Con: Any aggregation on the "root" object would have to be updated whenever any any included record changes (e.g. if you put Element.Content and Page.Content into a combined SearchIndex.Content field, any update to Element will need to perform this aggregation again, including Page.Content)

Option 2b: Denormalise by "all content" column within a single table

Similar to "Option 2e: Denormalise through database index per table", but storing index as column data (not database metadata).

Pro: Allows relevancy ranking across columns
Con: Same concerns as with "nested joins" to get cross table behaviour
Con: All of Option 2a

Option 2c: Denormalise by storing rendered HTML

Pro: Doesn't need to know anything about the internal model relationships
Con: Requires HTML output, doesn't work on headless or app-based systems
Con: Requires HTML templates to be completed before output becomes searchable
Con: All of Option 2a

Option 2d: Denormalise via fulltext search engine

Pro: Fully featured search implementation (e.g. ranking, fuzzy matching)
Pro: Potential to reuse indexes created for website visitors
Con: Requires additional system components (e.g. Solr), would need to be optional
Con: Moves away from our CMSSiteTreeFilter PHP API, wouldn't cleanly apply to a Solr query
Con: If we're making it optional, any modules augmenting CMS search abilities would likely need two logic paths for search indexes (one database-based, one Solr-based)

Option 2e: Denormalise through database index per table

Create a combined index across columns in one table. MySQL supports this for FULLTEXT indexes (e.g. MATCH(col1, col2, col3) AGAINST "my term"). Similar to "Option 2b: Denormalise by "all content" column within a single table".

Pro: Increases index sizes, but doesn't alter content storage requirements
Con: Does not allow relevancy ranking across columns
Con: Same concerns as with "nested joins" to get cross table behaviour
Con: All of Option 2a

Note: I think we've stopped using MySQL FULLTEXT as part of migrating from MyISAM to InnoDB in order to get transaction support, and at some earlier point InnoDB didn't have fulltext capabilities. That's been solved with MySQL 5.6, which came out in 2013: https://www.percona.com/blog/2013/02/26/myisam-vs-innodb-full-text-search-in-mysql-5-6-part-1/.

robbieaverill commented 6 years ago

It's worth noting that the way Elemental worked in the SS3 version was pretty similar to option 2c - we moved away from this to stop duplicating data in the DB (for performance reasons), but it did have the benefit of working natively with both CMS and frontend searches since the denormalised content is still stored in the Content field.

chillu commented 6 years ago

A bit more detail on "Option 2a: Denormalise by "all content" column across tables": Specifically, this concern here is a big one:

Con: No combination with faceted search, unless we specifically copy column data over. This will make it harder to customise ModelAdmin and GridField searches while also retaining fulltext search abilities.

I've also added "Option 2e: Denormalise through database index per table" to distinguish it from "Option 2a".

Here's an example on how Option 2a could work out, which illustrates limitations of that approach:

Page has_many Block GalleryBlock has_many GalleryBlockItem

Page: Created, LastEdited, Title, Content, Description, IsFeatured GalleryBlock: Created, LastEdited, Title, Content GalleryBlockItem: Created, LastEdited, SlideTitle, SlideDescription, Copyright

SearchIndex: ObjectID, ObjectClassName, Created, LastEdited, Title, Content, Page_IsFeatured

Page indexes Created into SearchIndex.Created Page indexes LastEdited into SearchIndex.LastEdited Page indexes IsFeatured into SearchIndex.Page_IsFeatured Page indexes Title into SearchIndex.Title Page indexes Content and Description into SearchIndex.Content GalleryBlock indexes Title into SearchIndex.Title GalleryBlock indexes Content into SearchIndex.Content GalleryBlock doesn't index Created and LastEdited GalleryBlockItem indexes SlideTitle into SearchIndex.Title GalleryBlockItem indexes SlideDescription and Copyright into SearchIndex.Content GalleryBlockItem doesn't index Created and LastEdited

=> Title can be boosted separately from Content, but not per-column or per-record => Built-in faceted search only applies to root object (Created and LastEdited) => Custom faceted search like IsFeatured requires explicit data copy to index

robbieaverill commented 6 years ago

Discussed with @chillu and we decided to leave this as it is for now and consider it again in more detail in future.

cc @be2n

chillu commented 6 years ago

@robbieaverill Can you please get this on your radar again? Might need to split out some PoC for one or two options, to figure out how feasible they are.

robbieaverill commented 6 years ago

cc @brynwhyman @rupachup

chillu commented 6 years ago

This reared its ugly head again in a different shape: link tracking (https://github.com/silverstripe/silverstripe-cms/issues/2276)

Leapfrognz commented 3 years ago

We have a couple of content-heavy sites (1000+ pages containing large numbers of elements), that are experiencing this issue. Happy to provide more detail if needed.

silverstripe / silverstripe-elemental

Improve method of returning content block data in CMS SiteTree searches #189

Option 1: Joins

Option 1a: Nested Joins

Option 1b: Flattened Joins

Option 2: Denormalise

Option 2a: Denormalise by "all content" column across tables

Option 2b: Denormalise by "all content" column within a single table

Option 2c: Denormalise by storing rendered HTML

Option 2d: Denormalise via fulltext search engine

Option 2e: Denormalise through database index per table