silverstripe / silverstripe-elemental

Create pages in Silverstripe CMS using content blocks
http://dna.co.nz
BSD 3-Clause "New" or "Revised" License
110 stars 115 forks source link

Improve method of returning content block data in CMS SiteTree searches #189

Open robbieaverill opened 6 years ago

robbieaverill commented 6 years ago

Original issue: #163

The method proposed in #188 enables support for this, but using DataList::filterByCallback after returning a full set of pages is expensive and will become a problem in larger data sets.

We should improve this for a future minor release after this module is 2.0 stable.

In a private message, @tractorcow suggested replicating what DataQuery::getFinalisedQuery() does in terms of searching all base element classes for all of their text/searchable fields, then left joining your way back up to the page that owns them.

Limitation would be that we set a default/fixed depth so it won't be arbitrary and won't let you search in scalable depth. We could however make this configurable or extensible for devs to customise it to suit their own projects.

Some example pseudo code:

$blocks = ClassInfo::dataClassesFor(BaseElement::class);
$fields = [];
foreach ($blocks as $blockClass) {
   $fields = array_unique(array_merge($fields, $blockClass::config()->get('search_fields')));
}
$conditions = [];
foreach ($fields as $field) {
   $conditions[$field] = $text;
}
$pages = BaseElement::get()->filterAny($conditions)->column('PageID'); // needs to join via ElementArea to get the page ID
$pageIds = SiteTree::get()->byIDs($pages);
// use $pageIds to add extra filter to the parent DataList after other filters have been applied to it

Example code would need to join to the ElementArea that owns the block to be able to get the page ID. We could add an extension point here so as mentioned earlier devs could add more specific joins for deeper nesting into their own project code.

robbieaverill commented 6 years ago

Discussed with @tractorcow - suggested approach:

Then when searching, we collate a list of all searchable fields from all content blocks, add them all to a BaseElement query like this:

$conditions = [];
foreach ($classes as $class) {
    $classFields = $class::create()->searchableFields();
    $tableName = DataObject::getSchema()->tableName($class);

    foreach ($classFields as $classField => $specs) {
        $conditions['"' . $tableName . '"."' . $classField . '" LIKE ?'] = '%' . $searchTerm . '%';
    }
}

$matchingElementIds = BaseElement::get()
    ->whereAny($conditions)
    ->column('ID');

Note that multiple conditions will exist where search fields are used across multiple element types (notable examples are ID, Title and LastEdited which are defined on BaseElement so will be used for all elements that extend it).

From here we can use the many_many through denormalised relationship mentioned earlier to filter by these IDs *:

return SiteTree::get()->filter(['FlattenedBlocks.ID' => $matchingElementIds]);

If we can apply these filters on top of the default search filters from SiteTreeFilter_Search then regular CMS search should still work as well (note: it works in the current implementation too).

*: in theory

chillu commented 6 years ago

The solution is the shortest path to a baseline implementation, but I think it's a missed opportunity. We should at least evaluate all possible options.

The CMS search was always quite clunky - here's one example from silverstripe.org where a search for "security" brings up the "security releases" page on tenth place

image

Option 1: Joins

Note: Either option could switch to using FULLTEXT search for fuzzy matches (e.g. "car port" vs. "carport") and intelligence around multi-term searches

Option 1a: Nested Joins

Option 1b: Flattened Joins

Option 2: Denormalise

Option 2a: Denormalise by "all content" column across tables

Store all content in an object hierarchy on the "root" object (e.g. blocks, nested blocks, other related records, all would be stored on page). Create a separate SearchIndex table. Note that MySQL doesn't support FULLTEXT indexes across tables, so this would need to happen on an app level.

There's two ways of populating this index table:

  1. Add one row per object graph for the "root" object, and aggregate content across objects and columns there.
  2. Add one row per object in object graph, and search across records based on flattened relation data

    • Pro: Only requires expensive hierarchy traversal on write time (could be done via background queue)
    • Con: No per-column relevancy matching (since data is aggregated)
    • Con: No combination with faceted search, unless we specifically copy column data over. This will make it harder to customise ModelAdmin and GridField searches while also retaining fulltext search abilities.
    • Con: Any aggregation on the "root" object would have to be updated whenever any any included record changes (e.g. if you put Element.Content and Page.Content into a combined SearchIndex.Content field, any update to Element will need to perform this aggregation again, including Page.Content)

Option 2b: Denormalise by "all content" column within a single table

Similar to "Option 2e: Denormalise through database index per table", but storing index as column data (not database metadata).

Option 2c: Denormalise by storing rendered HTML

Option 2d: Denormalise via fulltext search engine

Option 2e: Denormalise through database index per table

Create a combined index across columns in one table. MySQL supports this for FULLTEXT indexes (e.g. MATCH(col1, col2, col3) AGAINST "my term"). Similar to "Option 2b: Denormalise by "all content" column within a single table".

Note: I think we've stopped using MySQL FULLTEXT as part of migrating from MyISAM to InnoDB in order to get transaction support, and at some earlier point InnoDB didn't have fulltext capabilities. That's been solved with MySQL 5.6, which came out in 2013: https://www.percona.com/blog/2013/02/26/myisam-vs-innodb-full-text-search-in-mysql-5-6-part-1/.

robbieaverill commented 6 years ago

It's worth noting that the way Elemental worked in the SS3 version was pretty similar to option 2c - we moved away from this to stop duplicating data in the DB (for performance reasons), but it did have the benefit of working natively with both CMS and frontend searches since the denormalised content is still stored in the Content field.

chillu commented 6 years ago

A bit more detail on "Option 2a: Denormalise by "all content" column across tables": Specifically, this concern here is a big one:

Con: No combination with faceted search, unless we specifically copy column data over. This will make it harder to customise ModelAdmin and GridField searches while also retaining fulltext search abilities.

I've also added "Option 2e: Denormalise through database index per table" to distinguish it from "Option 2a".

Here's an example on how Option 2a could work out, which illustrates limitations of that approach:


Page has_many Block GalleryBlock has_many GalleryBlockItem

Page: Created, LastEdited, Title, Content, Description, IsFeatured GalleryBlock: Created, LastEdited, Title, Content GalleryBlockItem: Created, LastEdited, SlideTitle, SlideDescription, Copyright

SearchIndex: ObjectID, ObjectClassName, Created, LastEdited, Title, Content, Page_IsFeatured

Page indexes Created into SearchIndex.Created Page indexes LastEdited into SearchIndex.LastEdited Page indexes IsFeatured into SearchIndex.Page_IsFeatured Page indexes Title into SearchIndex.Title Page indexes Content and Description into SearchIndex.Content GalleryBlock indexes Title into SearchIndex.Title GalleryBlock indexes Content into SearchIndex.Content GalleryBlock doesn't index Created and LastEdited GalleryBlockItem indexes SlideTitle into SearchIndex.Title GalleryBlockItem indexes SlideDescription and Copyright into SearchIndex.Content GalleryBlockItem doesn't index Created and LastEdited

=> Title can be boosted separately from Content, but not per-column or per-record => Built-in faceted search only applies to root object (Created and LastEdited) => Custom faceted search like IsFeatured requires explicit data copy to index

robbieaverill commented 6 years ago

Discussed with @chillu and we decided to leave this as it is for now and consider it again in more detail in future.

cc @be2n

chillu commented 6 years ago

@robbieaverill Can you please get this on your radar again? Might need to split out some PoC for one or two options, to figure out how feasible they are.

robbieaverill commented 6 years ago

cc @brynwhyman @rupachup

chillu commented 6 years ago

This reared its ugly head again in a different shape: link tracking (https://github.com/silverstripe/silverstripe-cms/issues/2276)

Leapfrognz commented 3 years ago

We have a couple of content-heavy sites (1000+ pages containing large numbers of elements), that are experiencing this issue. Happy to provide more detail if needed.