humanmade / altis-enhanced-search

Enhanced Search Module for Altis
https://www.altis-dxp.com/resources/docs/search/
12 stars 2 forks source link

Limit user fields which are indexed by default #408

Open kadamwhite opened 2 years ago

kadamwhite commented 2 years ago

On a large multisite network, we are encountering many errors of this type when indexing users:

{
    "index": {
        "_index": "ep-sitename-user",
        "_type": "_doc",
        "_id": "338",
        "status": 400,
        "error": {
            "type": "illegal_argument_exception",
            "reason": "Limit of total fields [5000] has been exceeded"
        }
    }
}

We have manually excluded some fields using the ep_prepare_user_meta_data hook, but are still running into this error. This ticket proposes further limiting the fields which get indexed by default, so that it is less likely that user indexing will fail on a large network.

Specifically, I wonder whether these fields need to be indexed:

@rmccue suggested the roles and capabilities fields may be needed for user queries, although we weren't sure off the top whether Elasticpress involves itself in the normal queries Core does. But these fields represent a significant percentage of the meta stored against a user, and as a layperson I struggle to imagine how I'd want to search for somebody based on these values.

Acceptance crtieria:

roborourke commented 2 years ago

hm.workflows.* definitely doesn't need to be and I suspect that's responsible for the majority of entries. I would ideally look at the complete list of fields in the mapping to get an idea of whether the others really count as heavily towards the total field count.

ElasticPress can be used for regular queries but in Altis it only kicks in when performing a search by default. Some projects might be using the ep_integrate query var to improve performance of non search related queries however.

roborourke commented 2 years ago

I've updated the acceptance criteria with what I think will give us a good baseline and also start some docs top make these common errors more searchable.

goldenapples commented 2 years ago

In projects using the Roles to Taxonomy plugin, the capabilities meta fields can probably be ignored by default. In other cases, I think we should be able to specify the index type of that field, so that we're only indexing it once instead of multiple times for each of the different indexable types.

roborourke commented 2 years ago

Would be good to know if that plugin plays nicely with capability queries actually 🤔

veselala commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @robindevitt @shadyvb

veselala commented 2 years ago

@roborourke we've estimate this one to 5 SP and moved it to the backlog

kadamwhite commented 1 year ago

@ivankristianto notes in the issue which spawned this suggestion, that the default fields ES does use for indexing are very limited:

$prepared_search_fields = [
    'user_login',
    'user_nicename',
    'display_name',
    'user_url',
    'user_email',
    'meta.first_name',
    'meta.last_name',
    'meta.nickname',
];

He elaborates there,

unless you add more search fields, all other meta in the ES index is useless.

I would recommend to change to treat meta with inclusion method. Only index the fields that you want to search against. You can use this filter ep_prepare_user_meta_excluded_public_keys, get all the user meta keys, filter by the meta you want to use, and then exclude everything else.

This would be more concise overall than maintaining a long list of patterns to test against in order to omit data, and runs less risk of throwing away something that an Altis site does end up wanting to search on.

One final thought: Fields get interpreted into many different types of mapping in ES. I wonder if it is necessary to have long and double representations of a field which is not numeric? or whether we should throw away date-time mapping values if the date resolves to 1970-01-01 00:00:01, as it would with a non-date value? ES computes these because it's not sure what you might expect it to be, but it seems like it contributes to the ballooning number of mappings without providing any benefit. I may misunderstand and these additional mapping types might be necessary to keep, but if they aren't, they could be removed with the ep_user_sync_args filter by looping through the $user_meta['meta'] array.