nestauk / dapsboard

DAPS user interface
MIT License

Aggregation constraints #201

Open mindrones opened 3 years ago

mindrones commented 3 years ago

So far we have:

These could be arrays expressed like:

request: {
  params: {
    foo: optional(integer),
    bar: optional(string),
    qoo: optional(number),
    par: optional(string),
    moo: optional(integer),
    car: optional(integer),
  },
  constraints: [
    {kind: 'xor', params: ['foo', 'bar']},
    {kind: 'xor', params: ['qoo', 'par']},
    {kind: 'some', params: ['moo', 'car']},
  ]
}
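To make the intent concrete, here is a rough sketch of how such an array could be evaluated. It assumes that `xor` means "exactly one of the listed params is set" and `some` means "at least one is set"; `isSet`, `checkers` and `findViolations` are hypothetical names, not existing code.

```javascript
// A param counts as "set" when the user has provided a value for it
const isSet = (values, name) => values[name] !== undefined;

// One predicate per constraint kind (assumed semantics, see above)
const checkers = {
  // exactly one of the listed params must be set
  xor: (values, params) =>
    params.filter(name => isSet(values, name)).length === 1,
  // at least one of the listed params must be set
  some: (values, params) =>
    params.some(name => isSet(values, name)),
};

// Returns the constraints that the current values violate
const findViolations = (values, constraints) =>
  constraints.filter(({kind, params}) => !checkers[kind](values, params));

const constraints = [
  {kind: 'xor', params: ['foo', 'bar']},
  {kind: 'xor', params: ['qoo', 'par']},
  {kind: 'some', params: ['moo', 'car']},
];

// foo and bar are both set, so only the first xor is violated
const violations = findViolations(
  {foo: 1, bar: 'a', qoo: 2.5, moo: 3},
  constraints
);
console.log(violations);
```

Returning the violated constraints rather than a boolean would let the UI point at the offending inputs.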
mindrones commented 3 years ago

Another kind of constraint is when an aggregation parameter type depends on the type of the selected field:

fieldType: number,
label: 'Histogram',
lastChecked: '7.9',
request: {
    ...
    missing: optional(number),
    ...
},

Here we want to express that if the selected field is an integer (e.g. for years), then the missing input in the UI can only accept whole numbers.

This could be expressed like:

request: {
    params: {
        ...
        missing: optional(number),
        ...
    },
    constraints: [
        {kind: 'fieldType', params: ['missing']},
    ]
},
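A minimal sketch of how the UI could check this, assuming the constraint means "the listed params must hold a valid value for the type of the selected field". The names `typeCheckers` and `checkFieldType` are made up for illustration.

```javascript
// Validators per field type (only the two types discussed here)
const typeCheckers = {
  integer: Number.isInteger,
  number: Number.isFinite,
};

// An unset param is always fine, since `missing` is optional
const satisfiesFieldType = (fieldType, value) =>
  value === undefined || typeCheckers[fieldType](value);

// True when all the constrained params match the selected field type
const checkFieldType = (constraint, fieldType, values) =>
  constraint.params.every(name =>
    satisfiesFieldType(fieldType, values[name])
  );

const constraint = {kind: 'fieldType', params: ['missing']};

console.log(checkFieldType(constraint, 'integer', {missing: 2000}));   // true
console.log(checkFieldType(constraint, 'integer', {missing: 2000.5})); // false
console.log(checkFieldType(constraint, 'number', {missing: 2000.5}));  // true
```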
mindrones commented 3 years ago

Some aggregations can only be used as a child of another specific aggregation: for example, rate can only be used inside a date_histogram.

This could be expressed as a constraints key on the aggregation object:

export default {
    id: 'rate',
    ...
    constraints: [
        {kind: 'parent', aggs: ['date_histogram']}
    ],
    ...
}
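A possible check when the user tries to nest an aggregation, as a sketch. The `aggs` registry and `canNest` are hypothetical, and the registry only lists what the example needs.

```javascript
// Hypothetical registry of aggregation definitions, keyed by id
const aggs = {
  date_histogram: {id: 'date_histogram'},
  terms: {id: 'terms'},
  rate: {
    id: 'rate',
    constraints: [{kind: 'parent', aggs: ['date_histogram']}],
  },
};

// True when `childId` may be nested directly under `parentId`.
// Pass `undefined` as parentId for a top-level aggregation.
const canNest = (childId, parentId) =>
  (aggs[childId].constraints || [])
    .filter(constraint => constraint.kind === 'parent')
    .every(constraint => constraint.aggs.includes(parentId));

console.log(canNest('rate', 'date_histogram')); // true
console.log(canNest('rate', 'terms'));          // false
console.log(canNest('rate', undefined));        // false
console.log(canNest('terms', 'rate'));          // true
```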
mindrones commented 3 years ago

For some aggregations, the set of values a parameter accepts depends on the value of another parameter, potentially in the parent aggregation.

For example, rate's unit has a specific relationship with the interval used by the parent aggregation.

export default {
    id: 'rate',
    ...
    request: {
        params: {
            ...,
            unit: optional(string),
        },
        constraints: [
            {
                kind: 'value-sets',
                filters: [
                    {
                        if: [{
                            agg: 'parent',
                            param: 'calendar_interval',
                            values: calendarIntervals
                        }],
                        then: [{
                            param: 'unit',
                            values: rateIntervalsToWeek
                        }],
                    },
                    {
                        if: [{
                            agg: 'parent',
                            param: 'calendar_interval',
                            values: calendarIntervalsFromMonth
                        }],
                        then: [{
                            param: 'unit',
                            values: rateIntervalsFromMonth
                        }],
                    }
                ],
            }
        ]
    }
}

Should we find a value constraint among fields of the same agg, we might express it by simply omitting the agg key in filters:

request: {
    params: {
        ...,
        foo: string,
        bar: string,
    },
    constraints: [
        {
            kind: 'value-sets',
            filters: [
                {
                    if: [{
                        param: 'foo',
                        values: fooSet1
                    }],
                    then: [{
                        param: 'bar',
                        values: barSet1
                    }],
                },
                {
                    if: [{
                        param: 'foo',
                        values: fooSet2
                    }],
                    then: [{
                        param: 'bar',
                        values: barSet2
                    }],
                },
            ],
        }
    ]
}

Note that with this syntax it'd probably be possible to constrain more than two fields:

request: {
    params: {
        ...,
        foo: string,
        bar: string,
        baz: string,
    },
    constraints: [
        {
            kind: 'value-sets',
            filters: [
                {
                    if: [{
                        param: 'foo',
                        values: fooSet1
                    }],
                    then: [{
                        param: 'bar',
                        values: barSet1
                    }],
                },
                {
                    if: [{
                        param: 'foo',
                        values: fooSet2
                    }],
                    then: [
                        {
                            param: 'bar',
                            values: barSet2
                        },
                        {
                            param: 'baz',
                            values: bazSet2
                        },
                    ]
                },
            ],
        }
    ]
}

This syntax expresses directionality: values in certain fields control values in other fields, not the other way around, so the user would have to avoid conflicts. TBD
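A rough sketch of how a value-sets constraint could be resolved into the values currently allowed for each controlled param. The sets below are placeholders, since the real ones are not defined in this thread, and `clauseMatches` and `allowedValues` are hypothetical names. It assumes that a filter fires when all its `if` clauses match, and that when several firing filters target the same param the last one wins (which is just one possible choice).

```javascript
// Placeholder sets, for illustration only
const fooSet1 = ['a', 'b'];
const fooSet2 = ['c'];
const barSet1 = ['x', 'y'];
const barSet2 = ['z'];
const bazSet2 = ['k'];

const constraint = {
  kind: 'value-sets',
  filters: [
    {
      if: [{param: 'foo', values: fooSet1}],
      then: [{param: 'bar', values: barSet1}],
    },
    {
      if: [{param: 'foo', values: fooSet2}],
      then: [
        {param: 'bar', values: barSet2},
        {param: 'baz', values: bazSet2},
      ],
    },
  ],
};

// A clause with `agg: 'parent'` reads from the parent's values,
// otherwise from the aggregation's own values
const clauseMatches = ({agg, param, values}, own, parent) =>
  values.includes((agg === 'parent' ? parent : own)[param]);

// Maps each controlled param to the values it may currently take
const allowedValues = ({filters}, own, parent = {}) => {
  const allowed = {};
  for (const filter of filters) {
    if (filter.if.every(clause => clauseMatches(clause, own, parent))) {
      for (const {param, values} of filter.then) {
        allowed[param] = values;
      }
    }
  }
  return allowed;
};

console.log(allowedValues(constraint, {foo: 'c'}));
// {bar: ['z'], baz: ['k']}
```

Params that are absent from the result are unconstrained. Because only the `then` side is ever restricted, the directionality mentioned above is preserved.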

mindrones commented 3 years ago

Aggregations operate in breadth_first or depth_first collect mode.

Sub-aggregations that require scores are incompatible with breadth_first, because breadth_first discards scores [1].

This can probably be an implicit constraint once all aggregations have collect_mode set, but we might want to consider making it explicit using a constraint at the aggregation level:

export default {
    id: 'rare_terms',
    ...
    collect_mode: 'breadth_first',
    constraints: [
        {kind: 'collect_mode', value: 'breadth_first'}
    ],
    ...
}

[1] Examples:

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-rare-terms-aggregation.html#_nested_rareterms_and_scoring_sub_aggregations:

The RareTerms aggregation has to operate in breadth_first mode, since it needs to prune terms as doc count thresholds are breached. This requirement means the RareTerms aggregation is incompatible with certain combinations of aggregations that require depth_first. In particular, scoring sub-aggregations that are inside a nested force the entire aggregation tree to run in depth_first mode. This will throw an exception since RareTerms is unable to process depth_first.

As a concrete example, if rare_terms aggregation is the child of a nested aggregation, and one of the child aggregations of rare_terms needs document scores (like a top_hits aggregation), this will throw an exception.

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-sampler-aggregation.html#sampler-breadth-first-nested-agg

Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.

mindrones commented 3 years ago

Some aggregations don't support child aggregations. [1]

This could be:

export default {
    id: 'significant_text',
    ...
    constraints: [
        {kind: 'no-children'}
    ],
    ...
}

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-significanttext-aggregation.html#_no_support_for_child_aggregations

mindrones commented 3 years ago

Some aggregations cannot be used with text fields in nested objects. [1]

This could be expressed like this:

export default {
    id: 'significant_text',
    ...
    constraints: [
        {kind: 'no-nested-objects'}
    ],
    ...
}

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-significanttext-aggregation.html#_no_support_for_nested_objects

mindrones commented 3 years ago

Some parameters have to be greater than others, e.g.

shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-significantterms-aggregation.html#sig-terms-shard-size

In this case we might use:

export default {
    id: 'significant_text',
    ...
    request: {
        params: {
            ...,
            shard_size: optional(integerD(-1)),
            size: optional(integerD(10, true)),
        },
        constraints: [
            {kind: 'gt', params: ['shard_size', 'size']}
        ]
    }
}

In this particular case, for example, the constraint should apply only if shard_size is positive:

If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.

so we might need to think about how to express exceptions, as some kind of conditional constraints:


export default {
    id: 'significant_text',
    ...
    request: {
        params: {
            ...,
            shard_size: optional(integerD(-1)),
            size: optional(integerD(10, true)),
        },
        constraints: [
            {
                if: [
                    {kind: 'gt-value', params: ['shard_size'], value: 0}
                ],
                then: [
                    {kind: 'gt', params: ['shard_size', 'size']}
                ],
            }
        ]
    }
}
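A sketch of how such a conditional constraint could be evaluated, assuming it is satisfied when its `if` part does not hold or when its `then` part holds. `predicates`, `holds` and `checkConditional` are hypothetical names.

```javascript
// One predicate per constraint kind used in the example
const predicates = {
  // the first param must be greater than the second
  gt: ({params: [a, b]}, values) => values[a] > values[b],
  // the param must be greater than a fixed value
  'gt-value': ({params: [a], value}, values) => values[a] > value,
};

const holds = (constraint, values) =>
  predicates[constraint.kind](constraint, values);

// Satisfied when the `if` part fails or the `then` part holds
const checkConditional = (constraint, values) =>
  !constraint.if.every(c => holds(c, values)) ||
  constraint.then.every(c => holds(c, values));

const constraint = {
  if: [{kind: 'gt-value', params: ['shard_size'], value: 0}],
  then: [{kind: 'gt', params: ['shard_size', 'size']}],
};

console.log(checkConditional(constraint, {shard_size: -1, size: 10})); // true
console.log(checkConditional(constraint, {shard_size: 5, size: 10}));  // false
console.log(checkConditional(constraint, {shard_size: 20, size: 10})); // true
```

Note that with a strict `gt`, a shard_size equal to size would be rejected, while the quoted docs only rule out a shard_size smaller than size, so a `gte` kind might fit this case better.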
mindrones commented 3 years ago

In this case,

This aggregation cannot currently be nested under any aggregation that collects from more than a single bucket.

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-variablewidthhistogram-aggregation.html

Once we give a name to the group of aggregations that collects from a single bucket (say single-bucket) we need to assign that to a prop (say foo) in all aggs, then express this constraint at the agg level with something like:

export default {
    id: 'variable_width_histogram',
    ...
    constraints: [
        {kind: 'parent-type', key: 'foo', values: ['single-bucket']}
    ],
    ...
}
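As a sketch of how this could be checked: the `foo` prop and the 'single-bucket' value are the placeholders used above, while 'multi-bucket', the `aggs` registry and `parentTypeOk` are made up for illustration. A top-level aggregation (no parent) is assumed to always pass.

```javascript
// Hypothetical registry; `foo` holds the group each aggregation belongs to
const aggs = {
  filter: {id: 'filter', foo: 'single-bucket'},
  terms: {id: 'terms', foo: 'multi-bucket'},
  variable_width_histogram: {
    id: 'variable_width_histogram',
    foo: 'multi-bucket',
    constraints: [
      {kind: 'parent-type', key: 'foo', values: ['single-bucket']},
    ],
  },
};

// True when `childId` may be nested under `parentId`;
// pass `undefined` as parentId for a top-level aggregation
const parentTypeOk = (childId, parentId) =>
  parentId === undefined ||
  (aggs[childId].constraints || [])
    .filter(constraint => constraint.kind === 'parent-type')
    .every(({key, values}) => values.includes(aggs[parentId][key]));

console.log(parentTypeOk('variable_width_histogram', 'filter'));  // true
console.log(parentTypeOk('variable_width_histogram', 'terms'));   // false
console.log(parentTypeOk('variable_width_histogram', undefined)); // true
```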
mindrones commented 3 years ago

Some parameters have a maximum:

Parameters buckets, shard_size, and initial_buffer are optional. By default, buckets = 10, shard_size = buckets * 50, and initial_buffer = min(10 * shard_size, 50000).

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-bucket-variablewidthhistogram-aggregation.html#_shard_size_4

This could be:

export default {
    id: 'variable_width_histogram',
    ...
    request: {
        params: {
            ...,
            initial_buffer: optional(integerD(5000)),
        },
        constraints: [
            {kind: 'max-value', params: ['initial_buffer'], value: 50000}
        ]
    }
}

Likewise, parameters can have a minimum:

sigma can be any non-negative double

https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-aggregations-metrics-extendedstats-aggregation.html#_standard_deviation_bounds

which would be expressed with:

export default {
    id: 'extended_stats',
    ...
    request: {
        params: {
            ...,
            sigma: optional(floatD(2)),
        },
        constraints: [
            {kind: 'min-value', params: ['sigma'], value: 0}
        ]
    }
}
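Both kinds could share one small checker. This is only a sketch: `bounds` and `findOutOfBounds` are hypothetical names, and the bounds are assumed to be inclusive (consistent with "non-negative" for sigma).

```javascript
// Inclusive bounds (assumption)
const bounds = {
  'min-value': (value, limit) => value >= limit,
  'max-value': (value, limit) => value <= limit,
};

// Returns the constraints violated by the params that are set;
// unset params are skipped because they are optional
const findOutOfBounds = (constraints, values) =>
  constraints.filter(({kind, params, value}) =>
    params.some(name =>
      values[name] !== undefined && !bounds[kind](values[name], value)
    )
  );

const constraints = [
  {kind: 'max-value', params: ['initial_buffer'], value: 50000},
  {kind: 'min-value', params: ['sigma'], value: 0},
];

console.log(findOutOfBounds(constraints, {initial_buffer: 60000, sigma: 2}).length); // 1
console.log(findOutOfBounds(constraints, {initial_buffer: 5000, sigma: -1}).length); // 1
console.log(findOutOfBounds(constraints, {sigma: 0}).length);                        // 0
```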