marklogic-community / Corona

Community REST API for MarkLogic

RFE: Analytical functions against facets #34

Open hunterhacker opened 12 years ago

hunterhacker commented 12 years ago

We could support a few analytical functions against ranges:

Median and other percentiles, mean (average), standard deviation, and sum.

These will run much faster if you don't have to transmit the full batch of data from the server to client.

hunterhacker commented 12 years ago

Idea for implementation:

A new parameter to /facet that's ?include=avg,sum,count,totalvalues,stdev,median,percentiles:5,25,50,75,95,min,max
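One wrinkle with this syntax is that the percentile list reuses the comma that separates the function names. A minimal sketch of how the value could be parsed, in Python for illustration only (Corona itself is not written in Python). The name `parse_include` and the rule that numeric tokens after `percentiles:` belong to the percentile list are assumptions, not part of the proposal:

```python
def _is_number(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_include(raw):
    """Parse an include value such as 'avg,percentiles:5,25,50,min'.

    Assumed rule: a token of the form 'name:N' opens a numeric list, and
    the numeric tokens that follow extend it until a non-numeric token
    appears. Plain function names map to True.
    """
    requested = {}
    open_list = None  # the numeric list currently being extended, if any
    for token in (t.strip() for t in raw.split(",")):
        if not token:
            continue
        if ":" in token:
            name, first = token.split(":", 1)
            open_list = [float(first)] if _is_number(first) else []
            requested[name] = open_list
        elif open_list is not None and _is_number(token):
            open_list.append(float(token))
        else:
            open_list = None
            requested[token] = True
    return requested
```

With the example value above, `percentiles` would map to `[5.0, 25.0, 50.0, 75.0, 95.0]` and every other name to `True`.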

Some work for numbers only, some work for any data types.

For numbers only:

avg returns the cts:avg($values) // aka the average of all real values, being smart about frequencies

sum returns the cts:sum($values) // aka the sum of all real values, being smart about frequencies

stdev returns the standard deviation of values (no built-in, Kelly has code that does it), for numerics only

For all types:

count returns the cts:count($values) // aka the count of all real values, aka the sum of all cts:frequencies

totalvalues returns the count($values)

median returns the median of values (no built-in, remember to take into account the frequencies of each, I have code that does it), for numerics only

percentiles returns the items matching those percentiles

min and max return, well, I'll let you guess.

Note that item-frequency vs fragment-frequency has a big impact on these things and that should be well explained.
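To make "being smart about frequencies" concrete, here is a rough sketch of the numeric functions computed over (value, frequency) pairs. It is written in Python purely for illustration and is not the code mentioned above. The function name `weighted_stats`, the input shape, and the choice of population (rather than sample) standard deviation are all assumptions:

```python
import math


def weighted_stats(pairs):
    """Compute frequency-aware statistics over (value, frequency) pairs.

    Each distinct value appears once, paired with how often it occurs.
    """
    pairs = list(pairs)
    count = sum(freq for _, freq in pairs)  # sum of all frequencies
    totalvalues = len(pairs)                # number of distinct values
    if count == 0:
        return {"count": 0, "totalvalues": totalvalues, "sum": 0,
                "avg": None, "stdev": None, "min": None, "max": None}
    total = sum(value * freq for value, freq in pairs)
    avg = total / count
    # Population variance: each value contributes once per occurrence.
    variance = sum(freq * (value - avg) ** 2 for value, freq in pairs) / count
    return {
        "count": count,
        "totalvalues": totalvalues,
        "sum": total,
        "avg": avg,
        "stdev": math.sqrt(variance),
        "min": min(value for value, _ in pairs),
        "max": max(value for value, _ in pairs),
    }
```

The frequencies passed in determine the answer: for the same three distinct values, `[(10, 1), (20, 3), (40, 1)]` gives an average of 22, while `[(10, 1), (20, 1), (40, 1)]` gives roughly 23.3. That is the kind of difference that item-frequency versus fragment-frequency can produce, since the two count occurrences differently.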

collwhit commented 12 years ago

Cool, and incredibly useful. Questions...(predictably).

How does this play with bucketed ranges?

And on the last comment, some of those calculations could be really efficient if values are returned in a particular order; how does this play with the facet ordering options (probably orthogonal, but maybe worth a thought?).

hunterhacker commented 12 years ago

Bucketed ranges... Seems like we could return the same results as if there wasn't bucketing happening. Let's imagine you're looking at prices. You can get the min, max, median, stdev, distinct count, percentiles, against the raw results. Plus a nice bucketed version. I don't think I'd want the median bucket.

Dates are a special case? Maybe you're bucketing by day and you probably want the median day rather than the median raw timestamp. It'd be faster to run that to boot. But that's the best example I can think of for trying to do the analytical functions against the buckets themselves. Probably because dates have such natural buckets.

But if I had to pick, I'd say work against the raw data. Note that with ML a few cts:* calls don't work on buckets, so you'll have to do a cts:element-values() call in addition to cts:element-value-ranges(). For example, I see cts:count($ranges) returns the same result as it would without bucketing, but cts:sum($ranges) and cts:avg($ranges) raise an error.

The other option is to error if you do bucketing with those calls.
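A toy illustration of why count survives bucketing while sum and avg need the raw values. This is a Python sketch for illustration, not a description of how cts:element-value-ranges() works internally; `bucketize` and the bounds are hypothetical:

```python
from bisect import bisect_right


def bucketize(pairs, bounds):
    """Collapse (value, frequency) pairs into per-bucket frequency totals.

    bounds is a sorted list of cut points; a value equal to a cut point
    falls into the bucket that starts at it.
    """
    counts = [0] * (len(bounds) + 1)
    for value, freq in pairs:
        counts[bisect_right(bounds, value)] += freq
    return counts


prices = [(5, 2), (12, 1), (18, 4), (30, 1)]
buckets = bucketize(prices, [10, 20])

# The total count is the same with or without bucketing.
total_count = sum(freq for _, freq in prices)
bucketed_count = sum(buckets)

# The sum needs the raw values; bucket totals no longer carry them.
raw_sum = sum(value * freq for value, freq in prices)
```

Once the values are collapsed into buckets, only the per-bucket frequencies remain, so a sum or average has to come from a second pass over the unbucketed values.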

hunterhacker commented 12 years ago

On the last comment... To do median or percentiles we're going to need to fetch the facets in item order. The user may still prefer to get the top 10 most frequent, and if they ask for both we can just do two calls internally to get that arranged for them in one response.
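A sketch of the "two calls, one response" idea, again in Python for illustration only. The nearest-rank percentile definition, the function names, and the response shape are assumptions; the two sorts stand in for the two internal facet calls (item order and frequency order):

```python
import math


def weighted_percentile(item_ordered, p):
    """Nearest-rank percentile over (value, frequency) pairs in item order."""
    total = sum(freq for _, freq in item_ordered)
    if total == 0:
        return None
    rank = max(1, math.ceil(p * total / 100))  # 1-based rank of the target item
    running = 0
    for value, freq in item_ordered:
        running += freq
        if running >= rank:
            return value
    return item_ordered[-1][0]


def facet_response(pairs, percentiles, top):
    """Combine percentile results with the most frequent values."""
    by_item = sorted(pairs)                                       # item order
    by_frequency = sorted(pairs, key=lambda vf: (-vf[1], vf[0]))  # frequency order
    return {
        "median": weighted_percentile(by_item, 50),
        "percentiles": {p: weighted_percentile(by_item, p) for p in percentiles},
        "top": by_frequency[:top],
    }
```

Walking the values in item order with a running frequency total means the median and every percentile can be found without expanding each value by its frequency.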