mitre / quaerite

Search relevance evaluation toolkit

Internal H2 database #41

Open ehaubert opened 5 years ago

ehaubert commented 5 years ago

Could you add a little more information to the readme about the H2 database? Is it used for tracking data internal to the experiments?

Also - I'm looking to add some judgment-less comparison metrics (set distance, Jaccard coefficient, raw recall numbers, etc.) alongside basic recall. How deeply embedded are the judgments in Quaerite?
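To make that concrete, here's a minimal sketch of the kind of judgment-less set comparison I have in mind, in plain Java; the class and method names are invented and nothing here is Quaerite API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch -- not Quaerite API. Compares two ranked result
// lists (top-K doc ids) without any relevance judgments.
public class SetComparisons {

    // Jaccard coefficient: |A intersect B| / |A union B|, in [0, 1].
    public static double jaccard(List<String> runA, List<String> runB) {
        Set<String> union = new HashSet<>(runA);
        union.addAll(runB);
        if (union.isEmpty()) {
            return 1.0; // two empty result sets are trivially identical
        }
        Set<String> intersection = new HashSet<>(runA);
        intersection.retainAll(runB);
        return (double) intersection.size() / union.size();
    }

    // "Set distance" as 1 - Jaccard: 0 means identical sets, 1 disjoint.
    public static double setDistance(List<String> runA, List<String> runB) {
        return 1.0 - jaccard(runA, runB);
    }
}
```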

Will download and play with this also, but wanted to start a conversation. Thanks!

tballison commented 5 years ago

Updated readme...a bit: https://github.com/mitre/quaerite/blob/master/quaerite-examples/README.md

There are already some judgment-less metrics that allow comparisons.

What we don't have, and what would require some thought/implementation, are the judgment-less comparison metrics where you compare the result sets themselves, as you proposed above -- set distance, Jaccard, etc.

There are two areas for thought that I see:

1) We'd have to store the results (up to K) for each query for each run...not a problem, just create a new table.

2) Quaerite is intended to run a bunch of experiments. Would we want all pairwise set distance/Jaccard scores across all queries and all runs, or aggregates (mean? median?) of those scores, like we're currently doing with the sig_diffs confusion matrices...or both?
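For 1), a rough sketch of what a results table in the embedded H2 database might look like over plain JDBC. The table and column names are invented for illustration, and it assumes the H2 driver is on the classpath -- this is not an actual Quaerite schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical sketch of storing top-K results per (experiment, query)
// in an embedded H2 database. Table/column names are invented.
public class StoreResults {

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:h2:./quaerite-example", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS SEARCH_RESULTS (" +
                        "EXPERIMENT VARCHAR(256), " +
                        "QUERY_ID VARCHAR(256), " +
                        "DOC_RANK INT, " +        // 0-based position in the result list
                        "DOC_ID VARCHAR(1024))");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO SEARCH_RESULTS VALUES (?, ?, ?, ?)")) {
                ps.setString(1, "exp_bm25_title_boost"); // invented experiment name
                ps.setString(2, "q001");
                ps.setInt(3, 0);
                ps.setString(4, "doc_12345");
                ps.execute();
            }
        }
    }
}
```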

ehaubert commented 5 years ago

For #1 - hm, maybe add a flag. Now that I'm looking at this more carefully, you probably don't want it as the default. Spitting out (Q1, D1, E1, field1_score, field2_score, ...) may not be useful for all (most?) experiments. I'm thinking of a variant of this: https://github.com/mitre/quaerite/blob/master/quaerite-analysis/src/main/java/org/mitre/quaerite/analysis/CompareAnalyzers.java But maybe 'compare fields' rather than 'compare analyzers'?

For #2 - The use case is very similar to (maybe exactly the same as) the confusion matrices. I'm thinking of the case where we don't necessarily have a fixed definition of 'good', but want to flag which changes in input cause large changes in the result set: walking the parameter space and identifying where the result space has different-enough-to-care behavior; those are the combinations that warrant more investigation. So aggregating loses important information. I'm thinking ((Q1, D1, E1, metrics), (Q1, D1, E2, metrics), ...)
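Roughly this shape, sketched with invented names (not Quaerite code): one row per (query, experiment-pair), so nothing gets averaged away:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one metric row per (query, experiment-pair),
// so a query whose result set changed a lot stays individually visible.
public class PerQueryComparison {

    public static void compare(String queryId,
                               Map<String, List<String>> topKByExperiment) {
        // topKByExperiment: experiment name -> top-K doc ids for queryId
        List<String> experiments = new ArrayList<>(topKByExperiment.keySet());
        Collections.sort(experiments);
        for (int i = 0; i < experiments.size(); i++) {
            for (int j = i + 1; j < experiments.size(); j++) {
                String e1 = experiments.get(i);
                String e2 = experiments.get(j);
                double jac = jaccard(topKByExperiment.get(e1),
                                     topKByExperiment.get(e2));
                // Emit a row per (query, experiment-pair); low Jaccard
                // flags a parameter change that moved the result set.
                System.out.printf("%s\t%s\t%s\tjaccard=%.3f%n",
                        queryId, e1, e2, jac);
            }
        }
    }

    private static double jaccard(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }
}
```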

Really just digging in - apologies if this is what it is already doing.

tballison commented 5 years ago

See: https://github.com/mitre/quaerite/issues/43 on storing results

> But maybe 'compare fields' rather than 'compare analyzers'?

Please elaborate...not sure what you mean.

Aggregations at the experiment level...makes sense, and, yes, there is a model for running calculations after the experiments have all completed in the confusion matrices. It would require implementation, but it would be valuable.
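As a rough sketch of that post-experiment aggregation step (invented names, not Quaerite code), collapsing the per-query Jaccard scores for one experiment pair into summary statistics:

```java
import java.util.Arrays;

// Hypothetical sketch: run after all experiments complete, collapsing
// per-query Jaccard scores for one experiment pair into summary stats.
public class AggregateComparison {

    public static double mean(double[] perQueryScores) {
        return Arrays.stream(perQueryScores).average().orElse(Double.NaN);
    }

    public static double median(double[] perQueryScores) {
        double[] sorted = perQueryScores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n == 0) {
            return Double.NaN;
        }
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```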

tballison commented 5 years ago

> Really just digging in

When you find, um, surprises, I can nearly guarantee it is Quaerite, not you. This is still alpha, but I look forward to your help in getting this to beta1.

tballison commented 5 years ago

@ehaubert ...if you are still interested in this, see: https://github.com/mitre/quaerite/issues/45#issuecomment-505898229