stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Visualization for MS Marco #423

Closed rishibommasani closed 7 months ago

rishibommasani commented 2 years ago
  1. Visualization: should print out the probabilities
  2. Visualization: should group all the passages for a given question and show the ranking (ideally)

These may be more general things (at least the first one), but are especially relevant for this scenario.

percyliang commented 1 year ago

Now that ranking instances include all the candidates, we don't need to do anything special for showing the instances. However, we need to branch on whether the adaptation method is BINARY_RANKING and show the predictions (which are on individual references) in a meaningful way.
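
Roughly, the display logic would branch like this (a minimal sketch with made-up names; `ReferencePrediction` and `render_instance` are illustrative stand-ins, not actual HELM code):

```python
# Illustrative sketch only: how a display layer could branch on the adaptation
# method and show per-reference predictions for ranking runs. All names here
# (ReferencePrediction, render_instance) are hypothetical, not HELM's API.
from dataclasses import dataclass
from typing import List


@dataclass
class ReferencePrediction:
    reference_index: int
    passage: str
    logprob: float  # log-probability the model assigned to the "relevant" answer


def render_instance(adaptation_method: str, predictions: List[ReferencePrediction]) -> None:
    if adaptation_method == "binary_ranking":
        # Show every candidate passage for the question, ordered by logprob,
        # so the induced ranking is visible at a glance.
        for rank, pred in enumerate(
            sorted(predictions, key=lambda p: p.logprob, reverse=True), start=1
        ):
            print(f"[rank={rank}] logprob={pred.logprob:.3f} ref{pred.reference_index}: {pred.passage[:80]}")
    else:
        # Fall back to the usual single-prediction display.
        for pred in predictions:
            print(pred.passage)
```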

dilarasoylu commented 1 year ago

Sounds great, @percyliang! Are you referring to the model's ranking as the prediction? If so, we don't really record this information anywhere. Our options are:

  1. Re-compute the ranking on the JS side: this won't be hard, since all we are checking is the logprob and the answer tokens. It is a slightly ugly solution, as the frontend needs to know about the answer tokens, etc. (a rough sketch follows this list).
  2. Save "Reference Level Metrics" somewhere: Currently, we record instance level metrics, but don't keep track of reference level metrics.
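
For option 1, the logic would be roughly the following (written in Python for readability even though it would live in the JS frontend; the data layout and the "Yes" answer token are assumptions):

```python
# Sketch of option 1: re-deriving the ranking from raw per-reference predictions.
# The (reference_index, answer_token, logprob) layout and the "Yes" token are
# assumptions for illustration; the real code would live in the JS frontend.
from typing import Dict, List, Tuple


def compute_ranks(
    per_reference_outputs: List[Tuple[int, str, float]], positive_token: str = "Yes"
) -> Dict[int, int]:
    """Map reference_index -> rank from (reference_index, answer_token, logprob) triples."""

    def sort_key(item: Tuple[int, str, float]) -> Tuple[int, float]:
        _, answer_token, logprob = item
        # References the model labeled with the positive token come first,
        # ordered by descending logprob; the rest follow.
        return (0 if answer_token == positive_token else 1, -logprob)

    ordered = sorted(per_reference_outputs, key=sort_key)
    return {reference_index: rank for rank, (reference_index, _, _) in enumerate(ordered, start=1)}
```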

percyliang commented 1 year ago

I think we need to store something in per_instance_metrics.json (related to #905, which covers even just normal classification).

dilarasoylu commented 1 year ago

@percyliang, got it. This is slightly trickier for ranking, as we need a rank for each reference and the number of references isn't fixed.

I see a potential solution that involves string parsing following your example in #905: We can have a stat named f"rank_{reference_index}" for each reference, and set its value to the corresponding rank. How does this sound?
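
Concretely, something like this (a minimal sketch; how the values end up in per_instance_metrics.json is elided):

```python
# Minimal sketch of the proposal: one per-instance stat per reference, keyed by
# the reference index. The plain dict stands in for however per-instance stats
# are actually assembled and written to per_instance_metrics.json.
from typing import Dict, List


def per_reference_rank_stats(ranks: List[int]) -> Dict[str, int]:
    """ranks[i] is the rank assigned to reference i; returns stat name -> value."""
    return {f"rank_{reference_index}": rank for reference_index, rank in enumerate(ranks)}


# Example: per_reference_rank_stats([3, 1, 2]) -> {"rank_0": 3, "rank_1": 1, "rank_2": 2}
```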

percyliang commented 1 year ago

Yes, that will do for now. Eventually, I think we might want to have a more principled way of encoding this information. I'd call it "ref{reference_index}_rank" to be more descriptive.

dilarasoylu commented 1 year ago

Added in #1013

yifanmai commented 1 year ago

It seems to me that we'll need to re-run MS MARCO, and there's no way to generate the new per-instance stats from the existing information. Is this correct?

yifanmai commented 1 year ago

Never mind, I see that ref{reference_index}_rank already exists.

yifanmai commented 1 year ago

What remains here? This is how things currently look. Do we only want to display the top-ranked options?

[Screenshot: current display of the ranked references]

Also, I think we should remove ref{reference_index}_rank from the global metrics below:

[Screenshot: ref{reference_index}_rank entries in the global metrics table]

dilarasoylu commented 1 year ago

Thanks Yifan! It might be good to indicate the model's ordering in brackets as well.

How can we remove these from the global metrics? We compute them at the individual instance level, but they get averaged, which produces the global metrics.

yifanmai commented 1 year ago

The ranks are already in brackets (see "rank=492" in the last entry). Not sure why some entries have them and some don't.

For the global metrics, the easiest thing to do is probably to add a filter list on the frontend to filter out these metrics from the table.
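
Something like this, logic-wise (sketched in Python for consistency with the rest of this thread; the real filter would be on the JS side, and the regex assumes the ref{reference_index}_rank naming):

```python
# Sketch of the filtering idea: hide per-reference rank stats from the global
# (aggregated) metrics table. The real filter would live in the JS frontend;
# the regex assumes stats are named like "ref0_rank", "ref1_rank", ...
import re
from typing import List

PER_REFERENCE_RANK = re.compile(r"^ref\d+_rank$")


def filter_global_metric_names(metric_names: List[str]) -> List[str]:
    """Keep only the metric names that should appear in the global table."""
    return [name for name in metric_names if not PER_REFERENCE_RANK.match(name)]


# Example: filter_global_metric_names(["exact_match", "ref0_rank", "ref1_rank"])
# -> ["exact_match"]
```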

yifanmai commented 7 months ago

Closing because MS MARCO is deprecated-ish, i.e., it has been removed from Lite.