stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Train-test contamination #342

Closed teetone closed 9 months ago

teetone commented 2 years ago

Pilot:

rishibommasani commented 2 years ago

Assigning to Percy as reminder/placeholder until we find someone else to take on.

rishibommasani commented 2 years ago

@Tiiiger Maybe you could post updates here, so we can update this as needed. Will keep Percy as the assignee for now, but can switch over at some point if it makes sense.

Tiiiger commented 2 years ago

@teetone @percyliang Status:

Trying to evaluate the perplexity of common benchmark test/validation sets using the AI21 models.

A lot of test sets are too big to evaluate directly, so I am looking into how well subsampled estimates converge.
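The convergence check could be sketched roughly like this (a hypothetical illustration, not HELM code: `nlls` stands in for per-example negative log-likelihoods already obtained from a model, and the sizes are arbitrary):

```python
import math
import random

def mean_nll(nlls):
    """Mean negative log-likelihood per example; perplexity = exp(mean NLL)."""
    return sum(nlls) / len(nlls)

def subsample_convergence(nlls, sizes, seed=0):
    """Estimate perplexity on random subsamples of increasing size.

    Returns (size, perplexity) pairs; if the estimates stabilize as the
    subsample grows, the subsample can stand in for the full test set.
    """
    rng = random.Random(seed)
    results = []
    for n in sizes:
        sample = rng.sample(nlls, n)
        results.append((n, math.exp(mean_nll(sample))))
    return results

# Toy data standing in for a model's per-example NLLs on a large test set.
rng = random.Random(42)
nlls = [rng.gauss(2.0, 0.5) for _ in range(10_000)]

for n, ppl in subsample_convergence(nlls, [100, 1_000, 5_000]):
    print(f"n={n:>5}  perplexity ~ {ppl:.2f}")
```

If the perplexity at 1,000 and 5,000 examples agrees to within the noise you care about, evaluating the full set is unnecessary.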

rishibommasani commented 2 years ago

@Tiiiger Just to track the status so everyone's on board:

  1. Initial experiments with AI21 models seem promising
  2. Since there are various bottlenecks to further testing with OpenAI/Microsoft/Anthropic/EleutherAI (via Goose) models, the next step is testing with public models on HF like GPT-2

Is this right/anything else to add (incl. the experiments/figures whenever you get a chance)?

(Note: Moving this to P2 for now.)

Tiiiger commented 2 years ago

transferring to @fladhak

also I just realized I was a complete idiot pushing these to main directly. Should have created a separate branch and done a PR. Sorry for this.

yifanmai commented 9 months ago

Closing because contamination tracking is deprecated.