gururise closed 1 year ago
I don't know if my prompt is optimal, but here are the initial results:
Dataset | Model | SQuAD "Mini" (F1) | PIQA (acc) |
---|---|---|---|
Original Alpaca | samwit/alpaca7B-lora | 34.63 | 50.5 |
Cleaned Alpaca (Mar 27) | tloen/alpaca-lora-7b | 49.64 | 54.0 |
Isn't random-guess performance on PIQA around 50, since each question has only two candidate answers?
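For reference, a quick sanity check of that chance baseline. This is only a sketch: the function names are made up for illustration, and the number of evaluated examples is a placeholder, since the thread does not say how many examples were scored.

```python
import math
import random


def chance_level(n_choices: int = 2) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / n_choices


def chance_std_error(n_examples: int, n_choices: int = 2) -> float:
    """Standard error (in percentage points) of a random guesser's accuracy."""
    p = 1.0 / n_choices
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


def simulate_random_guesser(n_examples: int, n_choices: int = 2, seed: int = 0) -> float:
    """Empirical accuracy (in percent) of uniform guessing on synthetic labels."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_examples):
        label = rng.randrange(n_choices)  # synthetic gold label, not real PIQA data
        guess = rng.randrange(n_choices)  # uniform random prediction
        correct += int(guess == label)
    return 100.0 * correct / n_examples


if __name__ == "__main__":
    n = 1000  # placeholder example count; the real evaluation size is not stated
    print("chance level:", chance_level(2))
    print("std error:", chance_std_error(n, 2))
    print("simulated:", simulate_random_guesser(n, 2))
```

With two choices the chance level is 50, and the standard error shrinks as `1/sqrt(n_examples)`, so on a small evaluation subset a score a few points from 50 can still be within noise of random guessing.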
First attempt at adding the PIQA dataset (validation split) to the benchmark suite.
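A minimal sketch of how a two-choice accuracy over PIQA-style examples could be computed. This is not the code from this PR. It assumes each example is a dict with the fields `goal`, `sol1`, `sol2`, and `label` (the field names used by the Hugging Face `piqa` dataset), and `score_fn` is a hypothetical stand-in for whatever the model uses to score a candidate solution.

```python
from typing import Callable, Iterable, Mapping


def piqa_accuracy(
    examples: Iterable[Mapping],
    score_fn: Callable[[str, str], float],
) -> float:
    """Return accuracy in percent over two-choice examples.

    score_fn(goal, solution) is a hypothetical scorer: higher means the
    model considers the solution more plausible for the goal.
    """
    total = 0
    correct = 0
    for ex in examples:
        score_1 = score_fn(ex["goal"], ex["sol1"])
        score_2 = score_fn(ex["goal"], ex["sol2"])
        pred = 0 if score_1 >= score_2 else 1  # ties go to the first solution
        correct += int(pred == ex["label"])
        total += 1
    if total == 0:
        raise ValueError("no examples to evaluate")
    return 100.0 * correct / total


if __name__ == "__main__":
    # Toy data for illustration only; not taken from PIQA.
    toy_examples = [
        {"goal": "open a jar", "sol1": "twist the lid",
         "sol2": "hit it with a hammer until it shatters", "label": 0},
        {"goal": "dry wet hands", "sol1": "rub them on a wet sponge for a while",
         "sol2": "use a towel", "label": 1},
    ]
    # Toy scorer that simply prefers the shorter solution.
    print(piqa_accuracy(toy_examples, lambda goal, sol: -len(sol)))
```

A scorer that returns the same value for every candidate always picks `sol1`, which is a handy check that the harness reports roughly chance-level accuracy when the model carries no signal.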