Related to #51: just to keep in mind that, for now, we are okay keeping examples where models don't fully follow the instructions or start producing gibberish. Later on, though, we might want to flag these weird examples (e.g., with few-shot learning via SetFit), both to quantify how many there are and to compare how their prevalence changes across datasets, models, and decoding parameters.
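If we go that route, a minimal sketch of what the SetFit-based flagging could look like (the backbone checkpoint, example texts, and labels below are placeholders, not decided yet):

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Tiny hand-labeled set of generations: 1 = weird (off-instruction / gibberish), 0 = ok.
# These examples are made up for illustration.
train_ds = Dataset.from_dict({
    "text": [
        "Here are three bullet points summarizing the article: ...",
        "asdkjh qwe qwe qwe qwe qwe qwe qwe",
    ],
    "label": [0, 1],
})

# Few-shot classifier on top of a sentence-transformers backbone.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    column_mapping={"text": "text", "label": "label"},
)
trainer.train()

# Flag new generations; the share of 1s gives a rough "weirdness rate"
# per dataset / model / decoding configuration.
preds = model.predict([
    "Sure, here is the summary you asked for: ...",
    "the the the the the the the",
])
```

The nice part is that a handful of labeled examples per class should be enough to get a first estimate, and we can rerun the same classifier over each dataset/model/decoding sweep to compare rates.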