huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Evaluate on subset of evaluation dataset during training #15812

Closed: marinone94 closed this issue 2 years ago

marinone94 commented 2 years ago

🚀 Feature request

Allow evaluating the model on a random subset of the evaluation dataset at each evaluation step during training.

Motivation

While participating in the Robust Speech Recognition Challenge in January / February 2022, I noticed that evaluating the model on the whole evaluation dataset increased the training time considerably (Swedish Common Voice, in that specific case).

Sampling a subset of the evaluation set before training is not the optimal approach as it would lead to a less realistic model evaluation.

The same issue was discussed here.

Your contribution

I will update the Trainer and TrainingArguments classes, HFArgumentParser (if needed), tests, and docs to allow this feature, before submitting a Pull Request.

Usage example


from transformers import TrainingArguments

[...]

training_args = TrainingArguments(
    ...,
    eval_subset_ratio=0.1,
    eval_subset_seed=0
)

[...]
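To make the intent concrete, here is a rough, hypothetical sketch of the proposed semantics (eval_subset_ratio and eval_subset_seed are not existing TrainingArguments fields, and sample_eval_subset is a made-up helper): before each in-training evaluation, a fresh random subset is drawn from the evaluation dataset.

import random

def sample_eval_subset(eval_dataset, eval_subset_ratio, eval_subset_seed, eval_call_count):
    # Size of the per-evaluation subset, derived from the proposed ratio argument
    subset_size = max(1, int(len(eval_dataset) * eval_subset_ratio))
    # Offset the seed by the number of evaluations run so far, so each eval step
    # sees a different subset while runs stay reproducible
    rng = random.Random(eval_subset_seed + eval_call_count)
    indices = rng.sample(range(len(eval_dataset)), subset_size)
    return eval_dataset.select(indices)  # datasets.Dataset.select keeps only these rows
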
patil-suraj commented 2 years ago

cc @patrickvonplaten @anton-l

patrickvonplaten commented 2 years ago

Hey @marinone94,

Thanks for your issue! Do you think the max_eval_samples option (https://github.com/huggingface/transformers/blob/b7e292aebdb638e2238cd9febf8c09253195fb5d/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L181) is not enough here?
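
For reference, max_eval_samples in the example scripts statically truncates the eval split once, before training, roughly like this (paraphrased, not the exact lines of the script):

if data_args.max_eval_samples is not None:
    # One-time truncation: every evaluation then runs on the same leading examples
    eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))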

marinone94 commented 2 years ago

Hej @patrickvonplaten, by using max_eval_samples one always evaluates on the same subset (in the example, the first max_eval_samples examples, but it could just as well be a random selection; it would not change my point).

What I would like to offer (and use myself) is to dynamically sample max_eval_samples examples before each eval step, so that the model is not always evaluated on the same subset during training, and then run a full evaluation at the end of training. I think it would extend the library's functionality while remaining fully backward compatible.
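
A minimal sketch of how this could look with the current Trainer API, assuming the eval set is a datasets.Dataset (SubsetEvalTrainer, eval_fraction and subsample_eval are names I am making up here for illustration):

from transformers import Trainer

class SubsetEvalTrainer(Trainer):
    """Illustrative only: evaluate on a fresh random subset at each in-training evaluation."""

    def __init__(self, *args, eval_fraction=0.1, **kwargs):
        super().__init__(*args, **kwargs)
        self.eval_fraction = eval_fraction
        self.subsample_eval = True  # switch off before the final full evaluation

    def get_eval_dataloader(self, eval_dataset=None):
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        if self.subsample_eval:
            n = max(1, int(len(eval_dataset) * self.eval_fraction))
            # A step-dependent seed gives a different (but reproducible) subset per evaluation
            eval_dataset = eval_dataset.shuffle(seed=int(self.state.global_step)).select(range(n))
        return super().get_eval_dataloader(eval_dataset)

# After trainer.train(), run one full evaluation on the whole eval set:
# trainer.subsample_eval = False
# final_metrics = trainer.evaluate()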

patrickvonplaten commented 2 years ago

It's pretty difficult to compare randomly sampled eval datasets to each other, no? E.g. if epoch 1 gives me 40% WER on subset 1 and epoch 2 gives me 40% on subset 2, which is completely different from subset 1, then I can't really make much use of that information, no?

marinone94 commented 2 years ago

Well, if the model generalises well the trend should be similar, but I see your point.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

minesh1291 commented 1 year ago

I tried drawing a random subset of samples at each eval with the following code:

import numpy as np
import torch
import transformers


class CustomTrainer(transformers.Trainer):
    def get_eval_dataloader(self, eval_dataset=None) -> torch.utils.data.DataLoader:
        # Fall back to the trainer's eval dataset when none is passed explicitly
        eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
        use_n_samples = 25_000
        if use_n_samples < len(eval_dataset):
            # Draw a fresh random subset (without replacement) each time evaluation runs
            eval_idx = np.random.choice(
                range(0, len(eval_dataset)), size=use_n_samples, replace=False)
            eval_dataset_subset = torch.utils.data.Subset(
                eval_dataset, eval_idx)
        else:
            eval_dataset_subset = eval_dataset

        loader = torch.utils.data.DataLoader(
            eval_dataset_subset,
            batch_size=self.args.eval_batch_size,
            collate_fn=self.data_collator,
            num_workers=self.args.dataloader_num_workers,
            pin_memory=self.args.dataloader_pin_memory,
        )
        return loader
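
A hypothetical usage sketch for the class above (model, training_args, the datasets, and data_collator are assumed to be defined elsewhere in the script):

trainer = CustomTrainer(
    model=model,
    args=training_args,          # with evaluation_strategy="steps" and a suitable eval_steps
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,   # full eval set; a random 25k subset is drawn at each evaluation
    data_collator=data_collator,
)
trainer.train()
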
brando90 commented 1 year ago

FYI, the best solution I know of:

    # - Get eval data set (AF for us), https://huggingface.co/datasets/brando/debug1_af
    per_device_eval_batch_size = 4  # TODO: change to something larger, right now due to size of my debug0
    eval_accumulation_steps=1
    eval_steps=1
    # eval_steps=1000
    # TODO: probably need to write a collate_fn for the eval so that the eval is done right?
    # TODO: we need ppl (and ideally token edit distance for eval, reason explained here: https://arxiv.org/abs/2304.15004)
    path, name = 'brando/debug1_af', None
    eval_dataset = load_dataset(path, name, streaming=False, split="test").with_format(type="torch") 
    eval_dataset = eval_dataset.select(range(per_device_eval_batch_size))  # keep only a small fixed subset of the eval split (the first per_device_eval_batch_size examples)
    ## eval_dataset = train_dataset  # TODO: fix obviously to something else using af
    raw_text_batch = eval_dataset.take(per_device_eval_batch_size) if streaming else eval_dataset.select(range(per_device_eval_batch_size))
    print(f'{raw_text_batch=}')
    print(f'{next(iter(raw_text_batch))=}')
    column_names = next(iter(raw_text_batch)).keys()
    def eval_preprocess(examples):
        # Concatenate each formal statement with its generated informal statement element-wise
        # (with batched=True, examples[...] are lists, so list + [' '] + list would be wrong)
        texts = [f + ' ' + g for f, g in zip(examples["formal statement"], examples["generated informal statement"])]
        return tokenizer(texts, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
    remove_columns = column_names  # remove all keys that are not tensors to avoid bugs in collate function in task2vec's pytorch data loader
    def tokenize_dataset(ds):
        return ds.map(eval_preprocess, batched=True, remove_columns=remove_columns)
    eval_dataset = tokenize_dataset(eval_dataset)
    train_dataset = train_dataset

    # -- Compute max steps
    per_device_train_batch_size = batch_size
    print(f'{per_device_train_batch_size=}')
    # dataset_size: int = int(1.5e12)  # TODO: doesn't seem easy to solve. Either count all the sequences/rows, have the metadata include this, or make this number huge.
    dataset_size: int = train_dataset.num_rows
    # dataset_size: int = len(train_dataset)
    # TODO dataset.info['split']['train']['num_examples']
    # dataset_size = sum(len(dataset) for dataset in datasets)  # TODO: works on with streaming = False?
    # dataset_size = sum(dataset.cardinality() for dataset in datasets)
    print(f'{dataset_size=}')
    # # TODO: feel free to fix the issue if I'm not seeing all the data points...
    # num_epochs = 1
    max_steps = (dataset_size // per_device_train_batch_size) * num_epochs
    print(f'{num_epochs=} {max_steps=}')
    ## DOESN'T WORK: num_train_epochs = 3  # TODO: since I decided to use streaming = False, if we collect enough data it's unlikely we see it all (and seeing it 3 times seems fine given that LLMs are usually trained to see the data only once; plus, in the ImageNet days models were trained to convergence without overfitting, ref: https://arxiv.org/abs/1801.00173)

    # -- Define custom collate function
    def custom_collate_fn(data: list[dict[str, str]], tokenizer: PreTrainedTokenizer) -> dict[str, torch.Tensor]:
        """ trains on first occurence of eos

        ref: https://discuss.huggingface.co/t/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/45954/13?u=brando 
        ref: https://chat.openai.com/share/02d16770-a1f3-4bf4-8fc2-464286daa8a1
        ref: https://claude.ai/chat/80565d1f-ece3-4fad-87df-364ce57aec15 on when to call .clone()
        """
        # we are training at full context length for llama, so remove the code below; if it tries to pad, hopefully it throws an error
        # -- Ensure tokenizer has a padding token
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        # -- Extract sequences
        # sequences: list[str] = [example.get("text", "") or "" for example in data]
        sequences: list[str] = []
        for idx, example in enumerate(data):
            # Retrieve the value for "text" from the dictionary or default to an empty string if not present or falsy. ref: https://chat.openai.com/share/bead51fe-2acf-4f05-b8f7-b849134bbfd4
            text: str = example.get("text", "") or ""
            sequences.append(text)
        # -- Tokenize the sequences
        tokenized_data = tokenizer(sequences, padding="max_length", max_length=max_length, truncation=True, return_tensors="pt")
        tokenized_data["labels"] = tokenized_data["input_ids"].clone()  # labels is hardcoded in HF so put it!
        # -- Set the mask value for the first eos_token in each sequence to 1
        eos_token_id = tokenizer.eos_token_id
        for idx, input_ids in enumerate(tokenized_data["input_ids"]):
            # Find all occurrences of eos_token
            eos_positions = (input_ids == eos_token_id).nonzero(as_tuple=True)[0]
            if eos_positions.nelement() > 0:  # Check if eos_token is present
                first_eos_position = eos_positions[0]
                tokenized_data["attention_mask"][idx, first_eos_position] = 1  # Set the mask value to 1

                # Assert that the label for the first occurrence of eos_token is eos_token_id
                assert tokenized_data["labels"][idx, first_eos_position] == eos_token_id, "The label for the first eos_token is incorrect!"

                # For all subsequent occurrences of eos_token, set their labels to -100
                for subsequent_eos_position in eos_positions[1:]:
                    tokenized_data["labels"][idx, subsequent_eos_position] = -100
                    assert tokenized_data["labels"][idx, subsequent_eos_position] == -100, "The label for the subsequent_eos_position incorrect! Should be -100."
        return tokenized_data

    # - Debug before training to see data
    sample_data = train_dataset.select(range(per_device_train_batch_size)) if not isinstance(train_dataset, datasets.iterable_dataset.IterableDataset) else train_dataset.take(per_device_train_batch_size)
    processed_data = custom_collate_fn(sample_data, tokenizer=tokenizer)
    print(f'{processed_data=}')

    # -- Training arguments and trainer instantiation ref: https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments
    output_dir = Path(f'~/data/maf_data/results_{today}/').expanduser() if not debug else Path(f'~/data/maf_data/results/').expanduser()
    print(f'{debug=} {output_dir=} \n {report_to=}')
    training_args = TrainingArguments(
        output_dir=output_dir,  #The output directory where the model predictions and checkpoints will be written.
        # num_train_epochs = num_train_epochs, 
        max_steps=max_steps,  # TODO: hard to fix, see above
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,  # based on alpaca https://github.com/tatsu-lab/stanford_alpaca, allows to process effective_batch_size = gradient_accumulation_steps * batch_size, num its to accumulate before opt update step
        gradient_checkpointing = gradient_checkpointing,  # TODO depending on hardware set to true?
        optim="paged_adamw_32bit",  # David hall says to keep 32bit opt https://arxiv.org/pdf/2112.11446.pdf TODO: if we are using brain float 16 bf16 should we be using 32 bit? are optimizers always fb32?  https://discuss.huggingface.co/t/is-there-a-paged-adamw-16bf-opim-option/51284
        warmup_steps=500,  # TODO: once real training starts we can select this number for llama v2, what does llama v2 do to make it stable while v1 didn't?
        warmup_ratio=0.03,  # copying alpaca for now, number of steps for a linear warmup, TODO once real training starts change? 
        # weight_decay=0.01,  # TODO once real training change?
        weight_decay=0.00,  # TODO once real training change?
        learning_rate = 1e-5,  # TODO once real training change? anything larger than -3 I've had terrible experiences with
        max_grad_norm=1.0, # TODO once real training change?
        lr_scheduler_type="cosine",  # TODO once real training change? using what I've seen most in vision 
        logging_dir=Path('~/data/maf/logs').expanduser(),
        save_steps=2000,  # alpaca does 2000, other defaults were 500
        # logging_steps=250,
        # logging_steps=50,  
        logging_steps=1,
        remove_unused_columns=False,  # TODO don't get why https://stackoverflow.com/questions/76879872/how-to-use-huggingface-hf-trainer-train-with-custom-collate-function/76929999#76929999 , https://claude.ai/chat/475a4638-cee3-4ce0-af64-c8b8d1dc0d90
        report_to=report_to,  # change to wandb!
        fp16=False,  # never ever set to True
        bf16=torch.cuda.get_device_capability(torch.cuda.current_device())[0] >= 8,  # compute capability >= 8 means bfloat16 is available; set to False if you always want fp32
        evaluation_strategy='steps',
        per_device_eval_batch_size=per_device_eval_batch_size,
        eval_accumulation_steps=eval_accumulation_steps,
        eval_steps=eval_steps,
    )
    # print(f'{training_args=}')
    print(f'{pretrained_model_name_or_path=}')

    # TODO: might be nice to figure out how llama v2 counts the number of tokens they've trained on
    print(f'{train_dataset=}')
    print(f'{eval_dataset=}')
    trainer = Trainer(
        model=model,
        args=training_args,  
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=lambda data: custom_collate_fn(data, tokenizer=tokenizer)
    )

All links related to this Q:

WhenMelancholy commented 3 months ago

[quotes brando90's comment above in full]

Hi, would you mind pointing out which line of the code implements the evaluation subset?