lxe / simple-llm-finetuner

Simple UI for LLM Model Finetuning
MIT License

In trainer.py, ignoring the last token is not suitable for all situations. #53

Open HCTsai opened 1 year ago

HCTsai commented 1 year ago

In trainer.py, ignoring the last token is not suitable for all situations.

    def tokenize_sample(self, item, max_seq_length, add_eos_token=True):
        assert self.tokenizer is not None
        result = self.tokenizer(
            item["text"],
            truncation=True,
            max_length=max_seq_length,
            padding="max_length",
        )

        # ignore the last token ([:-1])
        result = {
            "input_ids": result["input_ids"][:-1],
            "attention_mask": result["attention_mask"][:-1],
        }

https://github.com/lxe/simple-llm-finetuner/blob/3c3ae84e5dee5a1d40f17e5567938dfdffce9d16/trainer.py#LL150C9-L153C10
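A safer variant would drop the trailing token only when it is padding, so a sample that fills `max_seq_length` exactly keeps its final content token (typically the EOS token). This is a minimal sketch of that idea, not the repository's actual code; the helper name and the standalone signature are assumptions for illustration:

```python
def drop_trailing_pad(input_ids, attention_mask, pad_token_id):
    """Drop the last token only if it is a pad token.

    Sketch of a conditional alternative to the unconditional [:-1] slice
    in tokenize_sample: a full-length sample keeps its final token.
    """
    if input_ids and input_ids[-1] == pad_token_id:
        return input_ids[:-1], attention_mask[:-1]
    return input_ids, attention_mask

# Right-padded sample: trailing pad (id 0) is safe to drop.
ids, mask = drop_trailing_pad([5, 6, 7, 2, 0], [1, 1, 1, 1, 0], pad_token_id=0)
print(ids)   # [5, 6, 7, 2] -- EOS (id 2) survives

# Full-length sample: nothing is dropped, EOS is preserved.
ids, mask = drop_trailing_pad([5, 6, 7, 2], [1, 1, 1, 1], pad_token_id=0)
print(ids)   # [5, 6, 7, 2]
```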

If a user of the web UI trains on a custom dataset, they will not know that the last token of each training sample is truncated, and the prediction results will be unexpected.
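The failure mode can be illustrated with a toy tokenizer (the token IDs and padding behavior here are assumptions for illustration, not the real tokenizer): when the text is shorter than `max_seq_length`, the `[:-1]` slice only removes a pad token, but when the text fills `max_seq_length` exactly, it silently removes the final content token, which is often the EOS token.

```python
EOS, PAD = 2, 0  # hypothetical special-token IDs

def toy_tokenize(ids, max_seq_length):
    # Truncate, append EOS, then right-pad to max_seq_length,
    # mimicking truncation=True, padding="max_length".
    ids = ids[:max_seq_length - 1] + [EOS]
    return ids + [PAD] * (max_seq_length - len(ids))

short = toy_tokenize([5, 6, 7], max_seq_length=8)
full  = toy_tokenize([5, 6, 7, 8, 9, 10, 11], max_seq_length=8)

print(short)       # [5, 6, 7, 2, 0, 0, 0, 0]
print(short[:-1])  # [5, 6, 7, 2, 0, 0, 0] -- only a pad dropped
print(full)        # [5, 6, 7, 8, 9, 10, 11, 2]
print(full[:-1])   # [5, 6, 7, 8, 9, 10, 11] -- EOS itself dropped
```

The user sees no warning in either case, which is why the truncation should at least be conditional or configurable.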