loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0

Question (not issue) related to dataset generation #14

Closed insop closed 11 months ago

insop commented 1 year ago

Hi @loubnabnl

Thank you so much for this nice repo for running finetuning.

I have one question and did not find a better way to communicate, so feel free to answer and then close this issue.

In the following code, `input_ids` and `labels` are identical for supervised fine-tuning. Is there a model or training parameter somewhere that marks this as causal LM training, so the labels get shifted by one and `input_ids`/`labels` become a next-token prediction task?

```python
...
for example in examples:
    self.current_size += 1
    yield {
        "input_ids": torch.LongTensor(example),
        "labels": torch.LongTensor(example),
    }
```
loubnabnl commented 1 year ago

Yes, the labels are shifted inside `transformers`, here
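
For context, a minimal sketch of what that shift looks like (a toy illustration, not the repo's or `transformers`' actual code): a causal-LM loss drops the last logit and the first label, so even when `input_ids` and `labels` are passed in identical, the model is trained to predict token *t+1* from positions up to *t*.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy values for illustration.
vocab_size = 10
input_ids = torch.tensor([[3, 1, 4, 1, 5]])
labels = input_ids.clone()  # labels == input_ids, as in the dataset code above
logits = torch.randn(1, input_ids.size(1), vocab_size)  # stand-in model output

# The shift: logits at position t are scored against the label at t+1.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()

loss = F.cross_entropy(
    shift_logits.view(-1, vocab_size),
    shift_labels.view(-1),
)
print(shift_labels.tolist())  # [[1, 4, 1, 5]] — each position's next token
```

So the dataset can safely yield identical `input_ids` and `labels`; the alignment for next-token prediction happens inside the model's loss computation.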

insop commented 11 months ago

> Yes the labels are shifted in transformers here

Thank you @loubnabnl !