mindspore-ai / mindspore

MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
https://gitee.com/mindspore/mindspore

GPU memory grows without limit when training a model with MindSpore #302

Open Veteranback opened 1 month ago

Veteranback commented 1 month ago

I am migrating code from PyTorch to MindSpore. In MindSpore I have to compute the gradients myself and apply them with opt(grad). During actual training, however, I find that the program's GPU memory grows without limit; my guess is that the gradients are not being applied (or released) correctly.
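My understanding of the basic functional training pattern, from the MindSpore tutorials, is roughly the following (a minimal self-contained sketch; the Dense network, loss, optimizer, and data here are placeholders, not my actual model):

import mindspore
from mindspore import nn, ops

net = nn.Dense(16, 4)                     # placeholder network
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.Adam(net.trainable_params(), learning_rate=1e-3)

def forward_fn(data, label):
    logits = net(data)
    loss = loss_fn(logits, label)
    return loss                           # a scalar Tensor

# None: no gradients w.r.t. inputs; weights come from the optimizer's parameter list
grad_fn = mindspore.value_and_grad(forward_fn, None, optimizer.parameters)

def train_step(data, label):
    loss, grads = grad_fn(data, label)
    optimizer(grads)                      # apply the gradients
    return loss

data = ops.randn(8, 16)
label = ops.randint(0, 4, (8,), dtype=mindspore.int32)
print(train_step(data, label))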

The model's forward function (construct) is as follows:

def construct(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        detect_labels=None,
        correct_labels=None
):
    # encoder hidden states: (batch, seq_len, hidden_size)
    hidden_states = self.bert(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids)[0]
    detect_outputs = self.tag_detect_projection_layer(hidden_states)
    correct_outputs = self.tag_label_projection_layer(hidden_states)

    result = {
        "detect_outputs": detect_outputs,
        "correct_outputs": correct_outputs,
        "detect_loss": None,
        "correct_loss": None,
        "loss": None,
    }

    loss = None
    if detect_labels is not None and correct_labels is not None:
        # flatten (batch, seq_len, vocab) -> (batch*seq_len, vocab) for the criterion
        detect_loss = self._detect_criterion(
            detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1))
        correct_loss = self._correct_criterion(
            correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1))
        loss = detect_loss + correct_loss
        result["detect_loss"] = detect_loss
        result["correct_loss"] = correct_loss
    elif detect_labels is not None:
        loss = self._detect_criterion(
            detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1))
    elif correct_labels is not None:
        loss = self._correct_criterion(
            correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1))

    result["loss"] = loss
    return result
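For clarity, the .view calls in the loss computation flatten the batch and sequence dimensions before the criterion; a tiny sketch with made-up shapes, independent of my model:

import mindspore
from mindspore import nn, ops

batch, seq_len, vocab = 2, 8, 5
logits = ops.randn(batch, seq_len, vocab)   # (batch, seq_len, vocab)
labels = ops.randint(0, vocab, (batch, seq_len), dtype=mindspore.int32)
criterion = nn.CrossEntropyLoss()
# flatten to (batch*seq_len, vocab) and (batch*seq_len,)
loss = criterion(logits.view(-1, vocab), labels.view(-1))
print(loss)  # scalar Tensor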

The training functions are defined as follows:


------------------------------------
def forward_fn(self, batch_data):
    detect_labels = batch_data[3]
    correct_labels = batch_data[4]
    output = self.model(batch_data[0],
                        batch_data[1],
                        batch_data[2],
                        detect_labels,
                        correct_labels)
    return output
--------------------------------------------
self.optimizer = AdamW(self.model.trainable_params(), lr=args.learning_rate, eps=args.adam_epsilon)
grad_fn = mindspore.value_and_grad(self.forward_fn, None, self.optimizer.parameters, has_aux=False)
for epoch in range(1, self.epochs + 1):
    for step, batch_data in enumerate(self.train_loader):
        output, grad = grad_fn(batch_data)
        loss = output['loss'].mean()
        # clip_by_norm is not in-place; keep its return value
        grad = mindspore.ops.clip_by_norm(grad, max_norm=self.args.max_grad_norm)
        self.optimizer(grad)
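I also wondered whether I should instead return the scalar loss from forward_fn and carry the result dict through has_aux=True, along these lines (an untested sketch based on my code above):

def forward_fn(self, batch_data):
    output = self.model(batch_data[0], batch_data[1], batch_data[2],
                        batch_data[3], batch_data[4])
    # only the first return value is differentiated; the dict rides along as aux
    return output['loss'].mean(), output

grad_fn = mindspore.value_and_grad(self.forward_fn, None,
                                   self.optimizer.parameters, has_aux=True)
for epoch in range(1, self.epochs + 1):
    for step, batch_data in enumerate(self.train_loader):
        (loss, output), grads = grad_fn(batch_data)
        grads = mindspore.ops.clip_by_norm(grads, max_norm=self.args.max_grad_norm)
        self.optimizer(grads)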

Could someone tell me whether there is a problem with my code?