Open Veteranback opened 1 month ago
我在用pytorch迁移到mindspore上来,目前发现mindspore需要自己计算出梯度,通过opt(grad)进行更新。但是在实际训练中,我发现程序占据显存无限制的增长,个人推测是梯度没有更新导致的。
模型的运算函数如下:
def construct( self, input_ids=None, attention_mask=None, token_type_ids=None, detect_labels=None, correct_labels=None ): hidden_states = self.bert( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0] detect_outputs = self.tag_detect_projection_layer(hidden_states) correct_outputs = self.tag_label_projection_layer(hidden_states) result = { "detect_outputs": detect_outputs, "correct_outputs": correct_outputs, "detect_loss": None, "correct_loss": None, "loss": None, } loss = None if detect_labels is not None and correct_labels is not None: detect_loss = self._detect_criterion(detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1)) correct_loss = self._correct_criterion( correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1)) loss = detect_loss + correct_loss result["detect_loss"] = detect_loss result["correct_loss"] = correct_loss elif detect_labels is not None: loss = self._detect_criterion( detect_outputs.view(-1, self.args.detect_vocab_size), detect_labels.view(-1)) elif correct_labels is not None: loss = self._correct_criterion( correct_outputs.view(-1, self.args.correct_vocab_size), correct_labels.view(-1)) result["loss"] = loss #output=result return result
训练时定义的函数如下:
------------------------------------ def foward_fn(self,batch_data): detect_labels = batch_data[3] correct_labels = batch_data[4] output = self.model(batch_data[0], batch_data[1], batch_data[2], detect_labels, correct_labels) return output -------------------------------------------- self.optimizer=AdamW(self.model.trainable_params(), lr=args.learning_rate, eps=args.adam_epsilon) grad_fn=mindspore.value_and_grad(self.foward_fn,None,self.optimizer.parameters,has_aux=False) for epoch in range(1, self.epochs + 1): for step, batch_data in enumerate(self.train_loader): output,grad=grad_fn(batch_data) loss=output['loss'].mean() mindspore.ops.clip_by_norm(x=grad, max_norm=self.args.max_grad_norm) self.optimizer(grad)
请问一下是不是我的代码存在问题呢?
我在用pytorch迁移到mindspore上来,目前发现mindspore需要自己计算出梯度,通过opt(grad)进行更新。但是在实际训练中,我发现程序占据显存无限制的增长,个人推测是梯度没有更新导致的。
模型的运算函数如下:
训练时定义的函数如下:
请问一下是不是我的代码存在问题呢?