Closed klock18 closed 4 years ago
UPDATE: I found this correction on Kaggle, but now I can only get through ~7 training steps before CUDA runs out of memory. Any advice for that?
```python
# Build the target dict the model expects ('bbox' / 'cls' keys)
target_res = {}
boxes = [target['boxes'].to(self.device).float() for target in targets]
labels = [target['labels'].to(self.device).float() for target in targets]
target_res['bbox'] = boxes
target_res['cls'] = labels

self.optimizer.zero_grad()
outputs = self.model(images, target_res)
loss = outputs['loss']
loss.backward()
# with amp.scale_loss(loss, self.optimizer) as scaled_loss:
#     scaled_loss.backward()
summary_loss.update(loss.detach().item(), batch_size)
```
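One generic way to trade memory for steps when CUDA runs out after a few batches is gradient accumulation: split each batch into micro-batches and accumulate gradients before the optimizer step. A toy, framework-free sketch (all names hypothetical, not from the trainer above) showing that accumulated micro-batch gradients equal the full-batch gradient:

```python
# Fit y = w*x with loss L = mean((w*x - y)^2); compare one full-batch
# gradient against the same gradient accumulated over micro-batches.

def grad(w, xs, ys):
    # dL/dw over the given samples
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch):
    # Sum per-micro-batch gradients weighted by micro-batch size, then
    # normalize -- the same bookkeeping loss.backward() does between
    # optimizer steps when you scale each micro-batch loss.
    total = 0.0
    for i in range(0, len(xs), micro_batch):
        xb, yb = xs[i:i + micro_batch], ys[i:i + micro_batch]
        total += grad(w, xb, yb) * len(xb)
    return total / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # exact fit at w = 2
full = grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch=2)
# full and accum are identical, so smaller micro-batches give the same
# update while only ever holding micro_batch samples' activations at once.
```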
@klock18 training these networks is quite GPU-memory intensive, and the memory consumption takes a number of steps to stabilize. The only advice I have is: definitely don't disable AMP, since it roughly halves memory use, and then reduce the batch size or possibly lower the resolution in your model config.
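For reference, the apex `amp.scale_loss` context in the traceback below has a native-PyTorch equivalent in `torch.cuda.amp`. A minimal sketch of that pattern, assuming a stand-in model and data rather than the EfficientDet trainer from this thread (it falls back to plain fp32 on CPU):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler guards against fp16 gradient underflow; enabled=False makes
# it a pass-through so the same code runs unchanged on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

images = torch.randn(4, 8, device=device)    # small batch to limit memory
targets = torch.randn(4, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    # Forward pass runs in mixed precision inside autocast
    loss = torch.nn.functional.mse_loss(model(images), targets)
scaler.scale(loss).backward()   # replaces the amp.scale_loss(...) context
scaler.step(optimizer)
scaler.update()
```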
Hi!
I am having the same issue as: https://github.com/rwightman/efficientdet-pytorch/issues/42
I tried your fix, but I don't think I applied it correctly, because I'm getting this error:

```
Traceback (most recent call last):
  File "train_baseline.py", line 473, in <module>
    run_training()
  File "train_baseline.py", line 456, in run_training
    fitter.fit(train_loader, val_loader)
  File "train_baseline.py", line 294, in fit
    summary_loss = self.train_one_epoch(train_loader)
  File "train_baseline.py", line 363, in train_one_epoch
    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
  File "/home/loc103/anaconda3/envs/stac/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/loc103/anaconda3/envs/stac/lib/python3.7/site-packages/apex/amp/handle.py", line 113, in scale_loss
    yield (loss.float())*loss_scale
AttributeError: 'str' object has no attribute 'float'
```
I appreciate any help you could give me!