Closed: dangalea closed this issue 1 year ago
Bumping this @matthewdeng @justinvyu
I changed my `eval_model()` method so that it no longer processes the testing data as a whole. It now looks like this:
```python
import numpy as np
import torch

# weighted_mse_loss, weighted_mae_loss, iou_case and ray_pool are defined elsewhere.

def eval_model(model, val_loader, loss_fn, loss_config, weights):
    running_vloss = 0.
    all_miou_bg = []
    all_miou_ar = []
    with torch.no_grad():
        batch_iter = enumerate(val_loader)
        pool = ray_pool(10)
        for i, data in batch_iter:
            val_preds = []
            val_labels = []
            inputs, labels = data
            inputs = inputs.cuda()
            labels = labels.cuda()
            outputs = model(inputs.float())
            if loss_config != "CE":
                # Convert index labels to a one-hot layout matching the model output.
                new_labels = torch.zeros_like(outputs)
                new_labels[:, 0][labels == 0] = 1
                new_labels[:, 1][labels == 1] = 1
                labels = new_labels.cuda()
            if loss_config == "W-MSE":
                vloss = weighted_mse_loss(outputs.float(), labels.float(), weights).item()
            elif loss_config == "W-MAE":
                vloss = weighted_mae_loss(outputs.float(), labels.float(), weights).item()
            elif loss_config in ["BCE", "MSE", "MAE"]:
                vloss = loss_fn(outputs.float(), labels.float()).item()
            else:
                vloss = loss_fn(outputs, labels.long()).item()
            running_vloss += vloss * len(inputs) / len(val_loader)
            for output in outputs.cpu().detach().numpy():
                val_preds.append(output)
            for label in labels.cpu().detach().numpy():
                val_labels.append(label)
            # Compute the IoU for this batch only, so predictions/labels for the
            # whole validation set are never held in memory at once.
            miou_background, miou_ar = mIoU(pool, val_preds, val_labels)
            all_miou_bg.extend(miou_background)
            all_miou_ar.extend(miou_ar)
        pool.close()
    return np.mean(all_miou_bg), np.mean(all_miou_ar), running_vloss


def mIoU(pool, preds, labels):
    inputs = []
    for i in range(len(preds)):
        inputs.append([preds[i], labels[i]])
    results = pool.starmap(iou_case, inputs, chunksize=1)
    iou_background = []
    iou_ar = []
    for result in results:
        iou_bg_case, iou_ar_case = result
        iou_background.append(iou_bg_case)
        iou_ar.append(iou_ar_case)
    return iou_background, iou_ar
```
This seems to have solved the problem, since the whole test dataset is no longer held in memory at once but is processed batch by batch.
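For context, the `ray_pool` and `iou_case` helpers are not included in the snippet above. Below is a minimal sketch of what they could look like, assuming `ray_pool` wraps Ray's multiprocessing-compatible `Pool` and that `iou_case` receives a 2-channel prediction and a one-hot label for a single sample; both of these are assumptions, not code from the issue.

```python
# Sketch only: ray_pool and iou_case are assumed implementations, not the reporter's code.
import numpy as np
from ray.util.multiprocessing import Pool


def ray_pool(num_workers):
    # Ray's Pool mirrors multiprocessing.Pool, so pool.starmap(...) works as in mIoU().
    return Pool(processes=num_workers)


def iou_case(pred, label):
    # pred, label: arrays of shape (2, H, W); channel 0 = background, channel 1 = AR.
    # Assumes one-hot labels; the CE case (index labels) would need a separate branch.
    pred_mask = pred.argmax(axis=0)
    label_mask = label.argmax(axis=0)
    ious = []
    for cls in (0, 1):
        p = pred_mask == cls
        l = label_mask == cls
        union = np.logical_or(p, l).sum()
        inter = np.logical_and(p, l).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return ious[0], ious[1]
```

With a pool like this, the `pool.starmap(iou_case, inputs, chunksize=1)` call in `mIoU()` fans the per-sample IoU computation out to Ray workers rather than running it in a single process.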
What happened + What you expected to happen
I am trying to run an HPO job on 30 nodes with 2 GPUs each, i.e. 60 GPUs in total, where each node has 72 CPUs. Unfortunately, I am running into OOM issues from Ray.
I am logging my results to wandb, and from there I can see that Process Memory Available (non-swap) and System Memory Utilization both go to zero, so it is not a GPU memory issue. When I run any of the failed trials standalone I do not get any issues, so I think that Ray is either processing data (via my dataloaders) differently than expected, or it has some overhead that I am not accounting for. I have tried varying the batch size, but that does not solve the problem. I have also tried running with and without the `max_concurrent` flag, but the issue still persists. That said, Ray's object_store_memory (obtained via `ray status`) is non-zero (but below 150 GB) when the flag is set, and zero otherwise. Would you be able to help?

Versions / Dependencies
Reproduction script
Issue Severity
High: It blocks me from completing my task.
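As a side note for readers, the issue does not show how the `max_concurrent` flag was applied. In Ray Tune 2.x, trial concurrency and per-trial resources are typically capped along the following lines; this is a sketch with placeholder trainable, search space, and resource numbers, not the reporter's actual setup.

```python
# Sketch only: placeholder names and numbers, assuming Ray Tune 2.x.
from ray import tune


def train_fn(config):
    # Placeholder trainable: build dataloaders/model here and report metrics
    # from inside the training loop.
    ...


tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 8, "gpu": 2}),  # resources reserved per trial
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(
        num_samples=60,
        max_concurrent_trials=30,  # analogous to the max_concurrent flag mentioned above
    ),
)
results = tuner.fit()
```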