Jwenyi opened this issue 2 years ago
Hi, I'm not really sure why it's not working. The code for setting a device is here, so you could try to set it explicitly with model.set_device(torch.device("cuda:0"))? Or maybe just call model.device to see what device it's choosing?
The way the code works is by calling compute_metric on each batch of data in the training loop. As you can see from the compute_metric code, this calls self.to_device on both the input and the target, moving them to the GPU. The function set_device is responsible for moving the network parameters to the GPU.
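As a rough illustration, here is a minimal sketch of that pattern (not the actual torchtuples/pycox implementation, just the idea):
import torch

# Minimal sketch of the pattern described above; the network, batch, and
# loss here are placeholders, not the real pycox objects.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = torch.nn.Linear(10, 1)
net.to(device)                                            # what set_device does for the network parameters
batches = [(torch.randn(4, 10), torch.randn(4, 1))]       # stand-in for the training dataloader
for input, target in batches:
    input, target = input.to(device), target.to(device)   # what to_device does for each batch
    loss = torch.nn.functional.mse_loss(net(input), target)  # loss/metric then runs on the same device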
I've only tested this on linux machines, but we should probably make sure it also works on windows.
Don't know if any of this helps you debug.
Hi Havard, I made a model with pycox as you described before. After making the model I used set_device to move it to the GPU. CUDA is available, but during training it doesn't use the GPU. This is the code after making the model:
model = LogisticHazard(net, tt.optim.Adam(0.01), duration_index=labtrans.cuts)
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device_count())
model.set_device(torch.device("cuda:0"))
callbacks = [tt.cb.EarlyStopping(patience=10)]
epochs = 500
verbose = True
log = model.fit_dataloader(dl_train, epochs, callbacks, verbose, val_dataloader=dl_val)
I wanted to ask: to send the model to the GPU, do we just need to use model.set_device(torch.device("cuda:0"))? I use Ubuntu as OS, Python 3.9, and an RTX 3090 as GPU.
You shouldn't really need to do anything to use the GPU. My suggestions above were just for debugging. Can you provide the output of your print statements? Also, you should be able to check which device the parameters in the net are on.
When you say that training doesn't use the GPU, how do you check that?
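For example, something along these lines (a sketch; the exact structure of the batches in dl_train is an assumption here):
import torch

# Debugging sketch: check where the network parameters and one batch
# actually live. The batch structure of dl_train is an assumption.
print(next(net.parameters()).device)                        # should be cuda:0 after set_device
batch = next(iter(dl_train))
print([t.device for t in batch if torch.is_tensor(t)])      # devices of the tensors in one batch
While training runs, nvidia-smi shows whether the GPU is actually being utilized.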
Hi @havakv, in my case I can see that I have set up the GPU correctly, and during model training the GPU memory is allocated, but the utilization is always 0%. Can you help me clarify the problem? My environment is Ubuntu.
Code:
net = DenseNet121(spatial_dims=3, in_channels=2, out_channels=labtrans.out_features).cuda()
model = LogisticHazard(net, tt.optim.Adam(0.01), duration_index=labtrans.cuts, device=torch.device('cuda:0'))
print(model.device)
Output:
device(type='cuda', index=0)
Run:
callbacks = [tt.cb.EarlyStopping(patience=5)]
epochs = 50
verbose = True
log = model.fit_dataloader(dl_train, epochs, callbacks, verbose, val_dataloader=dl_test)
Log:
0: [2m:51s / 2m:51s], train_loss: 2.6627, val_loss: 189446.1562
1: [2m:54s / 5m:46s], train_loss: 1.7051, val_loss: 117550.9688
.......
command: nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:4E:00.0 Off | 0 |
| N/A 28C P0 60W / 350W | 5209MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
You can see from the above that each epoch takes more than two minutes.
I found out what I was doing wrong: the transforms were applied inside __getitem__ for every item on every epoch, so CPU-side data preparation was the bottleneck and the GPU stayed at 0% utilization. Old code:
class getDataset(Dataset):
    def __init__(self, datasets):
        self.dataset = datasets
        self._trans = transforms.Compose(....)
    def __getitem__(self, index):
        return self._trans(self.dataset[index])
......
dataset_train = getDataset(....)
dl_train = DataLoader(dataset_train, ...)
new code:
trans = transforms.Compose(....)
data = trans(......)
class getDataset(Dataset):
    def __init__(self, datasets):
        self.dataset = datasets
    def __getitem__(self, index):
        return self.dataset[index]
......
dataset_train = getDataset(data)
dl_train = DataLoader(dataset_train, ...)
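A hedged alternative, in case the transforms are needed per item at training time: keep them in __getitem__ but run them in parallel worker processes so the GPU stays fed (batch_size and num_workers below are placeholder values):
from torch.utils.data import DataLoader

# Alternative sketch: keep the transforms in __getitem__ but let several
# worker processes run them in parallel; batch_size/num_workers are
# placeholders to tune for your machine.
dl_train = DataLoader(dataset_train, batch_size=16, shuffle=True,
                      num_workers=4, pin_memory=True)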
Hi @havakv,
I'm now trying to train deepsurv with pycox. However, I noted that deepsurv is working with the CPU and not with the GPU, even if I set the parameter device = None. I would like to know how I should make pycox work with the GPU. Maybe there is a way to load the input data and model directly to the GPU, like model.to(device)? I work with WIN10, Python 3.8.12, Jupyter notebook, torch+cuda11.3, and an NVIDIA RTX 3060, and the GPU works well when I use torch directly (see below). My code is also attached, and I would appreciate any suggestion you can send me. Best, Wenyi Jin
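(For reference, following the LogisticHazard example earlier in this thread, passing the device explicitly would look roughly like the sketch below; that CoxPH, the DeepSurv model in pycox, accepts the same device argument is an assumption here.)
import torch
import torchtuples as tt
from pycox.models import CoxPH

# Sketch only: pass the device explicitly, mirroring the LogisticHazard
# example above. The network is a placeholder; assumes CoxPH takes the
# same device argument as LogisticHazard.
net = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = CoxPH(net, tt.optim.Adam(0.01), device=device)
print(model.device)   # expected: device(type='cuda', index=0) when a GPU is available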
My code is like this (I use optuna for hyperparameter tuning):