havakv / pycox

Survival analysis with PyTorch
BSD 2-Clause "Simplified" License

Could not use GPU for pycox #130

Open Jwenyi opened 2 years ago

Jwenyi commented 2 years ago

Hi @havakv
I'm now trying to train DeepSurv with pycox. However, I noticed that DeepSurv runs on the CPU and not on the GPU, even though I set the parameter device=None. I would like to know how I can make pycox work with the GPU. Maybe there is a way to load the input data and the model directly onto the GPU, like model.to(device)? I work on Windows 10 with Python 3.8.12, a Jupyter notebook, torch with CUDA 11.3, and an NVIDIA RTX 3060, and the GPU works fine when I use torch directly (see below). My code is also attached, and I would appreciate any suggestions you can send me.

Best, Wenyi Jin

print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device_count())
display(torch.rand(5,3))

True
0
1
tensor([[0.6169, 0.2233, 0.2867],
        [0.2883, 0.3808, 0.3523],
        [0.6437, 0.2000, 0.9605],
        [0.5560, 0.2635, 0.1194],
        [0.3965, 0.8717, 0.4040]])

My code looks like this (I use Optuna for hyperparameter tuning):

net = tt.practical.MLPVanilla(
            in_features=x_train.shape[1],
            num_nodes=self.study.best_params['num_nodes'],
            out_features=1,
            batch_norm=self.__deepsurv_params['batch_norm'],
            dropout=self.study.best_params['dropout'],
            activation=act_fun_,
            w_init_=initializer_,
            output_bias=self.__deepsurv_params['output_bias'])

model = CoxPH(
            net=net,
            optimizer=tt.optim.Adam(
                lr=self.study.best_params['learning_rate'], weight_decay=self.study.best_params['l2']),
            device=None)

callbacks = [tt.callbacks.EarlyStopping()]

log = model.fit(
            input=x_train,
            target=y_train,
            batch_size=self.study.best_params['batch_size'],
            epochs=self.__deepsurv_params['epochs'],
            callbacks=callbacks,
            verbose=False)
havakv commented 2 years ago

Hi, I'm not really sure why it's not working. The code for setting a device is here, so you could try to set it explicitly with model.set_device(torch.device("cuda:0"))? Or maybe just call model.device to see which device it's choosing?

The way the code works is by calling compute_metric on each batch of data in the training loop. As you can see from the compute_metric code, this calls self.to_device on both the input and the target, moving them to the GPU. The function set_device is responsible for moving the network parameters to the GPU.

I've only tested this on linux machines, but we should probably make sure it also works on windows.

Don't know if any of this helps you debug.
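
A minimal sketch of those debugging steps, assuming model is the CoxPH instance and net the network from the code above (the variable names are just illustrative):

import torch

# Which device does pycox/torchtuples think it is using?
print(model.device)

# Set the device explicitly to the first CUDA GPU.
model.set_device(torch.device("cuda:0"))

# The network parameters should now report the CUDA device.
print(next(net.parameters()).device)  # expected: cuda:0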

mahootiha-maryam commented 2 years ago

Hi Havard, I made a model with pycox as you described before. After making the model I used set_device to move it to the GPU. My CUDA is available, but during training it doesn't use the GPU. This is the code after making the model:

model = LogisticHazard(net, tt.optim.Adam(0.01), duration_index=labtrans.cuts)
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device_count())
model.set_device(torch.device("cuda:0"))
callbacks = [tt.cb.EarlyStopping(patience=10)]
epochs = 500
verbose = True
log = model.fit_dataloader(dl_train, epochs, callbacks, verbose, val_dataloader=dl_val)

I wanted to ask: to send the model to the GPU, do we just need to use model.set_device(torch.device("cuda:0"))? I use Ubuntu as the OS, Python 3.9, and an RTX 3090 as the GPU.

havakv commented 2 years ago

You shouldn't really need to do anything to use the GPU; my suggestions above were just for debugging. Can you provide the output of your print statements? Also, you should be able to check which device the parameters in the net are on.

When you say that training doesn't use the GPU, how do you check that?
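
One possible way to check both, as a rough sketch (assuming net is the network passed to LogisticHazard):

import torch

# Devices that the network parameters currently live on;
# after model.set_device(...) this should contain cuda:0.
print({p.device for p in net.parameters()})

# GPU utilization can be watched from a shell while training runs, e.g.:
#   watch -n 1 nvidia-smi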

Minxiangliu commented 1 year ago

Hi @havakv, in my case I can see that I have set up the GPU correctly, and during model training the GPU memory is allocated, but the GPU utilization is always 0%. Can you help me clarify the problem? My environment is Ubuntu.

Code:

net = DenseNet121(spatial_dims=3, in_channels=2, out_channels=labtrans.out_features).cuda()
model = LogisticHazard(net, tt.optim.Adam(0.01), duration_index=labtrans.cuts, device=torch.device('cuda:0'))
print(model.device)

Output: device(type='cuda', index=0)

Run:

callbacks = [tt.cb.EarlyStopping(patience=5)]
epochs = 50
verbose = True
log = model.fit_dataloader(dl_train, epochs, callbacks, verbose, val_dataloader=dl_test)

Log:

0:  [2m:51s / 2m:51s],      train_loss: 2.6627, val_loss: 189446.1562
1:  [2m:54s / 5m:46s],      train_loss: 1.7051, val_loss: 117550.9688
                                                          .......

command: nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   28C    P0    60W / 350W |   5209MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

You can see from the above that each epoch takes more than two minutes.

Minxiangliu commented 1 year ago

I figured out what I was doing wrong. Old code:

class getDataset(Dataset):
    def __init__(self, datasets):
        self.dataset = datasets
        self._trans = transforms.Compose(....)

    def __getitem__(self, index):
        return self._trans(self.dataset[index])
    ......

dataset_train = getDataset(....)
dl_train = DataLoader(dataset_train, ...)

New code:

trans = transforms.Compose(....)
data = trans(......)

class getDataset(Dataset):
    def __init__(self, datasets):
        self.dataset = datasets

    def __getitem__(self, index):
        return self.dataset[index]
    ......

dataset_train = getDataset(data)
dl_train = DataLoader(dataset_train, ...)
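
The difference is that the old __getitem__ ran the (CPU-bound) transforms for every sample inside the training loop, so the GPU spent most of each epoch waiting for data; precomputing the transforms removes that bottleneck. An alternative sketch, if precomputing everything does not fit in memory, is to keep the transforms in the Dataset but let the DataLoader prepare batches in parallel worker processes (num_workers and pin_memory are standard torch.utils.data.DataLoader arguments; the concrete values are only illustrative):

from torch.utils.data import DataLoader

# Transforms stay in the Dataset, but several CPU worker processes
# prepare batches in the background so the GPU is not starved.
dl_train = DataLoader(
    dataset_train,
    batch_size=8,      # illustrative value
    shuffle=True,
    num_workers=4,     # run the transforms in parallel CPU workers
    pin_memory=True,   # faster host-to-GPU copies
)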