ligoudaner377 / font_translator_gan


AssertionError: invalid device id #5

Closed manhvela closed 2 years ago

manhvela commented 2 years ago

Hello, amazing job there!

I'm fairly new to deep learning. I tried to run your code but I get this error:

----------------- Options ---------------
batch_size: 256
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
dataroot: ./datasets/font [default: None]
dataset_mode: font
direction: english2chinese
dis_2: True
display_env: main
display_freq: 51200
display_id: 1
display_ncols: 10
display_port: 8097
display_server: http://localhost
display_winsize: 64
epoch: latest
epoch_count: 1
gan_mode: hinge
gpu_ids: 0,1
init_gain: 0.02
init_type: normal
isTrain: True [default: None]
lambda_L1: 100.0
lambda_content: 1.0
lambda_style: 1.0
load_iter: 0 [default: 0]
load_size: 64
lr: 0.0002
lr_decay_iters: 50
lr_policy: linear
max_dataset_size: inf
model: font_translator_gan
n_epochs: 10
n_epochs_decay: 10
n_layers_D: 3
name: test_new_dataset [default: experiment_name]
ndf: 64
netD: basic_64
netG: FTGAN_MLAN
ngf: 64
no_dropout: True [default: False]
no_html: False
norm: batch
num_threads: 4
phase: train
pool_size: 0
print_freq: 51200
save_by_iter: False
save_epoch_freq: 5
save_latest_freq: 5000000
style_channel: 6
suffix:
update_html_freq: 51200
use_spectral_norm: True
verbose: False
----------------- End -------------------
dataset [FontDataset] was created
The number of training images = 753637
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    model = create_model(opt) # create a model given opt.model and other options
  File "/home/dufra/Desktop/gan/font_translator_gan/models/__init__.py", line 65, in create_model
    instance = model(opt)
  File "/home/dufra/Desktop/gan/font_translator_gan/models/font_translator_gan_model.py", line 42, in __init__
    self.netG = networks.define_G(self.style_channel+1, 1, opt.ngf, opt.netG, opt.norm,
  File "/home/dufra/Desktop/gan/font_translator_gan/models/networks.py", line 183, in define_G
    return init_net(net, init_type, init_gain, gpu_ids)
  File "/home/dufra/Desktop/gan/font_translator_gan/models/networks.py", line 124, in init_net
    net = torch.nn.DataParallel(net, gpu_ids) # multi-GPUs
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 142, in __init__
    _check_balance(self.device_ids)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 23, in _check_balance
    dev_props = _get_devices_properties(device_ids)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/_utils.py", line 464, in _get_devices_properties
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/_utils.py", line 464, in <listcomp>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/_utils.py", line 447, in _get_device_attr
    return get_member(torch.cuda)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/_utils.py", line 464, in <lambda>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 359, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

Could you help me?

Thanks

ligoudaner377 commented 2 years ago

Hi @manhvela, thanks for your interest in the project. It seems that the current config is not compatible with your device. How many GPUs do you have? The default setting uses 2 GPUs (gpu_ids: 0,1). If you have only one GPU, setting gpu_ids to 0 may fix your problem. https://github.com/ligoudaner377/font_translator_gan/blob/9e1aaf03b3edbacee0023607f60cc4b2a155cc8b/options/base_options.py#L24
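
As a rough illustration of that check (not code from this repository), the sketch below filters a requested gpu_ids string against the number of CUDA devices PyTorch can see. The helper names parse_gpu_ids, usable_gpu_ids and detected_device_count are hypothetical; the only library call used is torch.cuda.device_count(), and the sketch assumes negative ids mean "no GPU" and drops them.

```python
def parse_gpu_ids(spec):
    """Turn a string such as "0,1" into a list of non-negative ints.

    Hypothetical helper: negative ids are assumed to mean "no GPU" and are dropped.
    """
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if part and int(part) >= 0:
            ids.append(int(part))
    return ids


def usable_gpu_ids(spec, device_count):
    """Keep only the requested ids that exist on a machine with device_count GPUs."""
    return [i for i in parse_gpu_ids(spec) if i < device_count]


def detected_device_count():
    """Ask PyTorch how many CUDA devices are visible; report 0 if torch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


if __name__ == "__main__":
    count = detected_device_count()
    print("visible CUDA devices:", count)
    print("usable ids from '0,1':", usable_gpu_ids("0,1", count))
```

On a single-GPU machine the default "0,1" would reduce to [0], which is consistent with the "Invalid device id" assertion being raised for id 1.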

manhvela commented 2 years ago

@ligoudaner377 thank you for the fast reply! I have one GPU (Nvidia GTX 1060). I tried this but a new error came up, something to do with my GPU I guess:


Setting up a new session...
create web directory ./checkpoints/test_new_dataset/web...
Traceback (most recent call last):
  File "train.py", line 33, in <module>
    model.optimize_parameters() # calculate loss functions, get gradients, update network weights
  File "/home/dufra/Desktop/gan/font_translator_gan/models/font_translator_gan_model.py", line 129, in optimize_parameters
    self.forward() # compute fake images: G(A)
  File "/home/dufra/Desktop/gan/font_translator_gan/models/font_translator_gan_model.py", line 80, in forward
    self.generated_images = self.netG((self.content_images, self.style_images))
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufra/Desktop/gan/font_translator_gan/models/networks.py", line 952, in forward
    style_features = self.style_encoder(style_images.view(-1, 1, 64, 64))
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufra/Desktop/gan/font_translator_gan/models/networks.py", line 799, in forward
    return self.model(inp)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/home/dufra/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 2282, in batch_norm
    return torch.batch_norm(
RuntimeError: CUDA out of memory. Tried to allocate 1.50 GiB (GPU 0; 5.93 GiB total capacity; 2.57 GiB already allocated; 915.94 MiB free; 3.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

btw I tried to run it in Colab GPU runtime and I get the same error

ligoudaner377 commented 2 years ago

@manhvela Maybe try a smaller batch size? A large batch size can exhaust GPU memory. Change this config to adjust it, starting with a small number (e.g., 16 or 32): https://github.com/ligoudaner377/font_translator_gan/blob/9e1aaf03b3edbacee0023607f60cc4b2a155cc8b/models/font_translator_gan_model.py#L12
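
To see why the batch size matters here, a back-of-the-envelope estimate may help. The sketch below is not taken from the repository: it assumes the first style-encoder layer outputs 64 feature maps (ngf: 64) at the full 64x64 resolution in float32, and that the style images are reshaped to batch_size * style_channel single-channel inputs, as the view(-1, 1, 64, 64) call in the traceback suggests. The function name activation_bytes is hypothetical.

```python
MIB = 1024 ** 2


def activation_bytes(batch_size, style_channel=6, channels=64,
                     height=64, width=64, bytes_per_elem=4):
    """Size of one activation tensor of shape
    (batch_size * style_channel, channels, height, width).

    All defaults are assumptions read off the printed options
    (style_channel: 6, ngf: 64, load_size: 64) and float32 storage.
    """
    return batch_size * style_channel * channels * height * width * bytes_per_elem


if __name__ == "__main__":
    for batch_size in (256, 32, 16):
        size_mib = activation_bytes(batch_size) // MIB
        print(f"batch_size={batch_size}: about {size_mib} MiB for this one tensor")
```

Under these assumptions a batch of 256 needs 1536 MiB (1.50 GiB) for that single tensor, which matches the failed allocation in the error message, while batch sizes of 32 and 16 need 192 MiB and 96 MiB. The estimate covers one tensor only; total training memory is larger, but it scales with batch size in the same way.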

manhvela commented 2 years ago

@ligoudaner377 it's working, thank you very much!