My training errors - Githubissues

JinMengKang commented 5 years ago

Thank you for your excellent work. I encountered two problems. First, when I was training on the Places dataset and training reached 1000 iterations, the model was not saved and training did not continue any further. (screenshot attached) Second, when I use my own dataset, image loading fails, even though my flist file is generated correctly and the image paths are correct. This error does not occur when I use the datasets from your paper. (screenshot attached) I hope you can help me.

knazeri commented 5 years ago

I have no idea why the training stops. I think you can debug the code here; this is where we save the models every SAVE_INTERVAL iterations: https://github.com/knazeri/edge-connect/blob/97c28c62ac54a59212cc9db4e78f36c5436c0b72/src/edge_connect.py#L205-L206
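For anyone debugging the same stall, here is a minimal sketch of an interval-based save check. It is not the repository's code: `should_save` and `saved_iterations` are hypothetical helpers, and only the `SAVE_INTERVAL` name comes from the comment above. Logging the iteration counter next to such a check shows whether the save branch is ever reached.

```python
SAVE_INTERVAL = 1000  # assumed value; read the real one from your config.yml


def should_save(iteration, save_interval):
    """Return True when a checkpoint is due at this iteration."""
    # A zero or missing interval disables saving entirely.
    return bool(save_interval) and iteration % save_interval == 0


def saved_iterations(max_iteration, save_interval):
    """List the iterations at which a save would be triggered."""
    return [i for i in range(1, max_iteration + 1) if should_save(i, save_interval)]


if __name__ == "__main__":
    print(saved_iterations(3000, SAVE_INTERVAL))
```

If the list is empty for your settings, the interval is probably unset or zero. If the branch is reached but no file appears, the save call itself is the next place to look.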

As for your second question, it has something to do with the training set path. If you are sure that the file list is correct, then make sure scipy's imread method is able to load your images here: https://github.com/knazeri/edge-connect/blob/97c28c62ac54a59212cc9db4e78f36c5436c0b72/src/dataset.py#L57
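A quick way to rule out path problems before looking at image decoding is to walk the file list and flag entries that are missing or empty. The sketch below uses only the standard library; `check_flist` is a hypothetical helper, and it assumes the file list holds one image path per line.

```python
import os


def check_flist(flist_path):
    """Return (line_number, path, reason) for every entry that looks unusable."""
    problems = []
    with open(flist_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            path = line.strip()  # also removes a stray "\r" from Windows line endings
            if not path:
                continue  # skip blank lines
            if not os.path.isfile(path):
                problems.append((number, path, "missing"))
            elif os.path.getsize(path) == 0:
                problems.append((number, path, "empty"))
    return problems
```

Entries that pass this check can still fail to decode, so a clean report only narrows the problem down to the image-loading step.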

To make sure your file list is correct, you can use the training folder path instead of the file list in your configuration. When the value is a directory path, we load all .png and .jpg images within that folder, as you can see here: https://github.com/knazeri/edge-connect/blob/97c28c62ac54a59212cc9db4e78f36c5436c0b72/src/dataset.py#L178-L179
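The directory fallback can be approximated as below, to see which files such a pattern would pick up. This is a sketch under assumptions: `list_images` is a hypothetical helper, and the search is assumed to be non-recursive and limited to the two lowercase extensions named above.

```python
import glob
import os


def list_images(folder):
    """Return sorted paths of the .jpg and .png files directly inside folder."""
    jpgs = glob.glob(os.path.join(folder, "*.jpg"))
    pngs = glob.glob(os.path.join(folder, "*.png"))
    return sorted(jpgs + pngs)
```

On case-sensitive file systems this pattern skips names such as `photo.JPG` or `photo.jpeg`, and it never descends into subfolders, so an unexpectedly short list is worth checking against the folder contents.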

wangping1408 commented 5 years ago

(screenshot attached) Thank you for your excellent work. I encountered some problems. This error occurred during training, although my image paths are correct. I am using the CelebA dataset: 202,599 images with an original size of 178×218. The image size in the configuration file is 256×256. Do the images need to be pre-processed before training? Also, training takes an edge_flist as input; how are the edges generated? I hope you can help me!

knazeri commented 5 years ago

@wangping1408 You don't need to do any pre-processing! All pre-processing steps, including resizing to the proper size, are implemented in our code! The edge_flist that you mentioned is for situations where you want to use an external edge generation scheme (such as another deep-learning based edge generator). Technically you don't need edge_flist unless you want to experiment with a different edge detector, in which case you need to save the corresponding edges of your entire training set separately!
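To make the resizing concrete for a 178×218 CelebA image, here is a sketch of one common approach: center-crop to the largest square, then scale to the configured size. Whether the repository crops before resizing is an assumption here, and `center_crop_box` and `scale_factor` are hypothetical helpers that only compute the geometry.

```python
def center_crop_box(height, width):
    """Return (top, left, side) of the largest centered square in an image."""
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    return top, left, side


def scale_factor(side, target):
    """Scale needed to bring the cropped square to target x target pixels."""
    return target / side


if __name__ == "__main__":
    # CelebA aligned images are 178 pixels wide and 218 pixels tall.
    top, left, side = center_crop_box(218, 178)
    print(top, left, side)  # 20 0 178
    print(scale_factor(side, 256))
```

Under this assumption, 20 rows are trimmed from the top and bottom, and the remaining 178×178 square is enlarged to 256×256.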

I'm not sure what error you are receiving here! It looks like the dataloader failed to load some of your images!

wangping1408 commented 5 years ago

Thank you for your help, I can now train the model. With batch_size set to 8 and model set to 3 on a single 1080 Ti GPU, training is too slow, so I want to train on multiple GPUs. What should I modify? I have two free GPUs numbered 0 and 1, so I changed GPU: [0] to GPU: [0,1] in the configuration file config.yml, but training then failed. (screenshot attached) The error says I am using an invalid GPU, but both GPUs are usable. Do you have any suggestions for improving my training speed?

knazeri commented 5 years ago

@wangping1408 That's exactly how you specify multi-GPU training! Can you make sure that you can train separately on GPU: [0] and then GPU: [1]? The error you are receiving is Invalid device id, thrown by PyTorch!

wangping1408 commented 5 years ago

Yes, I can train normally when I specify GPU 0 or GPU 1 on its own, but the problem appears when I use multiple GPUs.

knazeri commented 5 years ago

Can you print the value of config.GPU before this line? https://github.com/knazeri/edge-connect/blob/698509d1ac1d7a40310139f9e4d70410b3d734e4/src/models.py#L67

The error is thrown by the internal torch.cuda.get_device_properties method, which only raises it for an invalid GPU id!

wangping1408 commented 5 years ago

(screenshot attached) I made changes here and was able to print the value of config.GPU.

wangping1408 commented 5 years ago

I found something strange: no matter which two GPU ids I set for config.GPU in config.yml, the printout is always 0 and 1.

knazeri commented 5 years ago

@wangping1408 You are printing a list based on device_count(). Given that you have 2 GPUs on your system, it will always print [0, 1]. Instead, print out config.GPU!

wangping1408 commented 5 years ago

Oh, it is my fault. I printed config.GPU this time. However, it still shows invalid device id. (screenshot attached)

knazeri commented 5 years ago

@wangping1408 It is printing [0, 2] but you have only two GPUs on your machine, that means 2 is an invalid GPU id. Did you set [0, 2] in your config file?
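The check described here can be written as a small helper that compares the configured ids against the number of devices the process can see. `invalid_gpu_ids` is a hypothetical name; in practice the count would come from `torch.cuda.device_count()`, which is passed in as a plain integer below so the sketch runs without a GPU.

```python
def invalid_gpu_ids(requested, device_count):
    """Return the requested ids that fall outside 0 .. device_count - 1."""
    return [gpu for gpu in requested if not 0 <= gpu < device_count]


if __name__ == "__main__":
    # With two visible devices, only ids 0 and 1 are valid.
    print(invalid_gpu_ids([0, 2], 2))  # [2]
    print(invalid_gpu_ids([0, 1], 2))  # []
```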

wangping1408 commented 5 years ago

I have 4 GPUs on my machine, and I set config.GPU to [0, 2] in config.yml. These two GPUs are free and can be used. (screenshots attached)

knazeri commented 5 years ago

@wangping1408 Yeah, I see you have 4 GPUs, but I still don't know why device_count() was returning 2! Can you make sure the following Python snippet runs?

import torch
for i in range(4):
  print(torch.cuda.get_device_properties(i))
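One possible explanation for a device count of 2 on a four-GPU machine, not confirmed anywhere in this thread, is that the process runs with the CUDA_VISIBLE_DEVICES environment variable restricted to two devices. CUDA then renumbers the visible devices starting from 0, so physical GPU 2 becomes logical device 1 and id 2 no longer exists inside the process. The sketch below only models that renumbering; `logical_ids` is a hypothetical helper, not a PyTorch function, and it assumes the variable holds integer indices rather than device UUIDs.

```python
def logical_ids(cuda_visible_devices):
    """Map each physical GPU index in CUDA_VISIBLE_DEVICES to its in-process id."""
    physical = [int(part) for part in cuda_visible_devices.split(",") if part.strip()]
    return {phys: logical for logical, phys in enumerate(physical)}


if __name__ == "__main__":
    mapping = logical_ids("0,2")
    print(mapping)       # {0: 0, 2: 1}
    print(len(mapping))  # 2, the device count the process would see
```

If this is what is happening, printing `os.environ.get("CUDA_VISIBLE_DEVICES")` just before the failing call would show the restricted list.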

wangping1408 commented 5 years ago

I tested this code on my server and the result is as follows: (screenshot attached)

wangping1408 commented 5 years ago

I retrained with GPU: [1, 3] and batch_size: 2, and then got this error: all tensors must be on devices[0]. (screenshot attached)

knazeri commented 5 years ago

@wangping1408 This is really strange! I am able to train the model on 4 Titan V GPUs right now! What's stranger is that the first error is from PyTorch complaining that the GPU is invalid! I'll try to test it on a different server and see if I get a similar error message!

ZYYDJ commented 4 years ago

@JinMengKang Have you solved your first problem? I ran into the same problem. Could you tell me how to fix it? Thanks!

LeonCurry commented 3 years ago

@wangping1408 Hello, have you solved this problem? I also encountered this problem.

Ghost0405 commented 1 year ago

Hello, have you solved this problem? I also encountered this problem.

knazeri / edge-connect

My training errors #48