hzwer / ICCV2019-LearningToPaint

ICCV2019 - Learning to Paint With Model-based Deep Reinforcement Learning
MIT License

training #4

Closed — ak9250 closed 5 years ago

ak9250 commented 5 years ago

When running train.py with CelebA, training automatically gets interrupted:

```
loaded 200000 images
finish loading data, 197999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:1332: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/content/LearningToPaint/baseline/DRL/ddpg.py:158: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  s0 = torch.tensor(self.state, device='cpu')
/content/LearningToPaint/baseline/DRL/ddpg.py:161: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  s1 = torch.tensor(state, device='cpu')
^C
```

Also, I am running on a GPU.
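As an aside, the two UserWarnings come from copy-constructing existing tensors in ddpg.py and are unrelated to the ^C interruption. A tiny standalone sketch of the replacement the warning itself suggests (the state tensor here is a made-up stand-in):

```python
import torch

state = torch.randn(4, 9, 128, 128)  # stand-in for a batch of environment states

# This pattern triggers the UserWarning (copy-constructing from an existing tensor)
s0_warned = torch.tensor(state, device='cpu')

# Equivalent copy without the warning, as the message itself suggests
s0 = state.clone().detach().to('cpu')
```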

ak9250 commented 5 years ago

Tried it on a V100 instead of a K80, and it seems to be training now:

```
64: steps:2600 interval_time:0.91 train_time:3.61
65: steps:2640 interval_time:0.93 train_time:3.62
66: steps:2680 interval_time:0.88 train_time:3.67
67: steps:2720 interval_time:0.89 train_time:3.61
68: steps:2760 interval_time:0.87 train_time:3.61
```

Do I stop it manually with an interrupt, and how do I run inference once I have the trained models? I now have actor.pkl, critic.pkl, and wgan.pkl.

ak9250 commented 5 years ago

@hzwer is there currently a way to test a trained model, or does that need to be added?

hzwer commented 5 years ago

I submitted code for a single-image test. If you have a trained model, you can follow the README to run the test.

hzwer commented 5 years ago

Note that the model is not saved continuously; it is only saved when validation is performed.

ak9250 commented 5 years ago

@hzwer thanks for the test code. I am still testing, but it seems it does not recognize the K80 GPU on Google Colab? It gives this error:

```
Traceback (most recent call last):
  File "baseline/test.py", line 42, in <module>
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
RuntimeError: Expected object of backend CUDA but got backend CPU for sequence element 2 in sequence argument at position #1 'tensors'
```

Or it might actually be a PyTorch issue; I will try it on a P100 now. Also, is there a way to generate the painting process, either as a sequence of images or a GIF/MP4?
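For what it's worth, the backend error above usually means one of the tensors passed to torch.cat still lives on the CPU while the others are on the GPU; moving every input onto the same device first avoids it. A minimal sketch under that assumption (tensor names mirror test.py, but the shapes are stand-ins):

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in inputs; suppose stepnum was accidentally created on the CPU
canvas = torch.zeros(1, 3, 128, 128, device=device)
img = torch.zeros(1, 3, 128, 128, device=device)
stepnum = torch.zeros(1, 1, 128, 128)
coord = torch.zeros(1, 2, 128, 128, device=device)

# Moving the stray tensor onto the same device avoids the
# "Expected object of backend CUDA but got backend CPU" error
x = torch.cat([canvas, img, stepnum.to(device), coord], 1)
```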

ak9250 commented 5 years ago

@hzwer I keep getting this error when trying to run test.py on a P100:

```
Traceback (most recent call last):
  File "baseline/test.py", line 10, in <module>
    from DRL.ddpg import decode
  File "/kaggle/working/LearningToPaint/baseline/DRL/ddpg.py", line 10, in <module>
    from DRL.wgan import *
  File "/kaggle/working/LearningToPaint/baseline/DRL/wgan.py", line 10, in <module>
    from utils.util import *
ModuleNotFoundError: No module named 'utils.util'
```
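In case it helps while debugging, `utils.util` is imported relative to the `baseline/` directory, so that directory has to be on the module search path. A hedged workaround sketch (the Kaggle path is an assumption; point it at your actual checkout):

```python
import sys

# Assumed location of the repo on Kaggle; adjust to your own checkout
sys.path.insert(0, '/kaggle/working/LearningToPaint/baseline')

# With baseline/ on sys.path, this import should now resolve
from utils.util import *
```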

hzwer commented 5 years ago
  1. I can run this code on my Mac without a GPU, and I don't know how to deal with the first error. As for a GIF or MP4, I will think about how to make a better demo tomorrow.
  2. Although I don't know why you can't import util.py, I've adjusted the structure of the code; I wish you success.

ak9250 commented 5 years ago

@hzwer thanks for the update.
1) Not sure why it's having trouble in the Google Colab notebook here, running on a K80: https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806
2) Great, I wanted to reproduce the painting process, so a sequence of generated images would work as well. Also, I am now getting this error running on a P100 in Kaggle:

```
torch.Size([1, 3, 128, 128])
torch.Size([1, 3, 128, 128])
Traceback (most recent call last):
  File "baseline/test.py", line 59, in <module>
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
RuntimeError: Expected a Tensor of type torch.cuda.FloatTensor but found a type torch.FloatTensor for sequence element 2 in sequence argument at position #1 'tensors'
```

hzwer commented 5 years ago
  1. I will try to solve it; the dataset is downloading now.
  2. I forgot to test it in a GPU environment, so there were some type mistakes. I have fixed them.

hzwer commented 5 years ago

Thank you for inspiring me to complete these tasks; I had been procrastinating.

ak9250 commented 5 years ago

1) This line in the Colab notebook is installing PyTorch 0.4.1: `!pip install https://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-linux_x86_64.whl`
2) Thanks, I will try it now in Colab and Kaggle. Also, if you could include a Colab notebook in the repo that covers all these steps, it would be much easier for anyone to reproduce the results in any environment, and it would probably save time on future issues.

ak9250 commented 5 years ago

@hzwer just tested it in Kaggle and Google Colab, and it is working now:

```
step 30, L2Loss = 0.047712087631225586
step 31, L2Loss = 0.04762433469295502
step 32, L2Loss = 0.04767460748553276
step 33, L2Loss = 0.04763370752334595
step 34, L2Loss = 0.047628842294216156
step 35, L2Loss = 0.047565679997205734
step 36, L2Loss = 0.04759366065263748
step 37, L2Loss = 0.04759349673986435
step 38, L2Loss = 0.047535862773656845
step 39, L2Loss = 0.04747484624385834
```

(attached images: test5, generated)

These are the test and generated images. Is there a way to get the painting process for the generated images, i.e. the canvas at each step?
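One low-tech way to get the painting process without changes to the repo: dump the canvas to disk after every step and stitch the frames into a GIF afterwards. A rough sketch, assuming the frames were already saved as numbered PNGs in a hypothetical `output/` folder (imageio is an extra dependency):

```python
import glob
import imageio

# Hypothetical frames, e.g. written per step with cv2.imwrite(f'output/{step:03d}.png', canvas)
frame_files = sorted(glob.glob('output/*.png'))

# Read the frames in order and combine them into an animated GIF
frames = [imageio.imread(f) for f in frame_files]
imageio.mimsave('painting_process.gif', frames, duration=0.1)  # 0.1 s per frame
```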

hzwer commented 5 years ago

(screenshot) Why does it always interrupt itself after step 0?

ak9250 commented 5 years ago

I have not tried training in Google Colab with the new changes yet, but I was getting that error before; then I switched to a P100 in Kaggle and it was fixed.

hzwer commented 5 years ago

I will look into how to make such a video tomorrow. It is already late at night here; good night! I think you can try this pretrained actor, although it's not good enough yet: https://drive.google.com/file/d/1d4LJrzZcvDsIpLOIuDKRinoDe5-vL139/view?usp=sharing

ak9250 commented 5 years ago

Thank you. I am downloading the CelebA dataset again to try training with the new fixes in Colab. Once I have a working Colab notebook for training I will share it, and I will also share the Kaggle notebook soon.

ak9250 commented 5 years ago

I think I have an idea for the GIF: just combine the images saved at each step. But I want 512x512 output; currently it is 128x128. I tried changing the width parameters in test.py from 128 to 512, but I still get this error:

```
  File "baseline/test.py", line 58, in <module>
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/LearningToPaint/baseline/DRL/actor.py", line 116, in forward
    x = self.fc(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1024, in linear
    return torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [1 x 8192], m2: [512 x 65] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:249
```

I got a result by doing some post-processing elsewhere; the GIF looks really cool, and the CelebA model works on paintings of human faces as well. I think it could be even better if the strokes were smaller, and if it started on a white instead of a black background.

hzwer commented 5 years ago

Sorry, I found that using a higher resolution canvas does not improve image quality due to the bias between the neural renderer and the original renderer. If needed, we should use a higher resolution neural renderer for training.

ak9250 commented 5 years ago

OK, I just upscaled the output, so that is not much of an issue. Also, how do you achieve what the paper describes: "restricting the transparency of strokes, we can get paintings with different stroke effects, such as ink painting and oil painting as shown in Figure 6"?
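For reference, the upscaling mentioned above can be plain post-processing on the saved 128x128 result, e.g. with Pillow (file names are placeholders); it only resamples pixels and does not add real detail:

```python
from PIL import Image

img = Image.open('generated.png')              # 128x128 output from test.py
img = img.resize((512, 512), Image.BICUBIC)    # simple bicubic upscale
img.save('generated_512.png')
```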

ak9250 commented 5 years ago

Also, around step 763 I am getting a CUDA out-of-memory error in Colab, although it shouldn't run out:

```
step 763, L2Loss = 0.009716738946735859
Traceback (most recent call last):
  File "baseline/test.py", line 58, in <module>
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/LearningToPaint/baseline/DRL/actor.py", line 110, in forward
    x = self.layer1(x)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/LearningToPaint/baseline/DRL/actor.py", line 49, in forward
    out = F.relu(self.bn1(self.conv1(x)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/batchnorm.py", line 66, in forward
    exponential_average_factor, self.eps)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1254, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA error: out of memory
```

hzwer commented 5 years ago
  1. If you want different stroke effects, you should train some other renderers, or add the restrictions in ddpg.py -> decode() during DDPG training (a rough sketch of the idea is shown right after this list). I can upload training code for different renderers soon, but you would actually need to train different agents.
  2. Maybe their GPU memory is slightly smaller; you can set batchsize=80 to avoid running out of memory.
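To make that concrete, here is a rough, hypothetical sketch of what restricting stroke transparency could look like; which action indices hold the transparency values is an assumption here, so check decode() in DRL/ddpg.py for the real layout:

```python
import torch

def restrict_transparency(action, min_alpha=0.8):
    """Illustrative sketch only: clamp the stroke-transparency entries of the
    13-dimensional action so every stroke is nearly opaque (an oil-paint-like
    effect). The indices used below are an assumption; the real layout is
    defined by decode() in DRL/ddpg.py."""
    action = action.clone()
    action[:, 8:10] = action[:, 8:10].clamp(min=min_alpha)  # assumed transparency slots
    return action

# Example usage on a batch of random actions
actions = torch.rand(4, 13)
opaque_actions = restrict_transparency(actions)
```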
hzwer commented 5 years ago

I can give you some different stroke renderers and trained agents in two days.

ak9250 commented 5 years ago

ok thanks!

ak9250 commented 5 years ago

@hzwer did you figure out what the problem could be that makes training stop itself in Colab?

```
loaded 120000 images
loaded 130000 images
loaded 140000 images
loaded 150000 images
loaded 160000 images
loaded 170000 images
loaded 180000 images
loaded 190000 images
loaded 200000 images
finish loading data, 197999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
0: steps:40 interval_time:7.21 train_time:0.00
^C
```

I still get the utils problem in Kaggle, so I am trying to figure that out.

I think it's a RAM issue; Colab has 12 GB of RAM. Once I changed the number of loaded images to 100000 in env.py, training got further and then stopped:

```
loaded 100000 images
finish loading data, 97999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
0: steps:40 interval_time:6.48 train_time:0.00
1: steps:80 interval_time:5.27 train_time:0.00
2: steps:120 interval_time:5.23 train_time:0.00
3: steps:160 interval_time:5.29 train_time:0.00
4: steps:200 interval_time:5.27 train_time:0.00
5: steps:240 interval_time:5.32 train_time:0.00
```

Got it working by changing loading to 10000:

```
loaded 10000 images
finish loading data, 7999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
0: steps:40 interval_time:6.49 train_time:0.00
1: steps:80 interval_time:5.28 train_time:0.00
2: steps:120 interval_time:5.26 train_time:0.00
3: steps:160 interval_time:5.27 train_time:0.00
4: steps:200 interval_time:5.26 train_time:0.00
5: steps:240 interval_time:5.30 train_time:0.00
6: steps:280 interval_time:5.25 train_time:0.00
7: steps:320 interval_time:5.26 train_time:0.00
8: steps:360 interval_time:5.02 train_time:0.00
9: steps:400 interval_time:4.89 train_time:0.00
10: steps:440 interval_time:4.90 train_time:24.16
11: steps:480 interval_time:4.99 train_time:21.59
12: steps:520 interval_time:4.98 train_time:21.59
13: steps:560 interval_time:5.00 train_time:21.54
14: steps:600 interval_time:4.97 train_time:21.62
15: steps:640 interval_time:5.01 train_time:21.60
16: steps:680 interval_time:4.95 train_time:21.62
17: steps:720 interval_time:4.96 train_time:21.61
18: steps:760 interval_time:4.95 train_time:21.61
19: steps:800 interval_time:4.97 train_time:21.61
20: steps:840 interval_time:4.95 train_time:21.60
21: steps:880 interval_time:4.97 train_time:21.58
22: steps:920 interval_time:4.93 train_time:21.60
23: steps:960 interval_time:4.97 train_time:21.64
24: steps:1000 interval_time:4.93 train_time:21.62
25: steps:1040 interval_time:4.94 train_time:21.56
26: steps:1080 interval_time:4.95 train_time:21.61
```

RAM is fine now. How long do you train for?

hzwer commented 5 years ago

Why does the training take so much time? In the results you posted earlier, each step took only one-fifth of the time. My training speed is interval_time: 1 s, train_time: 5 s per step. I need 40 hours to train an agent.

ak9250 commented 5 years ago

@hzwer possibly due to the smaller training and testing set of only 10000 vs. 200000 images? Here is a working notebook for testing and training. Google Colab uses a K80 GPU; are you training on a faster GPU? https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806

hzwer commented 5 years ago

The training set size will not affect the speed. In your second post, the V100 is 5 times faster than the K80. I use a Titan Xp. May I add your notebook to the README?

ak9250 commented 5 years ago

OK, I had some credits on GCP, so I used a V100 to train at first. I am working on a Kaggle version which uses a P100; I will post it once it is working.

ak9250 commented 5 years ago

Still getting the utils.util module-not-found error on Kaggle for train.py; I was wondering if you could take a look and see what could be wrong. It is a lot faster than Google Colab, though, and unzipping CelebA took probably less than a minute: https://www.kaggle.com/ahsenk/learningtopainting

ak9250 commented 5 years ago

@hzwer yes, feel free to add the Google Colab notebook to the README or to the repo as a Jupyter notebook. To do that, save a copy in Drive, then File -> Save a copy in GitHub, then choose this repo and commit. I haven't had a chance to fix the Kaggle notebook yet and am not sure what the error could be there.

ak9250 commented 5 years ago

@hzwer still getting an out-of-memory error on a V100 instance on GCP. I guess I will have to retrain it with a much smaller batch size to get more than 5000 strokes?

```
step 1034, L2Loss = 0.004280742257833481
step 1035, L2Loss = 0.004276599269360304
step 1036, L2Loss = 0.004279495682567358
step 1037, L2Loss = 0.0042714583687484264
step 1038, L2Loss = 0.004271643236279488
Traceback (most recent call last):
  File "baseline/test.py", line 58, in <module>
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jupyter/LearningToPaint/baseline/DRL/actor.py", line 109, in forward
    x = F.relu(self.bn1(self.conv1(x)))
  File "/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 862, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 KiB (GPU 0; 15.75 GiB total capacity; 14.78 GiB already allocated; 2.94 MiB free; 4.51 MiB cached)
```

hzwer commented 5 years ago

Oh, a step actually contains 5 strokes, and you can wrap the forward pass in `with torch.no_grad()` to save GPU memory.

hzwer commented 5 years ago

By the way, if the step numbers used for testing and training are different, you will not get the best performance.

ak9250 commented 5 years ago

@hzwer OK, so 40 steps is 200 strokes. How exactly do you add `with no_grad`? Also, with 40 steps the test image looks like this (40steps image), and with 763 steps, before the memory error stops it, it looks like this (763steps image). I think 763 steps, i.e. 3815 strokes, looks better quality-wise.

hzwer commented 5 years ago

I added `with torch.no_grad()` to the test code; you can give it a try. I trained the new models with wrong parameters 2 days ago, so I need another 2 days to provide you with some new trained actors.
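For anyone following along, the change is roughly this (a self-contained sketch with stand-in tensors, not the exact diff to test.py):

```python
import torch
import torch.nn as nn

# Stand-ins for the actor network and its inputs; shapes are illustrative only
actor = nn.Conv2d(9, 13, kernel_size=3, padding=1)
canvas = torch.zeros(1, 3, 128, 128)
img = torch.zeros(1, 3, 128, 128)
stepnum = torch.zeros(1, 1, 128, 128)
coord = torch.zeros(1, 2, 128, 128)

# no_grad() stops autograd from keeping the graph of each forward pass,
# so per-step activations are freed immediately and GPU memory stays flat
with torch.no_grad():
    actions = actor(torch.cat([canvas, img, stepnum, coord], 1))
```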

ak9250 commented 5 years ago

@hzwer tried it again; it still stops at 763 steps with `RuntimeError: CUDA error: out of memory`. And thanks!

hzwer commented 5 years ago

(screenshot) I tried it in my GPU environment. I got a GPU out-of-memory error without torch.no_grad, but it was OK after adding this instruction. In theory, since the graph of each step is deleted immediately, it should not be possible to exceed GPU memory.

ak9250 commented 5 years ago

@hzwer does it stop for you at any step? For example, are you able to get to 3000? (screenshot)

ak9250 commented 5 years ago

@hzwer it stops at 1074 steps on a P100 in Kaggle. (screenshot)

hzwer commented 5 years ago

(screenshot)

Maybe you saved other variables on the GPU?

ak9250 commented 5 years ago

See the Colab notebook; is there anything wrong? https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806

hzwer commented 5 years ago

image "with torch.no_grad():" was missed?

ak9250 commented 5 years ago

Ah, I overwrote it when writing out test.py again and forgot to add that. Thanks!

ak9250 commented 5 years ago

@hzwer are you looking to add the Google Colab notebook to the repo? I think it's easier to set up and run everything from there, including any changes in the repo.

hzwer commented 5 years ago

You can make a pull request, or I will add it tomorrow. Thank you.