Closed. ak9250 closed this issue 5 years ago.
Tried it on a V100 instead of a K80 and it seems to be training now.
Do I stop it manually with an interrupt, and how do I run inference once I have the trained models? So far I have actor.pkl, critic.pkl, and wgan.pkl.
@hzwer is there currently a way to test a trained model, or does that need to be added?
I committed code for testing on a single image. If you have a trained model, see the README for how to run the test.
Note that the model is not saved continuously; it is saved only when validation is performed.
@hzwer thanks for the test code. I am still testing; it seems it does not recognize the K80 GPU on Google Colab? It gives this error:
Traceback (most recent call last):
File "baseline/test.py", line 42, in
@hzwer I keep getting this error when trying to run test.py on a P100:
Traceback (most recent call last):
File "baseline/test.py", line 10, in
@hzwer thanks for the update
1) Not sure why it's having trouble with the Google Colab notebook here, running on a K80:
https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806
2) Great, I wanted to reproduce the painting process, so a sequence of generated images would work as well.
Also, now I am getting this error running on a P100 in Kaggle:
torch.Size([1, 3, 128, 128]) torch.Size([1, 3, 128, 128])
Traceback (most recent call last):
File "baseline/test.py", line 59, in
Thank you for motivating me to finish these tasks; I had been procrastinating.
1) This line in the Colab notebook installs PyTorch 0.4.1:
!pip install https://download.pytorch.org/whl/cu80/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
2) Thanks, I will try it now in Colab and Kaggle. Also, if you can include a Colab notebook in the repo that covers all these steps, it would be much easier for anyone to reproduce the results in any environment. It will probably save time on future issues.
@hzwer just tested it in Kaggle and Google Colab and it is working now:
step 30, L2Loss = 0.047712087631225586
step 31, L2Loss = 0.04762433469295502
step 32, L2Loss = 0.04767460748553276
step 33, L2Loss = 0.04763370752334595
step 34, L2Loss = 0.047628842294216156
step 35, L2Loss = 0.047565679997205734
step 36, L2Loss = 0.04759366065263748
step 37, L2Loss = 0.04759349673986435
step 38, L2Loss = 0.047535862773656845
step 39, L2Loss = 0.04747484624385834
Test and generated images. Is there a way to get the painting process for the generated images at each step?
Why does it always interrupt itself after step 0?
I have not tried training in Google Colab with the new changes yet, but I was getting that error before, then switched to a P100 in Kaggle and it was fixed.
I will look into how to make such a video tomorrow. It is already late at night here. Good night! I think you can try this pretrained actor, although it's not good enough. https://drive.google.com/file/d/1d4LJrzZcvDsIpLOIuDKRinoDe5-vL139/view?usp=sharing
Thank you, I am downloading the CelebA dataset again to try training with the new fixes in Colab. Once I have a working Colab notebook for training I will share it, and I will also share the Kaggle notebook soon.
I think I have an idea for getting the GIF: just combine the images from each step. But I want to get 512x512; currently it is 128x128.
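For anyone wanting to try the GIF idea above, here is a minimal sketch. It assumes Pillow is installed and that you already have one image per painting step. The helper name `frames_to_gif` and the stand-in gray frames are made up for illustration and say nothing about how test.py actually writes its output. The optional `size` argument only upscales the finished frames; it does not add detail.

```python
import os
import tempfile

from PIL import Image


def frames_to_gif(frames, out_path, size=None, ms_per_frame=100):
    """Write a list of PIL images as an animated GIF (hypothetical helper)."""
    if not frames:
        raise ValueError("need at least one frame")
    if size is not None:
        # Plain upscaling of each frame, e.g. 128x128 -> 512x512.
        frames = [frame.resize(size) for frame in frames]
    first, rest = frames[0], frames[1:]
    # save_all + append_images is Pillow's way of writing a multi-frame GIF.
    first.save(
        out_path,
        save_all=True,
        append_images=rest,
        duration=ms_per_frame,
        loop=0,  # loop forever
    )
    return out_path


# Stand-in frames: four flat gray canvases instead of real painting steps.
demo_frames = [Image.new("RGB", (128, 128), (v, v, v)) for v in (0, 85, 170, 255)]
gif_path = frames_to_gif(
    demo_frames,
    os.path.join(tempfile.gettempdir(), "painting_demo.gif"),
    size=(512, 512),
)
print("wrote", gif_path)
```

With real outputs you would replace `demo_frames` with the images saved at each step, loaded in step order.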
Tried changing the width and related parameters in test.py from 128 to 512, but I still get this error:
File "baseline/test.py", line 58, in
Sorry, I found that using a higher resolution canvas does not improve image quality due to the bias between the neural renderer and the original renderer. If needed, we should use a higher resolution neural renderer for training.
OK, I just upscaled the output, so that is not much of an issue. Also, how do you get "restricting the transparency of strokes, we can get paintings with different stroke effects, such as ink painting and oil painting as shown in Figure 6"?
Also, at around 763 steps I am getting a CUDA out-of-memory error in Colab, although it shouldn't happen:
step 763, L2Loss = 0.009716738946735859
Traceback (most recent call last):
File "baseline/test.py", line 58, in
I can give you some different stroke renderers and trained agents in two days.
ok thanks!
@hzwer did you figure out why training keeps stopping itself in Colab?
loaded 120000 images
loaded 130000 images
loaded 140000 images
loaded 150000 images
loaded 160000 images
loaded 170000 images
loaded 180000 images
loaded 190000 images
loaded 200000 images
finish loading data, 197999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
^C
I still get the utils problem in Kaggle, so I am trying to figure that out.
I think it's a RAM issue; Colab has 12 GB of RAM. Once I changed the number of images loaded to 100000 in env.py, training got further and then stopped:
loaded 100000 images
finish loading data, 97999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
Got it working by changing the number of images loaded to 10000:
loaded 10000 images
finish loading data, 7999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
RAM is fine now. How long do you train for?
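A rough back-of-the-envelope check of the RAM theory, as a sketch only. It assumes every loaded image is kept in memory as a 128x128 RGB array with one byte per value. That is an assumption about env.py, not something confirmed in this thread, and the function name is made up for illustration.

```python
def dataset_bytes(num_images, width=128, height=128, channels=3, bytes_per_value=1):
    """Raw pixel bytes if every image is held in memory (assumed uint8 RGB layout)."""
    return num_images * width * height * channels * bytes_per_value


# The three image counts tried in this thread.
for n in (200_000, 100_000, 10_000):
    gib = dataset_bytes(n) / 2**30
    print(f"{n:>7} images: {gib:.2f} GiB of raw pixels")
```

Under that assumption, 200000 images need roughly 9.2 GiB for pixels alone, which leaves little headroom on a 12 GB Colab instance. This is only a lower bound: the Python process, PyTorch, and any temporary copies made while loading add to it, which may be why 100000 images also stopped.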
Why does the training take so much time? In the result you posted earlier, each step took only one-fifth of the time. My training speed is: interval_time: 1s, train_time: 5s per step. I need 40 hours to train an agent.
@hzwer possibly due to the smaller training and testing set of only 10000 vs 200000? Here is a working notebook for testing and training. Google Colab uses a K80 GPU; are you training on a faster GPU? https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806
The training set size does not affect the speed. In your second post, the V100 was 5 times faster than the K80. I use a Titan Xp. May I add your notebook to the README?
OK, I had some credits on GCP, so I used a V100 for the first training run. I am working on a Kaggle version, which uses a P100, and will post it once it is working.
Still getting the utils.util module not found error on Kaggle for train.py; I was wondering if you could take a look and see what could be wrong. It is a lot faster than Google Colab, though, and unzipping CelebA took probably less than a minute. https://www.kaggle.com/ahsenk/learningtopainting
@hzwer yes, feel free to add the Google Colab notebook to the README or to the repo as a Jupyter notebook. Go to File -> Save a copy in Drive, then File -> Save a copy in GitHub, then choose this repo and commit. I haven't had a chance to fix the Kaggle notebook yet and am not sure what the error could be there.
@hzwer still getting an out-of-memory error on a V100 instance on GCP. I guess I will have to retrain with a much smaller batch size to get more than 5000 strokes?
step 1034, L2Loss = 0.004280742257833481
step 1035, L2Loss = 0.004276599269360304
step 1036, L2Loss = 0.004279495682567358
step 1037, L2Loss = 0.0042714583687484264
step 1038, L2Loss = 0.004271643236279488
Traceback (most recent call last):
File "baseline/test.py", line 58, in
Oh, a step actually contains 5 strokes, and you can wrap the forward pass in 'with torch.no_grad()' to save GPU memory.
By the way, if the number of steps differs between testing and training, you will not get the best performance.
@hzwer OK, so 40 steps is 200 strokes. How exactly do you add with no_grad? Also, I tried 40 steps and the test image looks like this, and with 763 steps (before the memory error stops it) it looks like this. I think 763 steps, or 3815 strokes, looks better quality-wise.
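The step and stroke figures quoted here follow from the 5 strokes per step mentioned above. A tiny sketch of the arithmetic (the helper names are made up for illustration):

```python
STROKES_PER_STEP = 5  # as stated by the maintainer in this thread


def strokes_for(steps, strokes_per_step=STROKES_PER_STEP):
    """Total strokes painted after a given number of steps."""
    return steps * strokes_per_step


def steps_for(strokes, strokes_per_step=STROKES_PER_STEP):
    """Smallest number of steps that paints at least `strokes` strokes."""
    return -(-strokes // strokes_per_step)  # ceiling division


print(strokes_for(40))   # the 40-step run
print(strokes_for(763))  # the run that hit the memory error
print(steps_for(5000))   # steps needed for the 5000-stroke target
```

So the 5000-stroke target mentioned earlier corresponds to 1000 steps, not 5000.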
I added 'with torch.no_grad()' to the test script; you can give it a try. I trained the new models with the wrong parameters 2 days ago, so I need another 2 days to provide you with some newly trained actors.
@hzwer tried it again; it still stops at 763 steps with RuntimeError: CUDA error: out of memory. And thanks.
I tried it in my GPU environment. I ran out of GPU memory without torch.no_grad, but it was OK after adding it. In theory, since the graph of each step is freed right away, it should not be possible to exceed the GPU memory.
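A minimal sketch of why torch.no_grad matters in a long inference loop. The `actor` here is a tiny made-up stand-in module, not the real actor from this repo, and `paint` is a hypothetical loop, not the code in test.py. When each step's output feeds the next step with gradient tracking on, the result keeps a reference back through the graph of every earlier step, so memory can grow with the step count.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained actor; the real model is much larger.
actor = nn.Sequential(nn.Linear(16, 16), nn.Tanh())
actor.eval()


def paint(canvas, steps, track_grad):
    """Feed the canvas back into the actor for a number of steps."""
    for _ in range(steps):
        if track_grad:
            canvas = canvas + actor(canvas)  # autograd records every step
        else:
            with torch.no_grad():  # nothing is recorded for backward
                canvas = canvas + actor(canvas)
    return canvas


start = torch.zeros(1, 16)
with_graph = paint(start, steps=50, track_grad=True)
without_graph = paint(start, steps=50, track_grad=False)

# The tracked result still references the autograd graph of all 50 steps.
print(with_graph.requires_grad, with_graph.grad_fn is not None)
# The no_grad result is a plain tensor with no graph attached.
print(without_graph.requires_grad, without_graph.grad_fn)
```

Both loops compute the same values; only the bookkeeping differs. For a test script the simplest form is to put the whole step loop inside a single `with torch.no_grad():` block.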
@hzwer does it stop for you at any step, like are you able to get to 3000?
@hzwer stops at 1074 steps on a p100 in kaggle
Maybe you are keeping other variables on the GPU?
see the colab notebook, is there anything wrong? https://colab.research.google.com/drive/1RGYSfsK1yvcH-hSRVEF4HImgOj0VM806
"with torch.no_grad():" is missing?
Ah, I overwrote it when writing out test.py again and forgot to add that. Thanks.
@hzwer are you planning to add the Google Colab notebook to the repo? I think it's easier to set up and run everything from there, including any changes in the repo.
You can make a pull request, or I will add it tomorrow. Thank you.
When running train.py with CelebA, it automatically gets interrupted:
loaded 200000 images
finish loading data, 197999 training images, 2001 testing images
observation_space (96, 128, 128, 7) action_space 13
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:1332: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
/content/LearningToPaint/baseline/DRL/ddpg.py:158: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
s0 = torch.tensor(self.state, device='cpu')
/content/LearningToPaint/baseline/DRL/ddpg.py:161: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
s1 = torch.tensor(state, device='cpu')
^C
Also, I am on GPU.