training stops after 4 ticks

skyflynil / stylegan2

StyleGAN2 - Official TensorFlow Implementation with practical improvements

http://arxiv.org/abs/1912.04958

training stops after 4 ticks #2

Open b4nn3d opened 4 years ago

b4nn3d commented 4 years ago

hello there, i got your fork running on colab, and it works semi-fine. like i said in the title, the training stops after 4 ticks:

tick 0 kimg 0.1 lod 0.00 minibatch 32 time 58s sec/tick 57.7 sec/kimg 450.76 maintenance 0.0 gpumem 5.1
tick 1 kimg 6.1 lod 0.00 minibatch 32 time 12m 16s sec/tick 648.1 sec/kimg 107.73 maintenance 30.5 gpumem 5.1
tick 2 kimg 12.2 lod 0.00 minibatch 32 time 23m 19s sec/tick 644.3 sec/kimg 107.10 maintenance 18.0 gpumem 5.1
tick 3 kimg 18.2 lod 0.00 minibatch 32 time 34m 16s sec/tick 652.4 sec/kimg 108.45 maintenance 5.1 gpumem 5.1
^C

the ^C looks like a keyboard interrupt, but i didn't send one
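To see whether anything drifts before the run dies, the tick lines can be pulled apart with a small script. This is only a sketch: `parse_ticks` and the regex are hypothetical helpers written for the log format shown above, not part of the repo. In this log, `gpumem` stays at 5.1 on every tick, so GPU memory does not appear to be growing.

```python
import re

# Hypothetical helper: matches one tick record in the training log format above.
TICK_RE = re.compile(
    r"tick\s+(?P<tick>\d+)\s+kimg\s+(?P<kimg>[\d.]+).*?"
    r"sec/tick\s+(?P<sec_per_tick>[\d.]+).*?"
    r"gpumem\s+(?P<gpumem>[\d.]+)"
)


def parse_ticks(log_text):
    """Return one dict per tick record found in log_text."""
    return [
        {
            "tick": int(m["tick"]),
            "kimg": float(m["kimg"]),
            "sec_per_tick": float(m["sec_per_tick"]),
            "gpumem": float(m["gpumem"]),
        }
        for m in TICK_RE.finditer(log_text)
    ]


# Two records copied from the log in this issue.
LOG = (
    "tick 2 kimg 12.2 lod 0.00 minibatch 32 time 23m 19s sec/tick 644.3 "
    "sec/kimg 107.10 maintenance 18.0 gpumem 5.1\n"
    "tick 3 kimg 18.2 lod 0.00 minibatch 32 time 34m 16s sec/tick 652.4 "
    "sec/kimg 108.45 maintenance 5.1 gpumem 5.1\n"
)

for record in parse_ticks(LOG):
    print(record)
```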

skyflynil commented 4 years ago

Did you set 'metric' to none? There could be issues if you are running FID metric evaluation. I don't need metrics, so I did not do any testing or code changes for them. btw, I am able to train through Google Colab for > 10 ticks.

b4nn3d commented 4 years ago

I launched the training with this.

!python run_training.py --result-dir=results --data-dir=datasets --dataset=blow --config=config-f --total-kimg=12000 --mirror-augment=true --metric=none --min-h=3 --min-w=3 --res-log2=7

skyflynil commented 4 years ago

Could be a memory issue. You may try this to boost your instance memory: https://github.com/googlecolab/colabtools/issues/253
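Before launching training, it may help to check how much host RAM the instance actually has. The following is a minimal sketch, assuming a Linux runtime such as Colab's; the function names are illustrative and not part of this repo.

```python
import os


def total_ram_gib():
    """Total physical host RAM in GiB, via POSIX sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 2**30


def available_ram_gib(meminfo_path="/proc/meminfo"):
    """Currently available host RAM in GiB, or None if it cannot be read.

    Reads the MemAvailable field, which the Linux kernel reports in kB.
    """
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) / 2**20  # kB -> GiB
    except FileNotFoundError:
        pass  # not a Linux system, or /proc is not mounted
    return None


print(f"total host RAM: {total_ram_gib():.1f} GiB")
print(f"available host RAM: {available_ram_gib()}")
```

In a Colab notebook this would run in an ordinary Python cell, without the leading `!`. Note that it reports host RAM, which is separate from the `gpumem` column in the training log.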

b4nn3d commented 4 years ago

i got OOM when i was trying with a 512x512 dataset. this one was 384x384. in your example you train a 640x384 dataset, so i don't see how this could be a problem ;)

btw, i'm trying with 18764 images.. how big is your dataset?

skyflynil commented 4 years ago

I actually did use that high memory instance (25G memory) to train. I have tried 512x512 and 640x384 and both were running fine (around 25k files).

b4nn3d commented 4 years ago

ok, it was a memory issue. trained for 220 ticks with your method

jwb95 commented 4 years ago

Hi there, @b4nn3d did this work out for you?

I'm on a high memory instance.
2k images
256^2 dimensions

Launching with: !python run_training.py --num-gpus=1 --data-dir=./dataset --config=config-f --dataset=myset --mirror-augment=true --metric=none --total-kimg=2000 --min-h=4 --min-w=4 --res-log2=6

So far I've never seen more than tick 0:
tick 0 kimg 0.1 lod 0.00 minibatch 32 time 41s sec/tick 41.0 sec/kimg 320.50 maintenance 0.0 gpumem 6.1

Suggestions appreciated, cheers.