Incremental training on the original cunet model, the training speed is very slow

nagadomi / waifu2x

Issue #369 (closed by CarbonPool 3 years ago)

CarbonPool commented 3 years ago

[Screenshots: demo, cost1]

When I removed the "resume" parameter, the training speed returned to normal.

nagadomi commented 3 years ago

Isn't it a disk cache issue? For example, the first time is slow and the second time is fast. Do you have enough RAM to not use swap?

CarbonPool commented 3 years ago

I reserved 28 GB of free memory. Without the "-resume models/cunet/art/noise3_model.t7" parameter, training takes only 4 minutes per round; with it, a round takes 30 minutes and GPU utilization is very low. I don't know whether this is a WSL2 problem; I may need to test in a non-virtualized environment.

CarbonPool commented 3 years ago

In addition, if I train with the cunet model and pass "-resume models/my_cunet/noise3_model.t7", training runs normally. The problem occurs only with the original "cunet/art/noise3_model.t7".

nagadomi commented 3 years ago

I got it. It is a model loading issue in train.lua. For compatibility reasons, all models in the models directory use cunn (Torch's implementation) instead of cudnn for the convolution layers. train.lua uses the loaded model as is, so cudnn is not used.

Replacing the line at

https://github.com/nagadomi/waifu2x/blob/44503fb4c013d4aa7fc1434a5ade2f5a7c85a263/train.lua#L529

with

model = w2nn.load_model(settings.resume, settings.backend == "cudnn", "ascii")

will probably fix it.
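To illustrate why the resumed model was slow, here is a toy sketch in plain Python. It is not Torch or waifu2x code: `Conv`, `Model`, `load_as_saved`, and `load_with_backend` are hypothetical names made up for this example. It only assumes what is described above, namely that a saved model carries its convolution backend with it, and that a loader has to convert the layers explicitly if a different backend is wanted.

```python
from dataclasses import dataclass


@dataclass
class Conv:
    """Toy stand-in for a convolution layer (hypothetical, not a Torch class)."""
    backend: str  # "cunn" or "cudnn"


@dataclass
class Model:
    """Toy stand-in for a serialized network (hypothetical)."""
    layers: list


def load_as_saved(saved: Model) -> Model:
    # Old behaviour as described above: the model is used exactly as stored,
    # so every layer keeps the backend it was saved with.
    return Model([Conv(layer.backend) for layer in saved.layers])


def load_with_backend(saved: Model, force_cudnn: bool) -> Model:
    # Behaviour of the proposed fix, in spirit: after loading, switch the
    # convolution layers to cudnn when that backend was requested.
    model = load_as_saved(saved)
    if force_cudnn:
        for layer in model.layers:
            layer.backend = "cudnn"
    return model


# Models shipped in the models directory are stored with cunn layers.
shipped = Model([Conv("cunn"), Conv("cunn")])

resumed_as_saved = load_as_saved(shipped)
resumed_converted = load_with_backend(shipped, force_cudnn=True)

print([layer.backend for layer in resumed_as_saved.layers])
print([layer.backend for layer in resumed_converted.layers])
```

In this toy model, resuming from a shipped file without conversion keeps every layer on cunn regardless of the requested backend. A model saved from a cudnn training run would already contain cudnn layers, which is consistent with the report that resuming from "models/my_cunet/noise3_model.t7" ran at normal speed.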

CarbonPool commented 3 years ago

Thanks, it worked for me.

nagadomi commented 3 years ago

I pushed the above changes to the master branch. I haven't tested it.