koba-jon / pytorch_cpp

Deep Learning sample programs using PyTorch in C++
MIT License

Training in C++? #8

Open meet-minimalist opened 3 years ago

meet-minimalist commented 3 years ago

Hey, you have shown such amazing work training NNs in C++. I would like to know the reasons why you started training models in C++ instead of Python. Once the model definition has been written in PyTorch in Python and the data pipeline has been set up, all of the heavy computation runs on the GPU anyway, so there won't be drastic performance gains when migrating from Python to C++. Please share some of your thoughts on this.

koba-jon commented 3 years ago

I was interested in implementing NN programs in C++ and wanted to improve my C++ coding ability, so I decided to write this code. However, I have also investigated how much the speed differs between Python and C++.

I found a strange result: there are cases in which training runs faster in Python than in C++. Here, the batch size, image size, and almost all other components are matched between Python and C++.

NN model used for training and testing in C++: https://github.com/koba-jon/pytorch_cpp/tree/master/Dimensionality_Reduction/AE2d
My article for details (Japanese only): https://qiita.com/koba-jon/items/274e5e4970da72216f73

CPU: Core i7-8700, GPU: GeForce GTX 1070

| | CPU only, Python | CPU only, C++ | GPU, Python | GPU, C++ (cuDNN non-deterministic) | GPU, C++ (cuDNN deterministic) |
|---|---|---|---|---|---|
| Training time [time/epoch] | 1h04m49s | 1h03m00s | 5m53s | 7m42s | 17m36s |
| GPU memory [MiB] | 2 | 9 | 933 | 913 | 2941 |
| Testing speed [seconds/data] | 0.01189 | 0.01477 | 0.00102 | 0.00101 | 0.00101 |

Training in C++ is slower when the GPU is used. I have identified that the delay comes from the "forward" and "backward" passes, not from the parts I wrote myself. The GPU run is still faster than CPU-only, so CUDA is clearly being used. But I suspect that the way PyTorch drives the GPU may differ between Python and C++.
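For anyone reproducing this, one thing worth ruling out is a mismatch in the CUDA/cuDNN configuration between the two languages. Below is a minimal sketch of how I think the C++ side can be checked and aligned; as far as I know, the `at::globalContext()` switches are the C++ counterparts of `torch.backends.cudnn.deterministic` / `benchmark` in Python, but treat the exact flags as an assumption rather than a statement about my benchmark code.

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    // Confirm that the LibTorch build actually sees CUDA and cuDNN.
    std::cout << std::boolalpha
              << "CUDA available:  " << torch::cuda::is_available() << "\n"
              << "cuDNN available: " << torch::cuda::cudnn_is_available() << "\n"
              << "CUDA devices:    " << torch::cuda::device_count() << "\n";

    // Align cuDNN algorithm selection with the Python side: benchmark
    // (non-deterministic) mode lets cuDNN pick the fastest convolution
    // algorithm, which can change speed considerably.
    at::globalContext().setDeterministicCuDNN(false);
    at::globalContext().setBenchmarkCuDNN(true);

    // A tensor created on the GPU confirms that kernels really run there.
    if (torch::cuda::is_available()) {
        auto x = torch::randn({1, 3, 64, 64}, torch::kCUDA);
        std::cout << "tensor device:   " << x.device() << std::endl;
    }
}
```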

I have heard reports that C++ is faster when training models with only fully connected layers. I plan to investigate this around March.

meet-minimalist commented 3 years ago

Huge thanks for these interesting insights. I suspect the reason for the high training time in C++ could be an under-optimized data pipeline, or CPU-to-GPU and GPU-to-CPU copies that are serialized instead of running as parallel asynchronous copies. Still, further investigation might help. I was also curious about this because many people used to train models in C++, and I wonder what on earth forced them to do it. :-P

koba-jon commented 3 years ago

This is a follow-up report.

I benchmarked again using the following three kinds of neural networks in PyTorch v1.8.0, measuring training speed in iterations per second (a minimal measurement sketch follows the tables below).

My article for details (Japanese only): https://qiita.com/koba-jon/items/59a64c6ec38ac7286d6b

1. Only Fully Connected Layers Model: AE1d

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU, cuDNN non-deterministic |
|---|---|---|---|
| Python [iterations/s] | 86.83 | 97.69 | 97.69 |
| C++ [iterations/s] | 312.6 | 312.6 | 312.6 |
| Speed-up (Python → C++) | ×3.6 | ×3.2 | ×3.2 |

2. Only Convolutional Layers Model: Discriminator

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU, cuDNN non-deterministic |
|---|---|---|---|
| Python [iterations/s] | 5.24 | 27.59 | 39.08 |
| C++ [iterations/s] | 4.51 | 26.8 | 36.08 |
| Speed-up (Python → C++) | ×0.86 | ×0.97 | ×0.92 |

3. Convolutional and Transposed Convolutional Layers Model: AE2d

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU, cuDNN non-deterministic |
|---|---|---|---|
| Python [iterations/s] | 1.14 | 9.56 | 14.39 |
| C++ [iterations/s] | 1.05 | 9.16 | 13.44 |
| Speed-up (Python → C++) | ×0.92 | ×0.96 | ×0.93 |
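For reference, iterations per second can be measured with a loop of roughly the following shape. This is a minimal sketch with a placeholder fully connected model and random data, not my actual benchmark code; it assumes `torch::cuda::synchronize()` is available in the LibTorch version used (it is in recent releases such as 1.8).

```cpp
#include <torch/torch.h>
#include <chrono>
#include <functional>
#include <iostream>

// Run `step` repeatedly and report throughput in iterations per second.
// The GPU is synchronized before stopping the clock so that queued,
// asynchronous CUDA kernels are included in the measurement.
double iterations_per_second(const std::function<void()>& step,
                             int iterations, bool use_cuda) {
    for (int i = 0; i < 10; ++i) step();   // warm-up (cuDNN autotuning, allocator)
    if (use_cuda) torch::cuda::synchronize();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) step();
    if (use_cuda) torch::cuda::synchronize();
    const auto stop = std::chrono::steady_clock::now();

    return iterations / std::chrono::duration<double>(stop - start).count();
}

int main() {
    const bool use_cuda = torch::cuda::is_available();
    const torch::Device device(use_cuda ? torch::kCUDA : torch::kCPU);

    // Placeholder fully connected autoencoder-like model and a random batch.
    torch::nn::Sequential model(torch::nn::Linear(512, 64),
                                torch::nn::ReLU(),
                                torch::nn::Linear(64, 512));
    model->to(device);
    torch::optim::SGD optimizer(model->parameters(), /*lr=*/0.01);
    const auto batch = torch::randn({64, 512}, device);

    // One training iteration: forward, loss, backward, parameter update.
    const auto step = [&]() {
        optimizer.zero_grad();
        auto loss = torch::mse_loss(model->forward(batch), batch);
        loss.backward();
        optimizer.step();
    };

    std::cout << iterations_per_second(step, 200, use_cuda)
              << " iterations/s" << std::endl;
}
```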

As shown above, compared to the earlier result, the speed of "AE2d" in C++ has improved considerably. However, it still couldn't beat the execution speed of Python.

Looking at the details, the convolutional and transposed convolutional layers seem to be the bottleneck. However, for models with only fully connected layers, execution in C++ is much faster than in Python. That alone may make training in C++ worthwhile.

I look forward to future improvements in the PyTorch C++ API for models 2 and 3.

meet-minimalist commented 3 years ago

Thanks a lot for such detailed experiments. One more thing I would like to share that I recently discovered: when transferring training data from RAM to GPU memory, people generally use pinned memory, a designated area of RAM from which copies into GPU memory are faster. I have seen this while working with TensorRT in C++, where an input tensor is allocated in a pinned memory area, and once the data is in that pinned memory, a memcpy call transfers it to the GPU for further computation. This may give you some additional boost in the C++ timings.
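For what it's worth, the same idea can be tried directly with LibTorch tensors: pin the host-side batch with `Tensor::pin_memory()` and pass `non_blocking=true` to the device copy so the transfer can overlap with GPU work. A minimal sketch, with placeholder shapes rather than this repository's dataloader:

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    if (!torch::cuda::is_available()) {
        std::cout << "CUDA not available, nothing to demonstrate." << std::endl;
        return 0;
    }
    const torch::Device gpu(torch::kCUDA);

    // Simulated mini-batch produced by a CPU-side data pipeline.
    auto batch = torch::randn({16, 3, 256, 256});

    // Copy the batch into page-locked (pinned) host memory. Transfers
    // from pinned memory to the GPU are faster and can be asynchronous.
    auto pinned = batch.pin_memory();

    // non_blocking=true lets the host-to-device copy overlap with other
    // work on the current CUDA stream instead of blocking the CPU.
    auto on_gpu = pinned.to(gpu, /*non_blocking=*/true);

    // Any subsequent kernel on the same stream sees the finished copy.
    auto mean = on_gpu.mean();
    std::cout << "mean on " << mean.device() << ": "
              << mean.item<float>() << std::endl;
}
```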

PS: Please write a paper or Medium article about your findings alongside the qiita.com blog, because the world needs to know about them. Keep it up.

koba-jon commented 3 years ago

Thank you for sharing this information. I will follow the page below and try to improve the "dataloader" class. https://pytorch.org/docs/stable/data.html#memory-pinning

Please look forward to a follow-up report.