ottolu opened this issue 8 years ago
Hi @nomi-wei , just a clarification: our fast convnet algorithms use Winograd's convolution algorithms. But the same Shmuel Winograd did co-author the Coppersmith-Winograd fast matrix multiplication algorithm, so the confusion is understandable (I probably should not even mention that Winograd also devised fast DFT algorithms ;-)
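To make the distinction concrete, here is a toy sketch of the 1-D F(2,3) minimal-filtering case — just an illustration of the idea, not the actual kernel code, and the function name is my own:

```cpp
// Hypothetical illustration: the 1-D Winograd minimal-filtering algorithm
// F(2,3) computes two outputs of a 3-tap filter with 4 multiplications
// instead of the naive 6.
void winograd_f2x3(const float d[4], const float g[3], float y[2])
{
    // Filter transform (can be precomputed once per filter).
    const float G0 = g[0];
    const float G1 = 0.5f * (g[0] + g[1] + g[2]);
    const float G2 = 0.5f * (g[0] - g[1] + g[2]);
    const float G3 = g[2];

    // Element-wise products: only 4 multiplications.
    const float m0 = (d[0] - d[2]) * G0;
    const float m1 = (d[1] + d[2]) * G1;
    const float m2 = (d[2] - d[1]) * G2;
    const float m3 = (d[1] - d[3]) * G3;

    // Inverse transform back to the two outputs.
    y[0] = m0 + m1 + m2;   // == d0*g0 + d1*g1 + d2*g2
    y[1] = m1 - m2 - m3;   // == d1*g0 + d2*g1 + d3*g2
}
```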
@andravin Ha-ha, my bad. Thanks for your clarification, it's really helpful. ;-) I didn't get the book you referenced, so I thought you might be using Winograd's famous matrix multiplication method for this. LOL. No wonder I still found it hard to understand your approach, even after learning these matrix multiplication algorithms from scratch over the last few days.
Thanks again! BTW, Andrew, I wonder if you're still working on improving these conv kernels? If so, that would be awesome.
> I think in theory 8-bit is enough to carry the information with quantization.
Googling for dp4a turns up a thread with Scott Gray in it as the first hit :-) https://devtalk.nvidia.com/default/topic/934562/cuda-programming-and-performance/nvidia-pascal-geforce-gtx-1080-amp-gtx-1070/post/4889687/ So I would say he's aware of it :-)
I was actually pondering dabbling with ints way back in 2014 http://computer-go.org/pipermail/computer-go/2014-December/007105.html ... but it's just one of many things that never survived contact with finite-hours-in-the-day :-)
Considering the effort involved in making GPUs work, and work quickly, I would think the first thing to do might be to demonstrate, using normal CPU code, that you can get ok results? You could just fire up torch and create torch.ByteTensors.
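If torch isn't handy, roughly the same sanity check in plain C++ might look like this — just a sketch of the idea, with simple symmetric per-tensor scales as an assumption:

```cpp
// Hypothetical CPU sanity check: quantize two float vectors to int8 with a
// symmetric scale, do the dot product in integers, and compare the
// dequantized result against the float reference.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> a(64), b(64);
    for (int i = 0; i < 64; ++i) { a[i] = std::sin(0.1f * i); b[i] = std::cos(0.07f * i); }

    auto max_abs = [](const std::vector<float>& v) {
        float m = 0.f; for (float x : v) m = std::max(m, std::fabs(x)); return m;
    };
    const float sa = max_abs(a) / 127.f, sb = max_abs(b) / 127.f;  // symmetric scales

    long long acc = 0;
    float ref = 0.f;
    for (int i = 0; i < 64; ++i) {
        int8_t qa = (int8_t)std::lround(a[i] / sa);
        int8_t qb = (int8_t)std::lround(b[i] / sb);
        acc += (int)qa * (int)qb;   // same arithmetic dp4a does, 4 elements at a time
        ref += a[i] * b[i];
    }
    std::printf("int8 dot = %f, float dot = %f\n", acc * sa * sb, ref);
    return 0;
}
```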
A few questions which occur:
Hmmm, I'm simply reciting back to you the questions that were stated to me when I mentioned the idea myself :-) http://computer-go.org/pipermail/computer-go/2014-December/007106.html
Thanks @scott-gray @andravin for the awesome Winograd work. That really makes small conv kernels run super fast! And the cuDNN team implemented it so rapidly and nicely, which really makes life much easier. Good job @jdemouth!
After these fancy ideas, I can't help wondering: what can we do to speed up training next?
Following the path of mathematics-based matrix multiplication optimization, with Winograd we have likely reached the roof. On top of that, I know François Le Gall did some great work around 2014, but no breakthrough improvement. (Edit: my bad, thanks @andravin for the reminder.) Another interesting thing is the lack of FP16x2 support on GP104, which we were really looking forward to; instead, we got full-throughput dp4a, a very powerful int8 compute capability. I think in theory 8 bits is enough to carry the information with quantization. But how we could make good use of dp4a in training would be interesting.
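To make the dp4a part concrete, here is a minimal sketch of the kind of kernel it enables — the `__dp4a` intrinsic needs CUDA 8 and sm_61, and the kernel name and data layout are just my assumptions:

```cuda
// Hypothetical sketch: dot product of two int8 vectors, 4 elements at a time.
// a and b each hold 4 signed 8-bit values packed into every 32-bit int;
// the accumulator stays in 32-bit, which is what makes int8 training tricky.
__global__ void dot_int8(const int* a, const int* b, int* out, int n_packed)
{
    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n_packed;
         i += gridDim.x * blockDim.x) {
#if __CUDA_ARCH__ >= 610
        acc = __dp4a(a[i], b[i], acc);  // 4 int8 multiply-adds per instruction
#else
        // Fallback for older architectures: unpack and accumulate manually.
        const char4 av = reinterpret_cast<const char4*>(a)[i];
        const char4 bv = reinterpret_cast<const char4*>(b)[i];
        acc += av.x * bv.x + av.y * bv.y + av.z * bv.z + av.w * bv.w;
#endif
    }
    atomicAdd(out, acc);
}
```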
So I wonder if @scott-gray @andravin @jdemouth @hughperkins @bhack @soumith etc. could share any ideas about this? Sorry I can't cc all of you lovely folks in the community who care about and contribute to DL performance. Any ideas are warmly welcome!
@soumith, if you think this isn't the proper place to discuss this topic, please help me close it, and sorry for the bother. ;)