implement arch-specific tuning for cuda::fill() #286

GoogleCodeExporter commented 8 years ago

Writes on Fermi devices appear to be faster with 64-bit words, while pre-Fermi
devices are faster with 32-bit words. The best wide_type should be chosen
per architecture accordingly. Benchmark output for three devices:

   <device name="GeForce 8800 GTS 512">
   <result  name="Bandwidth"  value="1.20713"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="2.37621"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="41.2306"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="40.7699"  units="GBytes/s"/>

   <device name="GeForce GTX 280">
   <result  name="Bandwidth"  value="33.9988"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="51.6067"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="75.0001"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="69.1726"  units="GBytes/s"/>

   <device name="GeForce GTX 480">
   <result  name="Bandwidth"  value="74.1055"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="136.304"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="146.078"  units="GBytes/s"/>
   <result  name="Bandwidth"  value="156.971"  units="GBytes/s"/>
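
To illustrate the idea, here is a minimal host-side C++ sketch of choosing a word width per architecture. It is not the code from the actual fix: the names `wide_type_for`, `replicate` and `wide_fill` are hypothetical, and keying the choice on the compute capability major version (2 for Fermi, 1 for earlier devices) is an assumption about how the selection could be expressed. A real CUDA version would have each thread store one wide word instead of looping on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical trait (not Thrust's implementation): choose the store width
// from the compute capability major version. Fermi is 2.x, so it gets
// 64-bit words; earlier architectures get 32-bit words.
template <int ComputeMajor>
struct wide_type_for {
    using type = typename std::conditional<(ComputeMajor >= 2),
                                           std::uint64_t,
                                           std::uint32_t>::type;
};

// Replicate a one-byte value into every byte of a wide word.
template <typename Wide>
Wide replicate(std::uint8_t value) {
    Wide w = 0;
    for (std::size_t i = 0; i < sizeof(Wide); ++i)
        w = static_cast<Wide>((w << 8) | value);
    return w;
}

// Host stand-in for a wide fill: byte stores until the pointer is aligned
// for Wide, then wide stores, then byte stores for the remaining tail.
template <typename Wide>
void wide_fill(std::uint8_t* first, std::size_t n, std::uint8_t value) {
    assert(first != nullptr || n == 0);
    const Wide word = replicate<Wide>(value);
    std::size_t i = 0;

    // Unaligned head.
    while (i < n &&
           reinterpret_cast<std::uintptr_t>(first + i) % sizeof(Wide) != 0)
        first[i++] = value;

    // Aligned body, one wide word per store.
    for (; i + sizeof(Wide) <= n; i += sizeof(Wide))
        std::memcpy(first + i, &word, sizeof(Wide));

    // Tail shorter than one wide word.
    for (; i < n; ++i)
        first[i] = value;
}
```

With this sketch, `wide_fill<wide_type_for<2>::type>(ptr, n, v)` would use 64-bit stores and `wide_fill<wide_type_for<1>::type>(ptr, n, v)` would use 32-bit stores; only the trait needs to change if a different width turns out to be faster on some architecture.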

Original issue reported on code.google.com by wnbell on 14 Dec 2010 at 5:55

GoogleCodeExporter commented 8 years ago

Issue 200 has been merged into this issue.

Original comment by wnbell on 17 Dec 2010 at 5:01

GoogleCodeExporter commented 8 years ago

This issue was closed by revision 9dc2784714.

Original comment by wnbell on 8 Feb 2011 at 6:01

Changed state: Fixed
