lion03 / thrust

Automatically exported from code.google.com/p/thrust
Apache License 2.0

improve cuda scan performance #294

Status: Closed (GoogleCodeExporter closed 8 years ago)

GoogleCodeExporter commented 8 years ago
1) Mark the block scan routines __forceinline__
2) Avoid shared memory writes in the serial phase (accumulate only)
3) Use RSS (reduce, scan, scan) instead of SSA (scan, scan, add) to save one write per element (total: 2 reads + 1 write)

Original issue reported on code.google.com by wnbell on 24 Jan 2011 at 7:26
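
For readers unfamiliar with the read/write count in item 3), here is a minimal host-side sketch of a reduce, scan, scan pass, assuming a plain integer sum. It is a sequential C++ stand-in, not the Thrust CUDA code: the function name `rss_inclusive_scan` and the `block_size` parameter are hypothetical, and each outer loop iteration plays the role of one CUDA block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum organised as reduce / scan / scan (RSS).
// Hypothetical helper for illustration; block_size stands in for the
// slice of input that one CUDA block would own.
std::vector<int> rss_inclusive_scan(const std::vector<int>& in, std::size_t block_size)
{
    assert(block_size > 0);
    const std::size_t n = in.size();
    std::vector<int> out(n);
    const std::size_t num_blocks = (n + block_size - 1) / block_size;

    // Phase 1 (reduce): read every element once, write one total per block.
    std::vector<int> block_sums(num_blocks, 0);
    for (std::size_t b = 0; b < num_blocks; ++b)
        for (std::size_t i = b * block_size; i < n && i < (b + 1) * block_size; ++i)
            block_sums[b] += in[i];

    // Phase 2 (scan): exclusive scan of the block totals gives each block's carry-in.
    int running = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const int total = block_sums[b];
        block_sums[b] = running;
        running += total;
    }

    // Phase 3 (scan): read every element a second time, write each output once.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        int acc = block_sums[b];
        for (std::size_t i = b * block_size; i < n && i < (b + 1) * block_size; ++i) {
            acc += in[i];
            out[i] = acc;
        }
    }
    return out;
}
```

Per element this reads the input twice (phases 1 and 3) and writes once (phase 3). A scan, scan, add pass would instead write a partial result in its first phase and overwrite it in its last, which is the extra write the issue wants to remove.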

GoogleCodeExporter commented 8 years ago
4) Use a raking approach for the block scan

Original comment by wnbell on 24 Jan 2011 at 7:36
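
As a rough illustration of items 2) and 4) together, the sketch below lays out a raking scan of one tile on the host. It is an assumption about the general technique, not the code that later landed in Thrust: `raking_inclusive_scan` and `num_rakers` are hypothetical names, the `tile` vector stands in for shared memory, and the loops over `r` stand in for a small group of raking threads (for example one warp).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place inclusive sum scan of one tile, structured like a raking block scan.
// Hypothetical helper for illustration only.
void raking_inclusive_scan(std::vector<int>& tile, std::size_t num_rakers)
{
    assert(num_rakers > 0);
    const std::size_t n = tile.size();
    const std::size_t seg = (n + num_rakers - 1) / num_rakers; // elements per raker

    // Serial phase, accumulate only: each raker sums its segment into a
    // local value and writes nothing back to the tile.
    std::vector<int> partial(num_rakers, 0);
    for (std::size_t r = 0; r < num_rakers; ++r)
        for (std::size_t i = r * seg; i < n && i < (r + 1) * seg; ++i)
            partial[r] += tile[i];

    // Exclusive scan across the rakers' partial sums (a warp scan on a GPU).
    int running = 0;
    for (std::size_t r = 0; r < num_rakers; ++r) {
        const int total = partial[r];
        partial[r] = running;
        running += total;
    }

    // Second serial pass: each raker rescans its segment, seeded with its
    // offset, and writes each tile element exactly once.
    for (std::size_t r = 0; r < num_rakers; ++r) {
        int acc = partial[r];
        for (std::size_t i = r * seg; i < n && i < (r + 1) * seg; ++i) {
            acc += tile[i];
            tile[i] = acc;
        }
    }
}
```

The point of the layout is that only the rakers do serial work over the tile, and the first serial pass touches the tile read-only.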

GoogleCodeExporter commented 8 years ago

Original comment by wnbell on 6 Feb 2011 at 6:26

GoogleCodeExporter commented 8 years ago
Revision f38f8eeaba40 implements 1) and 3). We'll make the remaining improvements in v1.6.

Original comment by wnbell on 1 Sep 2011 at 7:47

GoogleCodeExporter commented 8 years ago
5) Use device- and type-specific tuning parameters

Original comment by wnbell on 1 Sep 2011 at 7:49
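
One way item 5) could look is a compile-time policy keyed on architecture and value type. The sketch below assumes C++11; the name `scan_tuning`, the `Arch` encoding (a compute-capability code such as 130 or 200), and every number in it are hypothetical placeholders, not measured values and not Thrust's actual parameters.

```cpp
// Hypothetical tuning policy for a scan kernel, selected at compile time.
// All thresholds and sizes are illustrative placeholders.
template <int Arch, typename T>
struct scan_tuning
{
    // Device-specific: assume newer architectures prefer larger blocks.
    static constexpr int block_threads = (Arch >= 200) ? 256 : 128;

    // Type-specific: fewer items per thread for wider types, so the bytes
    // handled per thread stay roughly constant.
    static constexpr int items_per_thread =
        (sizeof(T) <= 4) ? 8 : ((sizeof(T) <= 8) ? 4 : 2);

    // Elements processed by one block per pass.
    static constexpr int tile_size = block_threads * items_per_thread;
};
```

A kernel launcher could then read `scan_tuning<200, float>::tile_size` when sizing its grid, and real values would come from benchmarking each device and type.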

GoogleCodeExporter commented 8 years ago
Issue 202 has been merged into this issue.

Original comment by wnbell on 1 Sep 2011 at 7:49

GoogleCodeExporter commented 8 years ago

Original comment by wnbell on 24 Jan 2012 at 1:55

GoogleCodeExporter commented 8 years ago
Forwarded to https://github.com/thrust/thrust/issues/47

Original comment by jaredhoberock on 7 May 2012 at 8:52