google-code-export / thrust

Automatically exported from code.google.com/p/thrust
Apache License 2.0
2 stars 2 forks source link

Thrust statically assumes warp_size == 32 #79

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This causes badness with external tools such as Ocelot [1].

[1] 
http://groups.google.com/group/gpuocelot/browse_thread/thread/19fa47041fb385

Original issue reported on code.google.com by jaredhoberock on 27 Jan 2010 at 8:39

GoogleCodeExporter commented 9 years ago
I think Revision 5f7d2ac96e resolved this problem

Original comment by wnbell on 7 Apr 2010 at 9:01

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Revised version of ocelot reports the shared memory race condition:

Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of 
'hydrazine::Exception'
  what():  [PC 101] [thread 255] [cta 1] st.shared.u32 [%r8 + 0], %r19 - Shared memory race condition, address 0x7f8 was previously read by thread 254 without a memory barrier in between.
Near 
/usr/include/cuda/thrust/detail/device/cuda/detail/stable_radix_sort.inl:881:0

Aborted

----------------------------------

Running a debugger points to

TestRadixSort<char>::operator()(const size_t n=1023)

Original comment by ryu...@gmail.com on 2 Jul 2010 at 6:12

GoogleCodeExporter commented 9 years ago
Thanks for the report Ryuuta.  If you insert a __syncthreads() at line 889 of 
stable_radix_sort.inl is the race condition eliminated?

Original comment by wnbell on 2 Jul 2010 at 6:32

GoogleCodeExporter commented 9 years ago
Unfortunately, no.

I still got the same race condition:

Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of 
'hydrazine::Exception'
  what():  [PC 102] [thread 255] [cta 1] st.shared.u32 [%r8 + 0], %r19 - Shared memory race condition, address 0x7f8 was previously read by thread 254 without a memory barrier in between.
Near 
/usr/include/cuda/thrust/detail/device/cuda/detail/stable_radix_sort.inl:881:0

Aborted

Original comment by ryu...@gmail.com on 2 Jul 2010 at 6:51

GoogleCodeExporter commented 9 years ago
Hrm, I can't seem to reproduce that error with Ocelot r611 and the most recent 
version of Thrust.  Any ideas?

nathan@rabota:~/NV/thrust/testing$ nvcc -I../ --cuda cuda/radix_sort.cu
nathan@rabota:~/NV/thrust/testing$ nvcc -I../ --cuda testframework.cu
nathan@rabota:~/NV/thrust/testing$ g++ -o tester radix_sort.cu.cpp  
testframework.cu.cpp `OcelotConfig -l`
nathan@rabota:~/NV/thrust/testing$ ./tester --verbose --device=1
==Ocelot== WARNING: Could not parse config file 'configure.ocelot', loading 
defaults.
There are 3 devices supporting CUDA

Device 0: "Quadro NVS 160M"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 267714560 bytes

Device 1: "Ocelot PTX Emulator"  [SELECTED]
  Major revision number:                         2
  Minor revision number:                         0
  Total amount of global memory:                 464691200 bytes

Device 2: "Ocelot Multicore CPU Backend (LLVM-JIT)"
  Major revision number:                         2
  Minor revision number:                         0
  Total amount of global memory:                 464596992 bytes

Running 4 unit tests.
[PASS]              TestRadixSort
[PASS]              TestRadixSortKeySimple<thrust::device_vector>
[PASS]              TestRadixSortUnalignedSimple
[PASS]              TestRadixSortVariableBits

================================================================
Totals: 0 failures, 0 known failures and 0 errors

Original comment by wnbell on 2 Jul 2010 at 9:07

GoogleCodeExporter commented 9 years ago
I forgot to mention you need to turn on race detector in configure.ocelot.
My apology.

All you need to do is to place the configure.ocelot in your working directory.
Here's the file:

{
    ocelot: "ocelot",
    version: "1.1.520",
    trace: { 
        enabled: true,
        database: "traces/database.trace",
        memory: false,
        branch: false,
        sharedComputation: false,
        instruction: false,
        parallelism: false,
        cacheSimulator: false,
        memoryChecker: true,
        raceDetector: true,
        warpSynchronous: {
            enabled: false,
            emitHotPaths: true
        },
        performanceBound: {
            enabled: false,
            protocol: "sm_20"
        },
        convergence: {
            enabled: false,
            logfile: "traces/convergence.csv",
            dot: true,
            render: true
        }
    },
    cuda: {
        implementation: CudaRuntime,
        runtimeApiTrace: "traces/CudaAPI.trace"
    },
    executive: {
        devices: [ nvidia, llvm, emulated ],
        optimizationLevel: basic,
        workerThreadLimit: 1
    }
}

Original comment by ryu...@gmail.com on 2 Jul 2010 at 11:51

GoogleCodeExporter commented 9 years ago
Thanks Ryuta.  I now observe an error, but it's different from the one you 
reported.  Which compiler are you using?  I'm currently using CUDA 3.1.

nathan@rabota:~/NV/thrust/testing$ scons backend=ocelot cuda/radix_sort.o 
testframework.o && g++ -o tester cuda/radix_sort.o testframework.o 
`OcelotConfig -l`
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
nvcc -o cuda/radix_sort.o -c -arch=sm_10 -Xcompiler 
-DTHRUST_DEVICE_BACKEND=THRUST_DEVICE_BACKEND_CUDA -Xcompiler -O2 -I 
/usr/local/cuda/include -I /home/nathan/NV/thrust -I 
/home/nathan/NV/thrust/testing cuda/radix_sort.cu
scons: `testframework.o' is up to date.
scons: done building targets.
nathan@rabota:~/NV/thrust/testing$ ./tester --device=1 --verbose
There are 3 devices supporting CUDA

Device 0: "Quadro NVS 160M"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 267714560 bytes

Device 1: "Ocelot PTX Emulator"  [SELECTED]
  Major revision number:                         2
  Minor revision number:                         0
  Total amount of global memory:                 462311424 bytes

Device 2: "Ocelot Multicore CPU Backend (LLVM-JIT)"
  Major revision number:                         2
  Minor revision number:                         0
  Total amount of global memory:                 462200832 bytes

Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of 
'hydrazine::Exception'
  what():  [PC 112] [thread 255] [cta 0] ld.shared.u64 %r85, [_ZN6thrust6detail6device4cuda6detail5sMem2E + 2044] - Memory access 0x7fc is not aligned to the access size ( 8 bytes )
Near 
/usr/local/cuda/bin/../include/thrust/detail/device/cuda/detail/stable_radix_sor
t.inl:420:0

Aborted

Original comment by wnbell on 3 Jul 2010 at 3:19

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hi Nathan,

Based on the report by Ocelot that the race condition occurred at shared memory 
store
instruction, inserting a __syncthreads() at line 876 of stable_radix_sort.inl
fixed a race condition. At least Ocelot no longer complains the race condition.
Does this make sense?

Original comment by ryu...@gmail.com on 16 Jul 2010 at 3:09

GoogleCodeExporter commented 9 years ago
FIY, I found the same race condition at line 473 of stable_radix_sort.inl
Inserting __syncthreads() at line 467 fixed the problem.

Original comment by ryu...@gmail.com on 17 Jul 2010 at 3:04

GoogleCodeExporter commented 9 years ago
Hi Nathan,

it turns out that the race condition in this radix sort implementation was
false positive caused by ocelot:

http://groups.google.com/group/gpuocelot/browse_thread/thread/a4ca5196152f4713?h
l=en

Sorry for all the trouble.

I guess the radix sort in thrust will soon be supplanted by Duane Merrill's 
implementation so this might not be useful after all.

Original comment by ryu...@gmail.com on 29 Jul 2010 at 12:35

GoogleCodeExporter commented 9 years ago
I think this issue should be merged into 
http://code.google.com/p/thrust/issues/detail?id=213

Original comment by ryu...@gmail.com on 14 Sep 2010 at 8:09

GoogleCodeExporter commented 9 years ago
I think this must be fixed by now. Thrust 1.6 uses Merrill's sort.

Original comment by jaredhoberock on 7 May 2012 at 9:16