I think Revision 5f7d2ac96e resolved this problem.
Original comment by wnbell
on 7 Apr 2010 at 9:01
A revised version of Ocelot reports the shared memory race condition:
Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of
'hydrazine::Exception'
what(): [PC 101] [thread 255] [cta 1] st.shared.u32 [%r8 + 0], %r19 - Shared memory race condition, address 0x7f8 was previously read by thread 254 without a memory barrier in between.
Near
/usr/include/cuda/thrust/detail/device/cuda/detail/stable_radix_sort.inl:881:0
Aborted
----------------------------------
Running a debugger points to
TestRadixSort<char>::operator()(const size_t n=1023)
Original comment by ryu...@gmail.com
on 2 Jul 2010 at 6:12
Thanks for the report, Ryuuta. If you insert a __syncthreads() at line 889 of
stable_radix_sort.inl, is the race condition eliminated?
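For context, the hazard Ocelot is flagging is a write-after-read on a shared memory slot: one thread stores to an address that a neighboring thread read, with no barrier in between. A minimal, hypothetical sketch of the pattern and of where the suggested barrier would go (this is illustrative only, not the actual stable_radix_sort.inl code):

```cuda
// Illustrative sketch only -- not the actual Thrust radix sort code.
// Each thread reads a neighbor's shared memory slot and then writes
// back into shared memory. Without the second __syncthreads(), thread
// t may overwrite s[t] while thread t+1 is still reading it -- exactly
// the write-after-read hazard Ocelot's race detector reports.
__global__ void scan_step(unsigned int *out)
{
    __shared__ unsigned int s[256];
    unsigned int t = threadIdx.x;

    s[t] = t;
    __syncthreads();

    // Read a neighboring thread's slot...
    unsigned int v = (t > 0) ? s[t - 1] : 0;

    __syncthreads();   // <-- the kind of barrier suggested above:
                       // all reads complete before any thread writes

    // ...then store back into shared memory.
    s[t] = v + s[t];
    __syncthreads();

    out[t] = s[t];
}
```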
Original comment by wnbell
on 2 Jul 2010 at 6:32
Unfortunately, no.
I still get the same race condition:
Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of
'hydrazine::Exception'
what(): [PC 102] [thread 255] [cta 1] st.shared.u32 [%r8 + 0], %r19 - Shared memory race condition, address 0x7f8 was previously read by thread 254 without a memory barrier in between.
Near
/usr/include/cuda/thrust/detail/device/cuda/detail/stable_radix_sort.inl:881:0
Aborted
Original comment by ryu...@gmail.com
on 2 Jul 2010 at 6:51
Hrm, I can't seem to reproduce that error with Ocelot r611 and the most recent
version of Thrust. Any ideas?
nathan@rabota:~/NV/thrust/testing$ nvcc -I../ --cuda cuda/radix_sort.cu
nathan@rabota:~/NV/thrust/testing$ nvcc -I../ --cuda testframework.cu
nathan@rabota:~/NV/thrust/testing$ g++ -o tester radix_sort.cu.cpp
testframework.cu.cpp `OcelotConfig -l`
nathan@rabota:~/NV/thrust/testing$ ./tester --verbose --device=1
==Ocelot== WARNING: Could not parse config file 'configure.ocelot', loading
defaults.
There are 3 devices supporting CUDA
Device 0: "Quadro NVS 160M"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Device 1: "Ocelot PTX Emulator" [SELECTED]
Major revision number: 2
Minor revision number: 0
Total amount of global memory: 464691200 bytes
Device 2: "Ocelot Multicore CPU Backend (LLVM-JIT)"
Major revision number: 2
Minor revision number: 0
Total amount of global memory: 464596992 bytes
Running 4 unit tests.
[PASS] TestRadixSort
[PASS] TestRadixSortKeySimple<thrust::device_vector>
[PASS] TestRadixSortUnalignedSimple
[PASS] TestRadixSortVariableBits
================================================================
Totals: 0 failures, 0 known failures and 0 errors
Original comment by wnbell
on 2 Jul 2010 at 9:07
I forgot to mention that you need to turn on the race detector in configure.ocelot.
My apologies.
All you need to do is place the configure.ocelot file in your working directory.
Here's the file:
{
    ocelot: "ocelot",
    version: "1.1.520",
    trace: {
        enabled: true,
        database: "traces/database.trace",
        memory: false,
        branch: false,
        sharedComputation: false,
        instruction: false,
        parallelism: false,
        cacheSimulator: false,
        memoryChecker: true,
        raceDetector: true,
        warpSynchronous: {
            enabled: false,
            emitHotPaths: true
        },
        performanceBound: {
            enabled: false,
            protocol: "sm_20"
        },
        convergence: {
            enabled: false,
            logfile: "traces/convergence.csv",
            dot: true,
            render: true
        }
    },
    cuda: {
        implementation: CudaRuntime,
        runtimeApiTrace: "traces/CudaAPI.trace"
    },
    executive: {
        devices: [ nvidia, llvm, emulated ],
        optimizationLevel: basic,
        workerThreadLimit: 1
    }
}
Original comment by ryu...@gmail.com
on 2 Jul 2010 at 11:51
Thanks, Ryuta. I now observe an error, but it's different from the one you
reported. Which compiler are you using? I'm currently using CUDA 3.1.
nathan@rabota:~/NV/thrust/testing$ scons backend=ocelot cuda/radix_sort.o
testframework.o && g++ -o tester cuda/radix_sort.o testframework.o
`OcelotConfig -l`
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
nvcc -o cuda/radix_sort.o -c -arch=sm_10 -Xcompiler
-DTHRUST_DEVICE_BACKEND=THRUST_DEVICE_BACKEND_CUDA -Xcompiler -O2 -I
/usr/local/cuda/include -I /home/nathan/NV/thrust -I
/home/nathan/NV/thrust/testing cuda/radix_sort.cu
scons: `testframework.o' is up to date.
scons: done building targets.
nathan@rabota:~/NV/thrust/testing$ ./tester --device=1 --verbose
There are 3 devices supporting CUDA
Device 0: "Quadro NVS 160M"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Device 1: "Ocelot PTX Emulator" [SELECTED]
Major revision number: 2
Minor revision number: 0
Total amount of global memory: 462311424 bytes
Device 2: "Ocelot Multicore CPU Backend (LLVM-JIT)"
Major revision number: 2
Minor revision number: 0
Total amount of global memory: 462200832 bytes
Running 4 unit tests.
Running TestRadixSort...terminate called after throwing an instance of
'hydrazine::Exception'
what(): [PC 112] [thread 255] [cta 0] ld.shared.u64 %r85, [_ZN6thrust6detail6device4cuda6detail5sMem2E + 2044] - Memory access 0x7fc is not aligned to the access size ( 8 bytes )
Near
/usr/local/cuda/bin/../include/thrust/detail/device/cuda/detail/stable_radix_sort.inl:420:0
Aborted
Original comment by wnbell
on 3 Jul 2010 at 3:19
Hi Nathan,
Based on Ocelot's report that the race condition occurred at a shared memory
store instruction, inserting a __syncthreads() at line 876 of
stable_radix_sort.inl fixed the race condition. At least, Ocelot no longer
complains about it.
Does this make sense?
Original comment by ryu...@gmail.com
on 16 Jul 2010 at 3:09
FYI, I found the same race condition at line 473 of stable_radix_sort.inl.
Inserting a __syncthreads() at line 467 fixed the problem.
Original comment by ryu...@gmail.com
on 17 Jul 2010 at 3:04
Hi Nathan,
It turns out that the race condition in this radix sort implementation was a
false positive caused by Ocelot:
http://groups.google.com/group/gpuocelot/browse_thread/thread/a4ca5196152f4713?hl=en
Sorry for all the trouble.
I guess the radix sort in Thrust will soon be supplanted by Duane Merrill's
implementation, so this might not be useful after all.
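For anyone who hits this later: false positives of this kind typically come from warp-synchronous code. On the GPUs of this era, threads within a single warp execute in lockstep, so shared memory accesses between them are safe without an explicit barrier, but a race detector that only recognizes __syncthreads() as a synchronization point will still flag them. A hypothetical sketch of the pattern (assumed, not taken from the Thrust source):

```cuda
// Illustrative sketch only. Warp-synchronous reduction over 32 values:
// threads 0..31 belong to one warp and execute in lockstep on this
// hardware, so no __syncthreads() is needed between steps -- but each
// step reads a slot another thread just wrote without an intervening
// barrier, which a barrier-only race detector reports as a race.
__global__ void warp_reduce(unsigned int *out)
{
    __shared__ volatile unsigned int s[32];  // volatile: the era-correct
    unsigned int t = threadIdx.x;            // warp-synchronous idiom

    s[t] = t;

    if (t < 16) s[t] += s[t + 16];
    if (t <  8) s[t] += s[t +  8];
    if (t <  4) s[t] += s[t +  4];
    if (t <  2) s[t] += s[t +  2];
    if (t <  1) s[t] += s[t +  1];

    if (t == 0) *out = s[0];
}
```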
Original comment by ryu...@gmail.com
on 29 Jul 2010 at 12:35
I think this issue should be merged into
http://code.google.com/p/thrust/issues/detail?id=213
Original comment by ryu...@gmail.com
on 14 Sep 2010 at 8:09
I think this must be fixed by now. Thrust 1.6 uses Merrill's sort.
Original comment by jaredhoberock
on 7 May 2012 at 9:16
Original issue reported on code.google.com by
jaredhoberock
on 27 Jan 2010 at 8:39