CUDA - Githubissues

GXTX commented 10 years ago

Implement CUDA support.

madsbuvi commented 10 years ago

No. The OpenCL version is what you'll get from me.

CUDA support will not help older Nvidia GPUs perform better. If it helped performance in any way whatsoever, that would be wholly due to a lackluster OpenCL implementation from Nvidia, in which case you should send them a letter of complaint.

Edit: In addition, CUDA would make runtime recompilation difficult. I need runtime recompilation to avoid indirections, which would kill performance.
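To illustrate the point about indirections, here is a minimal Python sketch of the idea. It is not the actual MTY_CL code. It assumes the traditional crypt(3) salting rule, where salt bit i swaps entries i and i+24 of the DES expansion table, and it generates OpenCL-style source text with the salted indices written in as literals. The names `salted_expansion`, `kernel_source_fragment`, `x`, `k` and `r0`..`r31` are made up for illustration.

```python
# Standard DES E expansion table, as 0-based indices into the 32-bit half block.
E_BASE = [
    31, 0, 1, 2, 3, 4,
    3, 4, 5, 6, 7, 8,
    7, 8, 9, 10, 11, 12,
    11, 12, 13, 14, 15, 16,
    15, 16, 17, 18, 19, 20,
    19, 20, 21, 22, 23, 24,
    23, 24, 25, 26, 27, 28,
    27, 28, 29, 30, 31, 0,
]


def salted_expansion(salt12):
    """Return the E table after applying a 12-bit crypt(3)-style salt.

    Assumption: salt bit i swaps table entries i and i + 24.
    """
    e = list(E_BASE)
    for i in range(12):
        if (salt12 >> i) & 1:
            e[i], e[i + 24] = e[i + 24], e[i]
    return e


def kernel_source_fragment(salt12):
    """Emit hypothetical OpenCL-style lines with the salted indices baked in.

    Each source operand is a named scalar (r0..r31), so the generated code
    contains no table lookup at all for this salt.
    """
    e = salted_expansion(salt12)
    lines = [f"    x[{j}] = r{src} ^ k[{j}];" for j, src in enumerate(e)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Show the first three generated lines for a salt with only bit 0 set.
    print("\n".join(kernel_source_fragment(1).splitlines()[:3]))
```

With the salt fixed when the kernel is compiled, every operand is a plain scalar that the compiler can typically keep in a register. Looking the indices up in a table at runtime would instead require indexable storage, which is the indirection to avoid. The price is that each new salt needs a freshly generated and compiled kernel.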

GXTX commented 10 years ago

Oh well. I guess I'll stick to SHA1 Tripper, even though I can't use its tripcodes on any good *chan. Ah well.

madsbuvi commented 10 years ago

There exist CPU-only versions. You could try the win32 or x64 version at http://sourceforge.jp/projects/naniya/releases/

My version should still be able to obtain some acceleration from a 670, though. Are there any specific issues other than the relatively low speed?

GXTX commented 10 years ago

This isn't really an issue, just a feature request. No one seems interested in creating a CUDA tripminer.

madsbuvi commented 10 years ago

I don't know what the difference is between a "tripminer" and what this program is.

Basically, CUDA and OpenCL are, for Nvidia GPUs, just two different programming APIs for the same thing. The performance I can achieve with CUDA, I can also achieve with OpenCL; the only difference is that a CUDA version would not be portable to other architectures. CUDA is generally nicer to program in than OpenCL, but the need for runtime recompilation means that this advantage of CUDA over OpenCL is lost.

If your Nvidia GPU has poor performance in the OpenCL version, it wouldn't perform any better with CUDA.

The reason older Nvidia cards (think pre-780/Titan, or all cards with compute capability 1.#-3.0) perform poorly is that the DES algorithm, which this type of tripcode uses, maps extremely poorly to their architecture.

For best performance, ALL data needs to be in registers (some 130+ 32-bit integers).
Older Nvidia cards only support 64 registers per thread.
So 66+ integers need to be put elsewhere.

Where "elsewhere" is a choice between:

spilling them into global memory, which makes performance bandwidth-bound, or
putting them in "shared" memory, which limits performance through poor occupancy.

While shared memory is far faster than global memory, occupancy gets so limited that global memory is still preferred. So I chose the first option, which still makes for a slow searcher.
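The trade-off above can be put into rough numbers. The sketch below is a back-of-the-envelope estimate only: the 130 integers and 64 registers come from this comment, while the 48 KiB of shared memory and 2048 resident threads per multiprocessor are assumed figures for a compute 3.0 class card. It also pretends that all 64 registers are free to hold state, which flatters the result. The function names are made up for illustration.

```python
WORD_BYTES = 4
STATE_WORDS = 130        # "some 130+ 32-bit integers", from the comment above
REGS_PER_THREAD = 64     # per-thread register limit, from the comment above

# Assumed hardware figures for a compute 3.0 class card (not from the thread).
SHARED_BYTES_PER_SM = 48 * 1024
MAX_THREADS_PER_SM = 2048
WARP_SIZE = 32


def spilled_words(state_words=STATE_WORDS, regs=REGS_PER_THREAD):
    """Number of 32-bit words that do not fit in registers."""
    return max(0, state_words - regs)


def shared_memory_occupancy():
    """Estimate resident threads per multiprocessor if spills go to shared memory.

    Returns (threads, fraction of the assumed hardware maximum).
    """
    spill_bytes = spilled_words() * WORD_BYTES
    if spill_bytes == 0:
        return MAX_THREADS_PER_SM, 1.0
    threads = SHARED_BYTES_PER_SM // spill_bytes
    threads -= threads % WARP_SIZE          # round down to whole warps
    threads = min(threads, MAX_THREADS_PER_SM)
    return threads, threads / MAX_THREADS_PER_SM


if __name__ == "__main__":
    threads, fraction = shared_memory_occupancy()
    print(spilled_words(), "words spilled per thread")
    print(threads, "threads per multiprocessor,", f"{fraction:.1%}", "occupancy")
```

Under these assumptions each thread spills 66 words (264 bytes), so only 160 threads fit per multiprocessor, about 7.8% of the assumed maximum. That is the poor occupancy that makes the shared memory option lose out.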

Using CUDA would not offer any solution to this.

I hope this clears up why nobody bothers to make a CUDA miner.

Edit: Majestic typos and grammar.

madsbuvi / MTY_CL

CUDA #10