huggingface / candle

Minimalist ML framework for Rust

SM80 support means only expensive GPUs can be used. Can SM60 and above be supported? #2627

Open profitgrowinginnovator opened 1 week ago

profitgrowinginnovator commented 1 week ago

candle-transformers requires SM80 for llama_multiprocess and SM61 for quantized. Supporting SM60 would allow Nvidia Tesla and other SM60 cards to be used, which cost a few hundred dollars, instead of requiring cards that cost thousands of dollars even on eBay, or tens of thousands new.

Adding a `candle-kernels/src/custom_dp4a.cuh` with the following contents:

```cuda
#ifndef CUSTOM_DP4A_CUH
#define CUSTOM_DP4A_CUH

// Check if we're compiling for a CUDA architecture less than 6.1
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 610)

// Custom implementation of __dp4a for sm_60
__device__ inline int custom_dp4a(int a, int b, int c) {
    // Extract four 8-bit segments from each integer
    int a0 = (a) & 0xFF;
    int a1 = (a >> 8) & 0xFF;
    int a2 = (a >> 16) & 0xFF;
    int a3 = (a >> 24) & 0xFF;

    int b0 = (b) & 0xFF;
    int b1 = (b >> 8) & 0xFF;
    int b2 = (b >> 16) & 0xFF;
    int b3 = (b >> 24) & 0xFF;

    // Perform the dot product of the four 8-bit segments
    int dot = (a0 * b0) + (a1 * b1) + (a2 * b2) + (a3 * b3);

    // Accumulate the result with 'c'
    return c + dot;
}

// Redefine __dp4a to use custom_dp4a when compiling for sm_60
#define __dp4a(a, b, c) custom_dp4a(a, b, c)

#endif // __CUDA_ARCH__ < 610

#endif // CUSTOM_DP4A_CUH
```

and including it in `candle-kernels/src/quantized.cu` at least makes the `__dp4a` error go away:

```cuda
// Make work with CUDA 60
#include "custom_dp4a.cuh"
// end
```
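One caveat worth noting: the int overload of `__dp4a` treats each byte as a signed 8-bit value, while the fallback above extracts the bytes as unsigned, so the results differ whenever a byte is 128 or larger. A minimal sketch of a sign-correct variant is below (closer in spirit, I believe, to the fallback llama.cpp uses; the name `custom_dp4a_signed` is just illustrative):

```cuda
// Sign-correct emulation of __dp4a(int, int, int): reinterpret each 32-bit
// word as four signed 8-bit lanes so the per-byte products match the intrinsic.
__device__ inline int custom_dp4a_signed(int a, int b, int c) {
    const signed char *pa = reinterpret_cast<const signed char *>(&a);
    const signed char *pb = reinterpret_cast<const signed char *>(&b);
    int dot = 0;
#pragma unroll
    for (int i = 0; i < 4; ++i) {
        dot += static_cast<int>(pa[i]) * static_cast<int>(pb[i]);
    }
    return c + dot;
}
```

For purely unsigned inputs the difference may not show up, but for signed int8 dot products (as in the quantized kernels) the signed variant is likely the one you want.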

llama_multiprocess is harder to get working; any pointers would be really appreciated.

LaurentMazare commented 1 week ago

Thanks, I've created #2628 to support `__dp4a` on older architectures, which is similar to what you suggest (and is based on what is actually done in llama.cpp, see here). Re multiprocess, could you provide more details about which part actually doesn't work on your GPU?

profitgrowinginnovator commented 1 week ago

Many thanks! I am trying to run llama_multiprocess and get `Error: DriverError(CUDA_ERROR_NOT_FOUND, "named symbol not found")` when loading `copy2d_bf16`. Just like `__dp4a`, my guess is that `copy2d_bf16` is not supported on older architectures. However, I was not able to solve it as easily as `__dp4a`.
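My guess is that the kernel is simply never compiled for older architectures. Something like this hypothetical guard (not candle's actual source, just a sketch of the pattern) would explain the missing symbol:

```cuda
// Hypothetical sketch: a bf16 kernel guarded by an architecture check is not
// emitted at all when compiling for sm_60, so loading it by name later fails
// with CUDA_ERROR_NOT_FOUND ("named symbol not found").
#include <cuda_bf16.h>

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
extern "C" __global__ void copy2d_bf16_demo(const __nv_bfloat16 *src,
                                            __nv_bfloat16 *dst,
                                            unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}
#endif
```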

LaurentMazare commented 1 week ago

This is likely because your hardware doesn't support bfloat16. Could you try an f16 model instead, e.g. use `--which v2-7b` to get Llama 2 (which was trained in f16) rather than Llama 3? Alternatively you can try something like `--dtype f32` or `--dtype f16`, though if the model was trained with bf16 or f32, using f16 is likely to result in some NaNs (f16 has a much narrower exponent range, so large values can overflow).