vectorch-ai / ScaleLLM

A high-performance inference system for large language models, designed for production environments.
https://docs.vectorch.com/
Apache License 2.0

Use local model instead of HF_MODEL_ID #21

Open iamxudada opened 10 months ago

iamxudada commented 10 months ago

Use local model instead of HF_MODEL_ID

guocuimi commented 10 months ago

Thanks for using ScaleLLM. You can definitely use your local model by providing two additional gflags, --model_id and --model_path. Here is one example:

export MODEL_PATH=/YOUR/LOCAL/MODEL/PATH
export MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ
docker run -it --gpus=all --net=host \
  -v $MODEL_PATH:$MODEL_PATH \
  -e DEVICE=cuda:0 \
  docker.io/vectorchai/scalellm:latest --logtostderr --model_path=$MODEL_PATH --model_id=$MODEL_ID
iamxudada commented 10 months ago

Oh, thank you. A new problem I found: if you use three graphics cards (odd numbers greater than 3 have not been tested; the cards are Tesla P100 PCIe), the following error occurs:

I20231130 01:28:44.676849     7 main.cpp:135] Using devices: cuda:4,cuda:5,cuda:6
W20231130 01:28:52.751839     7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
I20231130 01:28:52.752111     7 engine.cpp:91] Initializing model from: /models/Yi-34B-Chat-4bits
W20231130 01:28:52.752171     7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
I20231130 01:28:53.237169     7 engine.cpp:98] Initializing model with dtype: Half
I20231130 01:28:53.237226     7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
F20231130 01:28:53.237545    23 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3
*** Check failure stack trace: ***
F20231130 01:28:53.237592    24 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3F20231130 01:28:53.237596    25 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3
*** Check failure stack trace: ***
*** Aborted at 1701307733 (unix time) try "date -d @1701307733" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7) received by PID 7 (TID 0x7f348affd000) from PID 7; stack trace: ***
    @     0x7f35733ee520 (unknown)
    @     0x7f35734429fc pthread_kill
    @     0x7f35733ee476 raise
    @     0x7f35733d47f3 abort
    @     0x55ea09f94743 folly::(anonymous namespace)::wrapped_abort()
    @     0x55ea09f4a33f google::LogMessage::Fail()
    @     0x55ea09f4a285 google::LogMessage::SendToLog()
    @     0x55ea09f49a95 google::LogMessage::Flush()
    @     0x55ea09f4d874 google::LogMessageFatal::~LogMessageFatal()
    @     0x55ea0a23f681 llm::ParallelEmbeddingImpl::ParallelEmbeddingImpl()
    @     0x55ea0a275dc6 torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a24480c _ZN3llm17ParallelEmbeddingCI1N5torch2nn12ModuleHolderINS_21ParallelEmbeddingImplEEEIRKlJS8_RKNS_12ParallelArgsERN3c1010ScalarTypeERKNSC_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a26d058 llm::hf::YiModelImpl::YiModelImpl()
    @     0x55ea0a284f98 torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a26ddba _ZN3llm2hf7YiModelCI2N5torch2nn12ModuleHolderINS0_11YiModelImplEEEIRKNS_9ModelArgsEJRKNS_9QuantArgsERKNS_12ParallelArgsERN3c1010ScalarTypeERKNSH_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a26df4d llm::hf::YiForCausalLMImpl::YiForCausalLMImpl()
    @     0x55ea0a2853dc torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a26e97c _ZN3llm2hf13YiForCausalLMCI1N5torch2nn12ModuleHolderINS0_17YiForCausalLMImplEEEIRKNS_9ModelArgsEJRKNS_9QuantArgsERKNS_12ParallelArgsERN3c1010ScalarTypeERKNSH_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a22f46e _ZZNK3llm2hfL13Yi_registeredMUlvE_clEvENKUlRKNS_9ModelArgsERKNS_9QuantArgsERKNS_12ParallelArgsEN3c1010ScalarTypeERKNSB_6DeviceEE_clES4_S7_SA_SC_SF_
    @     0x55ea0a23aee8 _ZSt13__invoke_implISt10unique_ptrIN3llm12CausalLMImplINS1_2hf13YiForCausalLMEEESt14default_deleteIS5_EERZNKS3_L13Yi_registeredMUlvE_clEvEUlRKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSJ_6DeviceEE_JSC_SF_SI_SK_SN_EET_St14__invoke_otherOT0_DpOT1_
    @     0x55ea0a238c25 _ZSt10__invoke_rISt10unique_ptrIN3llm8CausalLMESt14default_deleteIS2_EERZNKS1_2hfL13Yi_registeredMUlvE_clEvEUlRKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSH_6DeviceEE_JSA_SD_SG_SI_SL_EENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EESP_E4typeEOSQ_DpOSR_
    @     0x55ea0a235975 _ZNSt17_Function_handlerIFSt10unique_ptrIN3llm8CausalLMESt14default_deleteIS2_EERKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSF_6DeviceEEZNKS1_2hfL13Yi_registeredMUlvE_clEvEUlS8_SB_SE_SG_SJ_E_E9_M_invokeERKSt9_Any_dataS8_SB_SE_OSG_SJ_
    @     0x55ea0a2dc8f4 std::function<>::operator()()
    @     0x55ea0a2dbb00 llm::CausalLM::create()
    @     0x55ea09f2324f llm::Worker::init_model()
    @     0x55ea09f24024 _ZZN3llm6Worker16init_model_asyncEN3c1010ScalarTypeERKNS_9ModelArgsERKNS_9QuantArgsEENUlvE_clEv
    @     0x55ea09f25466 _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallIZN3llm6Worker16init_model_asyncEN3c1010ScalarTypeERKNS6_9ModelArgsERKNS6_9QuantArgsEEUlvE_EEvRNS1_4DataE
    @     0x55ea09e5a94d folly::detail::function::FunctionTraits<>::operator()()
    @     0x55ea0a2dd64d llm::Executor::internal_loop()
    @     0x55ea0a2dd331 _ZZN3llm8ExecutorC4EmENKUlvE_clEv
    @     0x55ea0a2dddc6 _ZSt13__invoke_implIvZN3llm8ExecutorC4EmEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @     0x55ea0a2ddd89 _ZSt8__invokeIZN3llm8ExecutorC4EmEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
./entrypoint.sh: line 28:     7 Aborted                 (core dumped) LD_LIBRARY_PATH=/app/lib:$LD_LIBRARY_PATH /app/bin/scalellm $ARGS "$@"
guocuimi commented 10 months ago

Yes, this is intentional. The model dimensions must split evenly across the GPUs, so the number of GPUs needs to be an even number and is usually a power of 2.
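For reference, here is a minimal sketch (not ScaleLLM code; the constants are copied from the ModelArgs in the log above) of the usual tensor-parallel divisibility constraint that the check failure is enforcing:

# Hypothetical pre-flight check in plain Python, using values from the log above.
# Tensor parallelism requires the sharded dimensions to split evenly across GPUs.
hidden_size = 7168   # hidden_size of Yi-34B-Chat-4bits, from ModelArgs
n_kv_heads = 8       # number of KV heads, from ModelArgs

for world_size in (2, 3, 4, 6, 8):
    ok = hidden_size % world_size == 0 and n_kv_heads % world_size == 0
    print(f"world_size={world_size}: {'ok' if ok else 'not evenly divisible'}")
# world_size=3 (and 6) is not evenly divisible, matching the
# "out_features 7168 not divisible by world_size 3" check failure above.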

iamxudada commented 10 months ago

version: '2.2'

services:
  scalellm:
    image: vectorchai/scalellm:latest
    hostname: scalellm
    container_name: scalellm
    ports:
      - 8888:8888
      - 9999:9999
    environment:
      - DEVICE=cuda:4,cuda:5,cuda:6,cuda:7
    volumes:
      - /models/Yi:/models
    shm_size: 1g
    command: --logtostderr --model_path=/models/Yi-34B-Chat-4bits --model_type=Yi
    # turn on GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  scalellm-gateway:
    image: vectorchai/scalellm-gateway:latest
    hostname: scalellm-gateway
    container_name: scalellm-gateway
    ports:
      - 8080:8080
    command: --grpc-server=scalellm:8888
    depends_on:
      - scalellm

Then run docker compose up.

Then run:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Yi-34B-Chat-4bits",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

{"error":{"code":14,"message":"error reading from server: EOF"}}

That failed, so I restarted with docker compose up.

Then ran:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-34B-Chat-4bits",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

  {"error":{"code":14,"message":"error reading from server: EOF"}}

docker compose logs

  [+] Running 2/0
 ✔ Container scalellm          Created                                                                                                                                                                                        0.0s 
 ✔ Container scalellm-gateway  Created                                                                                                                                                                                        0.0s 
Attaching to scalellm, scalellm-gateway
scalellm-gateway  | I1130 02:11:17.453270       1 main.go:38] Register grpc server at scalellm:8888
scalellm-gateway  | I1130 02:11:17.453674       1 main.go:53] Starting HTTP server at 0.0.0.0:8080 ...
scalellm          | I20231130 02:11:17.688066     7 main.cpp:135] Using devices: cuda:4,cuda:5,cuda:6,cuda:7
scalellm          | W20231130 02:11:28.074213     7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
scalellm          | I20231130 02:11:28.074481     7 engine.cpp:91] Initializing model from: /models/Yi-34B-Chat-4bits
scalellm          | W20231130 02:11:28.074532     7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
scalellm          | I20231130 02:11:28.554064     7 engine.cpp:98] Initializing model with dtype: Half
scalellm          | I20231130 02:11:28.554114     7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
scalellm          | I20231130 02:11:29.408568     7 model_loader.cpp:38] Loading model weights from /models/Yi-34B-Chat-4bits/model-00001-of-00002.safetensors
scalellm          | I20231130 02:11:32.559237     7 model_loader.cpp:38] Loading model weights from /models/Yi-34B-Chat-4bits/model-00002-of-00002.safetensors
scalellm          | I20231130 02:11:35.440680     7 engine.cpp:162] Initializing kv cache with block size: 16, max cache size: 5.00 GB, max memory utilization: 0.9
scalellm          | I20231130 02:11:35.440773     7 engine.cpp:179] Block size in bytes: 960.00 KB, block_size: 16, head_dim: 128, n_local_kv_heads: 2, n_layers: 60, dtype_size: 2
scalellm          | I20231130 02:11:35.441365     7 engine.cpp:198] cuda:4: allocated GPU memory: 4.63 GB, total GPU memory: 15.89 GB
scalellm          | I20231130 02:11:35.441399     7 engine.cpp:211] Initializing CUDA cache with max cache size: 5.00 GB
scalellm          | I20231130 02:11:35.441412     7 engine.cpp:219] Initializing kv cache with num blocks: 5461, block size: 16
scalellm          | I20231130 02:11:35.441426     7 engine.cpp:228] Initializing kv cache with key shape: [5461 2 16 16 8], value shape: [5461 2 128 16]
scalellm          | I20231130 02:11:36.937122     7 grpc_server.cpp:34] Started grpc server on 0.0.0.0:8888
scalellm          | I20231130 02:11:36.937489     7 http_server.cpp:71] Started http server on 0.0.0.0:9999
scalellm          | terminate called after throwing an instance of 'c10::Error'
scalellm          | terminate called recursively
scalellm          | *** Aborted at 1701310312 (unix time) try "date -d @1701310312" if you are using GNU date ***
scalellm          | PC: @                0x0 (unknown)
scalellm          | *** SIGABRT (@0x7) received by PID 7 (TID 0x7fa9ff5ff000) from PID 7; stack trace: ***
scalellm          |     @     0x7faaf4dee520 (unknown)
scalellm          |     @     0x7faaf4e429fc pthread_kill
scalellm          |   what():  CUDA error: no kernel image is available for execution on the device
scalellm          | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
scalellm          | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
scalellm          | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
scalellm          | 
scalellm          | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
scalellm          | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faafc46e38b in /app/lib/libc10.so)
scalellm          | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7faafc468f3f in /app/lib/libc10.so)
scalellm          | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x58f (0x7faafc851eef in /app/lib/libc10_cuda.so)
scalellm          | frame #3: void at::native::gpu_reduce_kernel<c10::Half, c10::Half, 4, at::native::func_wrapper_t<c10::Half, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (at::native::sum_functor<c10::Half, float, c10::Half>::*)(at::TensorIterator&), &at::native::sum_functor<c10::Half, float, c10::Half>::operator(), 1u>, float (float, float)> >, double>(at::TensorIterator&, at::native::func_wrapper_t<c10::Half, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (at::native::sum_functor<c10::Half, float, c10::Half>::*)(at::TensorIterator&), &at::native::sum_functor<c10::Half, float, c10::Half>::operator(), 1u>, float (float, float)> > const&, double, at::native::AccumulationBuffer*, long) + 0x827 (0x7faafef9d7c7 in /app/lib/libtorch_cuda.so)
scalellm          | frame #4: <unknown function> + 0x2389d95 (0x7faafef89d95 in /app/lib/libtorch_cuda.so)
scalellm          | frame #5: at::native::structured_sum_out::impl(at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>, at::Tensor const&) + 0xa1 (0x7fab4fbf57d1 in /app/lib/libtorch_cpu.so)
scalellm          | frame #6: <unknown function> + 0x2f45dfc (0x7faaffb45dfc in /app/lib/libtorch_cuda.so)
scalellm          | frame #7: <unknown function> + 0x2f45eb2 (0x7faaffb45eb2 in /app/lib/libtorch_cuda.so)
scalellm          | frame #8: at::_ops::sum_dim_IntList::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>) + 0xa8 (0x7fab5051e2b8 in /app/lib/libtorch_cpu.so)
scalellm          | frame #9: <unknown function> + 0x3c3d1f6 (0x7fab5203d1f6 in /app/lib/libtorch_cpu.so)
scalellm          | frame #10: <unknown function> + 0x3c3d6f5 (0x7fab5203d6f5 in /app/lib/libtorch_cpu.so)
scalellm          | frame #11: at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>) + 0x198 (0x7fab505846b8 in /app/lib/libtorch_cpu.so)
scalellm          | frame #12: <unknown function> + 0x87084c (0x5637e5c2584c in /app/bin/scalellm)
scalellm          | frame #13: <unknown function> + 0x86f493 (0x5637e5c24493 in /app/bin/scalellm)
scalellm          | frame #14: <unknown function> + 0x6f5c74 (0x5637e5aaac74 in /app/bin/scalellm)
scalellm          | frame #15: <unknown function> + 0x6e92c3 (0x5637e5a9e2c3 in /app/bin/scalellm)
scalellm          | frame #16: <unknown function> + 0x604946 (0x5637e59b9946 in /app/bin/scalellm)
scalellm          | frame #17: <unknown function> + 0x5fb7ee (0x5637e59b07ee in /app/bin/scalellm)
scalellm          | frame #18: <unknown function> + 0x61475a (0x5637e59c975a in /app/bin/scalellm)
scalellm          | frame #19: <unknown function> + 0x5fc86e (0x5637e59b186e in /app/bin/scalellm)
scalellm          | frame #20: <unknown function> + 0x614dfa (0x5637e59c9dfa in /app/bin/scalellm)
scalellm          | frame #21: <unknown function> + 0x5fd663 (0x5637e59b2663 in /app/bin/scalellm)
scalellm          | frame #22: <unknown function> + 0x61528a (0x5637e59ca28a in /app/bin/scalellm)
scalellm          | frame #23: <unknown function> + 0x5fe2c3 (0x5637e59b32c3 in /app/bin/scalellm)
scalellm          | frame #24: <unknown function> + 0x666b1e (0x5637e5a1bb1e in /app/bin/scalellm)
scalellm          | frame #25: <unknown function> + 0x2b39a4 (0x5637e56689a4 in /app/bin/scalellm)
scalellm          | frame #26: <unknown function> + 0x2b3cd0 (0x5637e5668cd0 in /app/bin/scalellm)
scalellm          | frame #27: <unknown function> + 0x2b53c0 (0x5637e566a3c0 in /app/bin/scalellm)
scalellm          | frame #28: <unknown function> + 0x1ea94d (0x5637e559f94d in /app/bin/scalellm)
scalellm          | frame #29: <unknown function> + 0x66d64d (0x5637e5a2264d in /app/bin/scalellm)
scalellm          | frame #30: <unknown function> + 0x66d331 (0x5637e5a22331 in /app/bin/scalellm)
scalellm          | frame #31: <unknown function> + 0x66ddc6 (0x5637e5a22dc6 in /app/bin/scalellm)
scalellm          | frame #32: <unknown function> + 0x66dd89 (0x5637e5a22d89 in /app/bin/scalellm)
scalellm          | frame #33: <unknown function> + 0x66dd36 (0x5637e5a22d36 in /app/bin/scalellm)
scalellm          | frame #34: <unknown function> + 0x66dd0a (0x5637e5a22d0a in /app/bin/scalellm)
scalellm          | frame #35: <unknown function> + 0x66dcee (0x5637e5a22cee in /app/bin/scalellm)
scalellm          | frame #36: <unknown function> + 0xdc253 (0x7faaf50b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
scalellm          | frame #37: <unknown function> + 0x94ac3 (0x7faaf4e40ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
scalellm          | frame #38: clone + 0x44 (0x7faaf4ed1bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
scalellm          | 
scalellm          | terminate called recursively
scalellm          | terminate called recursively
scalellm          |     @     0x7faaf4dee476 raise
scalellm          |     @     0x7faaf4dd47f3 abort
scalellm          |     @     0x7faaf508442a __gnu_cxx::__verbose_terminate_handler()
scalellm          |     @     0x7faaf508220c (unknown)
scalellm          |     @     0x7faaf5082277 std::terminate()
scalellm          |     @     0x7faaf50824d8 __cxa_throw
scalellm          |     @     0x5637e56bb37e __cxa_throw
scalellm          |     @     0x7faafc468f6b c10::detail::torchCheckFail()
scalellm          |     @     0x7faafc851eef c10::cuda::c10_cuda_check_implementation()
scalellm          |     @     0x7faafef9d7c7 at::native::gpu_reduce_kernel<>()
scalellm          |     @     0x7faafef89d95 at::native::sum_kernel_cuda()
scalellm          |     @     0x7fab4fbf57d1 at::native::structured_sum_out::impl()
scalellm          |     @     0x7faaffb45dfc at::(anonymous namespace)::wrapper_CUDA_sum_dim_IntList()
scalellm          |     @     0x7faaffb45eb2 c10::impl::wrap_kernel_functor_unboxed_<>::call()
scalellm          |     @     0x7fab5051e2b8 at::_ops::sum_dim_IntList::redispatch()
scalellm          |     @     0x7fab5203d1f6 torch::autograd::VariableType::(anonymous namespace)::sum_dim_IntList()
scalellm          |     @     0x7fab5203d6f5 c10::impl::wrap_kernel_functor_unboxed_<>::call()
scalellm          |     @     0x7fab505846b8 at::_ops::sum_dim_IntList::call()
scalellm          |     @     0x5637e5c2584c at::Tensor::sum()
scalellm          |     @     0x5637e5c24493 gemm_forward_cuda()
scalellm          |     @     0x5637e5aaac74 llm::ColumnParallelQLinearAWQImpl::quant_matmul()
scalellm          |     @     0x5637e5a9e2c3 llm::ColumnParallelQLinearImpl::forward()
scalellm          |     @     0x5637e59b9946 torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b07ee llm::hf::YiAttentionImpl::forward()
scalellm          |     @     0x5637e59c975a torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b186e llm::hf::YiDecoderLayerImpl::forward()
scalellm          |     @     0x5637e59c9dfa torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b2663 llm::hf::YiModelImpl::forward()
scalellm          |     @     0x5637e59ca28a torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b32c3 llm::hf::YiForCausalLMImpl::forward()
scalellm-gateway  | E1130 02:11:52.934136       1 forwarder.go:38] Failed to receive a response: rpc error: code = Unavailable desc = error reading from server: EOF
scalellm          | ./entrypoint.sh: line 28:     7 Aborted                 (core dumped) LD_LIBRARY_PATH=/app/lib:$LD_LIBRARY_PATH /app/bin/scalellm $ARGS "$@"
scalellm exited with code 134
guocuimi commented 10 months ago

The CUDA error: no kernel image is available for execution on the device means your GPU is too old to run the model inference. Please note that, for now, ScaleLLM only supports GPUs newer than the Turing architecture (>= SM80). The Tesla P100 is sm60.
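As a quick diagnostic (not part of ScaleLLM; it only assumes a working PyTorch installation on the host), you can print the compute capability of each visible GPU to confirm what it supports:

# Print the compute capability (SM version) of each visible CUDA device.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)}: sm_{major}{minor}")
# A Tesla P100 reports sm_60, below the architectures the prebuilt image targets.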

iamxudada commented 10 months ago

OK, thank you.