vectorch-ai / ScaleLLM

A high-performance inference system for large language models, designed for production environments.
https://docs.vectorch.com/
Apache License 2.0

Use local model instead of HF_MODEL_ID #21

Open iamxudada opened 10 months ago

iamxudada commented 10 months ago

Use local model instead of HF_MODEL_ID

guocuimi commented 10 months ago

Thanks for using ScaleLLM. You can definitely use your local model by providing two additional gflags, --model_id and --model_path. Here is one example:

export MODEL_PATH=/YOUR/LOCAL/MODEL/PATH
export MODEL_ID=TheBloke/Llama-2-7B-chat-AWQ
docker run -it --gpus=all --net=host \
  -v $MODEL_PATH:$MODEL_PATH \
  -e DEVICE=cuda:0 \
  docker.io/vectorchai/scalellm:latest --logtostderr --model_path=$MODEL_PATH --model_id=$MODEL_ID
iamxudada commented 10 months ago

Oh, thank you. A new problem I found: if you use three graphics cards (odd numbers greater than 3 have not been tested; the cards are Tesla P100 PCIe), the following error occurs:

I20231130 01:28:44.676849     7 main.cpp:135] Using devices: cuda:4,cuda:5,cuda:6
W20231130 01:28:52.751839     7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
I20231130 01:28:52.752111     7 engine.cpp:91] Initializing model from: /models/Yi-34B-Chat-4bits
W20231130 01:28:52.752171     7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
I20231130 01:28:53.237169     7 engine.cpp:98] Initializing model with dtype: Half
I20231130 01:28:53.237226     7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
F20231130 01:28:53.237545    23 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3
*** Check failure stack trace: ***
F20231130 01:28:53.237592    24 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3F20231130 01:28:53.237596    25 embedding.h:80] Check failed: embedding_dim % world_size == 0 out_features 7168 not divisible by world_size 3
*** Check failure stack trace: ***
*** Aborted at 1701307733 (unix time) try "date -d @1701307733" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7) received by PID 7 (TID 0x7f348affd000) from PID 7; stack trace: ***
    @     0x7f35733ee520 (unknown)
    @     0x7f35734429fc pthread_kill
    @     0x7f35733ee476 raise
    @     0x7f35733d47f3 abort
    @     0x55ea09f94743 folly::(anonymous namespace)::wrapped_abort()
    @     0x55ea09f4a33f google::LogMessage::Fail()
    @     0x55ea09f4a285 google::LogMessage::SendToLog()
    @     0x55ea09f49a95 google::LogMessage::Flush()
    @     0x55ea09f4d874 google::LogMessageFatal::~LogMessageFatal()
    @     0x55ea0a23f681 llm::ParallelEmbeddingImpl::ParallelEmbeddingImpl()
    @     0x55ea0a275dc6 torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a24480c _ZN3llm17ParallelEmbeddingCI1N5torch2nn12ModuleHolderINS_21ParallelEmbeddingImplEEEIRKlJS8_RKNS_12ParallelArgsERN3c1010ScalarTypeERKNSC_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a26d058 llm::hf::YiModelImpl::YiModelImpl()
    @     0x55ea0a284f98 torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a26ddba _ZN3llm2hf7YiModelCI2N5torch2nn12ModuleHolderINS0_11YiModelImplEEEIRKNS_9ModelArgsEJRKNS_9QuantArgsERKNS_12ParallelArgsERN3c1010ScalarTypeERKNSH_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a26df4d llm::hf::YiForCausalLMImpl::YiForCausalLMImpl()
    @     0x55ea0a2853dc torch::nn::ModuleHolder<>::ModuleHolder<>()
    @     0x55ea0a26e97c _ZN3llm2hf13YiForCausalLMCI1N5torch2nn12ModuleHolderINS0_17YiForCausalLMImplEEEIRKNS_9ModelArgsEJRKNS_9QuantArgsERKNS_12ParallelArgsERN3c1010ScalarTypeERKNSH_6DeviceEEvEEOT_DpOT0_
    @     0x55ea0a22f46e _ZZNK3llm2hfL13Yi_registeredMUlvE_clEvENKUlRKNS_9ModelArgsERKNS_9QuantArgsERKNS_12ParallelArgsEN3c1010ScalarTypeERKNSB_6DeviceEE_clES4_S7_SA_SC_SF_
    @     0x55ea0a23aee8 _ZSt13__invoke_implISt10unique_ptrIN3llm12CausalLMImplINS1_2hf13YiForCausalLMEEESt14default_deleteIS5_EERZNKS3_L13Yi_registeredMUlvE_clEvEUlRKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSJ_6DeviceEE_JSC_SF_SI_SK_SN_EET_St14__invoke_otherOT0_DpOT1_
    @     0x55ea0a238c25 _ZSt10__invoke_rISt10unique_ptrIN3llm8CausalLMESt14default_deleteIS2_EERZNKS1_2hfL13Yi_registeredMUlvE_clEvEUlRKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSH_6DeviceEE_JSA_SD_SG_SI_SL_EENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EESP_E4typeEOSQ_DpOSR_
    @     0x55ea0a235975 _ZNSt17_Function_handlerIFSt10unique_ptrIN3llm8CausalLMESt14default_deleteIS2_EERKNS1_9ModelArgsERKNS1_9QuantArgsERKNS1_12ParallelArgsEN3c1010ScalarTypeERKNSF_6DeviceEEZNKS1_2hfL13Yi_registeredMUlvE_clEvEUlS8_SB_SE_SG_SJ_E_E9_M_invokeERKSt9_Any_dataS8_SB_SE_OSG_SJ_
    @     0x55ea0a2dc8f4 std::function<>::operator()()
    @     0x55ea0a2dbb00 llm::CausalLM::create()
    @     0x55ea09f2324f llm::Worker::init_model()
    @     0x55ea09f24024 _ZZN3llm6Worker16init_model_asyncEN3c1010ScalarTypeERKNS_9ModelArgsERKNS_9QuantArgsEENUlvE_clEv
    @     0x55ea09f25466 _ZN5folly6detail8function14FunctionTraitsIFvvEE9callSmallIZN3llm6Worker16init_model_asyncEN3c1010ScalarTypeERKNS6_9ModelArgsERKNS6_9QuantArgsEEUlvE_EEvRNS1_4DataE
    @     0x55ea09e5a94d folly::detail::function::FunctionTraits<>::operator()()
    @     0x55ea0a2dd64d llm::Executor::internal_loop()
    @     0x55ea0a2dd331 _ZZN3llm8ExecutorC4EmENKUlvE_clEv
    @     0x55ea0a2dddc6 _ZSt13__invoke_implIvZN3llm8ExecutorC4EmEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @     0x55ea0a2ddd89 _ZSt8__invokeIZN3llm8ExecutorC4EmEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
./entrypoint.sh: line 28:     7 Aborted                 (core dumped) LD_LIBRARY_PATH=/app/lib:$LD_LIBRARY_PATH /app/bin/scalellm $ARGS "$@"
guocuimi commented 10 months ago

Yes, this is intentional. The model dimensions must split evenly across the GPUs, so the number of GPUs needs to be an even number and is usually a power of 2.
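For reference, here is a minimal sketch (not ScaleLLM code; the constants are copied from the ModelArgs in the log above) of the usual tensor-parallel divisibility constraint that the check failure is enforcing:

# Hypothetical pre-flight check in plain Python, using values from the log above.
# Tensor parallelism requires the sharded dimensions to split evenly across GPUs.
hidden_size = 7168   # hidden_size of Yi-34B-Chat-4bits, from ModelArgs
n_kv_heads = 8       # number of KV heads, from ModelArgs

for world_size in (2, 3, 4, 6, 8):
    ok = hidden_size % world_size == 0 and n_kv_heads % world_size == 0
    print(f"world_size={world_size}: {'ok' if ok else 'not evenly divisible'}")
# world_size=3 (and 6) is not evenly divisible, matching the
# "out_features 7168 not divisible by world_size 3" check failure above.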

iamxudada commented 10 months ago

version: '2.2'

services:
  scalellm:
    image: vectorchai/scalellm:latest
    hostname: scalellm
    container_name: scalellm
    ports:
      - 8888:8888
      - 9999:9999
    environment:
      - DEVICE=cuda:4,cuda:5,cuda:6,cuda:7
    volumes:
      - /models/Yi:/models
    shm_size: 1g
    command: --logtostderr --model_path=/models/Yi-34B-Chat-4bits --model_type=Yi
    # turn on GPU access
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  scalellm-gateway:
    image: vectorchai/scalellm-gateway:latest
    hostname: scalellm-gateway
    container_name: scalellm-gateway
    ports:
      - 8080:8080
    command: --grpc-server=scalellm:8888
    depends_on:
      - scalellm

Then run docker compose up.

Then run:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Yi-34B-Chat-4bits",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

{"error":{"code":14,"message":"error reading from server: EOF"}}

That failed, so I restarted with docker compose up.

Then ran:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "01-ai/Yi-34B-Chat-4bits",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

  {"error":{"code":14,"message":"error reading from server: EOF"}}

docker compose logs

  [+] Running 2/0
 ✔ Container scalellm          Created                                                                                                                                                                                        0.0s 
 ✔ Container scalellm-gateway  Created                                                                                                                                                                                        0.0s 
Attaching to scalellm, scalellm-gateway
scalellm-gateway  | I1130 02:11:17.453270       1 main.go:38] Register grpc server at scalellm:8888
scalellm-gateway  | I1130 02:11:17.453674       1 main.go:53] Starting HTTP server at 0.0.0.0:8080 ...
scalellm          | I20231130 02:11:17.688066     7 main.cpp:135] Using devices: cuda:4,cuda:5,cuda:6,cuda:7
scalellm          | W20231130 02:11:28.074213     7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
scalellm          | I20231130 02:11:28.074481     7 engine.cpp:91] Initializing model from: /models/Yi-34B-Chat-4bits
scalellm          | W20231130 02:11:28.074532     7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
scalellm          | I20231130 02:11:28.554064     7 engine.cpp:98] Initializing model with dtype: Half
scalellm          | I20231130 02:11:28.554114     7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
scalellm          | I20231130 02:11:29.408568     7 model_loader.cpp:38] Loading model weights from /models/Yi-34B-Chat-4bits/model-00001-of-00002.safetensors
scalellm          | I20231130 02:11:32.559237     7 model_loader.cpp:38] Loading model weights from /models/Yi-34B-Chat-4bits/model-00002-of-00002.safetensors
scalellm          | I20231130 02:11:35.440680     7 engine.cpp:162] Initializing kv cache with block size: 16, max cache size: 5.00 GB, max memory utilization: 0.9
scalellm          | I20231130 02:11:35.440773     7 engine.cpp:179] Block size in bytes: 960.00 KB, block_size: 16, head_dim: 128, n_local_kv_heads: 2, n_layers: 60, dtype_size: 2
scalellm          | I20231130 02:11:35.441365     7 engine.cpp:198] cuda:4: allocated GPU memory: 4.63 GB, total GPU memory: 15.89 GB
scalellm          | I20231130 02:11:35.441399     7 engine.cpp:211] Initializing CUDA cache with max cache size: 5.00 GB
scalellm          | I20231130 02:11:35.441412     7 engine.cpp:219] Initializing kv cache with num blocks: 5461, block size: 16
scalellm          | I20231130 02:11:35.441426     7 engine.cpp:228] Initializing kv cache with key shape: [5461 2 16 16 8], value shape: [5461 2 128 16]
scalellm          | I20231130 02:11:36.937122     7 grpc_server.cpp:34] Started grpc server on 0.0.0.0:8888
scalellm          | I20231130 02:11:36.937489     7 http_server.cpp:71] Started http server on 0.0.0.0:9999
scalellm          | terminate called after throwing an instance of 'c10::Error'
scalellm          | terminate called recursively
scalellm          | *** Aborted at 1701310312 (unix time) try "date -d @1701310312" if you are using GNU date ***
scalellm          | PC: @                0x0 (unknown)
scalellm          | *** SIGABRT (@0x7) received by PID 7 (TID 0x7fa9ff5ff000) from PID 7; stack trace: ***
scalellm          |     @     0x7faaf4dee520 (unknown)
scalellm          |     @     0x7faaf4e429fc pthread_kill
scalellm          |   what():  CUDA error: no kernel image is available for execution on the device
scalellm          | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
scalellm          | For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
scalellm          | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
scalellm          | 
scalellm          | Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
scalellm          | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faafc46e38b in /app/lib/libc10.so)
scalellm          | frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7faafc468f3f in /app/lib/libc10.so)
scalellm          | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x58f (0x7faafc851eef in /app/lib/libc10_cuda.so)
scalellm          | frame #3: void at::native::gpu_reduce_kernel<c10::Half, c10::Half, 4, at::native::func_wrapper_t<c10::Half, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (at::native::sum_functor<c10::Half, float, c10::Half>::*)(at::TensorIterator&), &at::native::sum_functor<c10::Half, float, c10::Half>::operator(), 1u>, float (float, float)> >, double>(at::TensorIterator&, at::native::func_wrapper_t<c10::Half, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (at::native::sum_functor<c10::Half, float, c10::Half>::*)(at::TensorIterator&), &at::native::sum_functor<c10::Half, float, c10::Half>::operator(), 1u>, float (float, float)> > const&, double, at::native::AccumulationBuffer*, long) + 0x827 (0x7faafef9d7c7 in /app/lib/libtorch_cuda.so)
scalellm          | frame #4: <unknown function> + 0x2389d95 (0x7faafef89d95 in /app/lib/libtorch_cuda.so)
scalellm          | frame #5: at::native::structured_sum_out::impl(at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>, at::Tensor const&) + 0xa1 (0x7fab4fbf57d1 in /app/lib/libtorch_cpu.so)
scalellm          | frame #6: <unknown function> + 0x2f45dfc (0x7faaffb45dfc in /app/lib/libtorch_cuda.so)
scalellm          | frame #7: <unknown function> + 0x2f45eb2 (0x7faaffb45eb2 in /app/lib/libtorch_cuda.so)
scalellm          | frame #8: at::_ops::sum_dim_IntList::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>) + 0xa8 (0x7fab5051e2b8 in /app/lib/libtorch_cpu.so)
scalellm          | frame #9: <unknown function> + 0x3c3d1f6 (0x7fab5203d1f6 in /app/lib/libtorch_cpu.so)
scalellm          | frame #10: <unknown function> + 0x3c3d6f5 (0x7fab5203d6f5 in /app/lib/libtorch_cpu.so)
scalellm          | frame #11: at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::optional<c10::ScalarType>) + 0x198 (0x7fab505846b8 in /app/lib/libtorch_cpu.so)
scalellm          | frame #12: <unknown function> + 0x87084c (0x5637e5c2584c in /app/bin/scalellm)
scalellm          | frame #13: <unknown function> + 0x86f493 (0x5637e5c24493 in /app/bin/scalellm)
scalellm          | frame #14: <unknown function> + 0x6f5c74 (0x5637e5aaac74 in /app/bin/scalellm)
scalellm          | frame #15: <unknown function> + 0x6e92c3 (0x5637e5a9e2c3 in /app/bin/scalellm)
scalellm          | frame #16: <unknown function> + 0x604946 (0x5637e59b9946 in /app/bin/scalellm)
scalellm          | frame #17: <unknown function> + 0x5fb7ee (0x5637e59b07ee in /app/bin/scalellm)
scalellm          | frame #18: <unknown function> + 0x61475a (0x5637e59c975a in /app/bin/scalellm)
scalellm          | frame #19: <unknown function> + 0x5fc86e (0x5637e59b186e in /app/bin/scalellm)
scalellm          | frame #20: <unknown function> + 0x614dfa (0x5637e59c9dfa in /app/bin/scalellm)
scalellm          | frame #21: <unknown function> + 0x5fd663 (0x5637e59b2663 in /app/bin/scalellm)
scalellm          | frame #22: <unknown function> + 0x61528a (0x5637e59ca28a in /app/bin/scalellm)
scalellm          | frame #23: <unknown function> + 0x5fe2c3 (0x5637e59b32c3 in /app/bin/scalellm)
scalellm          | frame #24: <unknown function> + 0x666b1e (0x5637e5a1bb1e in /app/bin/scalellm)
scalellm          | frame #25: <unknown function> + 0x2b39a4 (0x5637e56689a4 in /app/bin/scalellm)
scalellm          | frame #26: <unknown function> + 0x2b3cd0 (0x5637e5668cd0 in /app/bin/scalellm)
scalellm          | frame #27: <unknown function> + 0x2b53c0 (0x5637e566a3c0 in /app/bin/scalellm)
scalellm          | frame #28: <unknown function> + 0x1ea94d (0x5637e559f94d in /app/bin/scalellm)
scalellm          | frame #29: <unknown function> + 0x66d64d (0x5637e5a2264d in /app/bin/scalellm)
scalellm          | frame #30: <unknown function> + 0x66d331 (0x5637e5a22331 in /app/bin/scalellm)
scalellm          | frame #31: <unknown function> + 0x66ddc6 (0x5637e5a22dc6 in /app/bin/scalellm)
scalellm          | frame #32: <unknown function> + 0x66dd89 (0x5637e5a22d89 in /app/bin/scalellm)
scalellm          | frame #33: <unknown function> + 0x66dd36 (0x5637e5a22d36 in /app/bin/scalellm)
scalellm          | frame #34: <unknown function> + 0x66dd0a (0x5637e5a22d0a in /app/bin/scalellm)
scalellm          | frame #35: <unknown function> + 0x66dcee (0x5637e5a22cee in /app/bin/scalellm)
scalellm          | frame #36: <unknown function> + 0xdc253 (0x7faaf50b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
scalellm          | frame #37: <unknown function> + 0x94ac3 (0x7faaf4e40ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
scalellm          | frame #38: clone + 0x44 (0x7faaf4ed1bf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
scalellm          | 
scalellm          | terminate called recursively
scalellm          | terminate called recursively
scalellm          |     @     0x7faaf4dee476 raise
scalellm          |     @     0x7faaf4dd47f3 abort
scalellm          |     @     0x7faaf508442a __gnu_cxx::__verbose_terminate_handler()
scalellm          |     @     0x7faaf508220c (unknown)
scalellm          |     @     0x7faaf5082277 std::terminate()
scalellm          |     @     0x7faaf50824d8 __cxa_throw
scalellm          |     @     0x5637e56bb37e __cxa_throw
scalellm          |     @     0x7faafc468f6b c10::detail::torchCheckFail()
scalellm          |     @     0x7faafc851eef c10::cuda::c10_cuda_check_implementation()
scalellm          |     @     0x7faafef9d7c7 at::native::gpu_reduce_kernel<>()
scalellm          |     @     0x7faafef89d95 at::native::sum_kernel_cuda()
scalellm          |     @     0x7fab4fbf57d1 at::native::structured_sum_out::impl()
scalellm          |     @     0x7faaffb45dfc at::(anonymous namespace)::wrapper_CUDA_sum_dim_IntList()
scalellm          |     @     0x7faaffb45eb2 c10::impl::wrap_kernel_functor_unboxed_<>::call()
scalellm          |     @     0x7fab5051e2b8 at::_ops::sum_dim_IntList::redispatch()
scalellm          |     @     0x7fab5203d1f6 torch::autograd::VariableType::(anonymous namespace)::sum_dim_IntList()
scalellm          |     @     0x7fab5203d6f5 c10::impl::wrap_kernel_functor_unboxed_<>::call()
scalellm          |     @     0x7fab505846b8 at::_ops::sum_dim_IntList::call()
scalellm          |     @     0x5637e5c2584c at::Tensor::sum()
scalellm          |     @     0x5637e5c24493 gemm_forward_cuda()
scalellm          |     @     0x5637e5aaac74 llm::ColumnParallelQLinearAWQImpl::quant_matmul()
scalellm          |     @     0x5637e5a9e2c3 llm::ColumnParallelQLinearImpl::forward()
scalellm          |     @     0x5637e59b9946 torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b07ee llm::hf::YiAttentionImpl::forward()
scalellm          |     @     0x5637e59c975a torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b186e llm::hf::YiDecoderLayerImpl::forward()
scalellm          |     @     0x5637e59c9dfa torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b2663 llm::hf::YiModelImpl::forward()
scalellm          |     @     0x5637e59ca28a torch::nn::ModuleHolder<>::operator()<>()
scalellm          |     @     0x5637e59b32c3 llm::hf::YiForCausalLMImpl::forward()
scalellm-gateway  | E1130 02:11:52.934136       1 forwarder.go:38] Failed to receive a response: rpc error: code = Unavailable desc = error reading from server: EOF
scalellm          | ./entrypoint.sh: line 28:     7 Aborted                 (core dumped) LD_LIBRARY_PATH=/app/lib:$LD_LIBRARY_PATH /app/bin/scalellm $ARGS "$@"
scalellm exited with code 134
guocuimi commented 10 months ago

The CUDA error: no kernel image is available for execution on the device means your GPU is too old to run the model inference. Please note that, for now, ScaleLLM only supports GPUs newer than the Turing architecture (>= SM80). The Tesla P100 is sm60.
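As a quick diagnostic (not part of ScaleLLM; it only assumes a working PyTorch installation on the host), you can print the compute capability of each visible GPU to confirm what it supports:

# Print the compute capability (SM version) of each visible CUDA device.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)}: sm_{major}{minor}")
# A Tesla P100 reports sm_60, below the architectures the prebuilt image targets.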

iamxudada commented 10 months ago

OK, thank you.