vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Run long context error: CUDA error: an illegal memory access was encountered #1700

Closed shanshanpt closed 8 months ago

shanshanpt commented 1 year ago

Prompt length: 6495, max_tokens: 21000.

Benchmark command:

python benchmark_serving.py --backend=vllm --host=localhost --port=8888 --dataset=/mnt/vllm/benchmarks/fake_data --tokenizer=/mnt/disk2/lama-tokenizer --num-prompts=1

Server command:

python -m vllm.entrypoints.api_server --model=/mnt/disk2/llama-2-13b-chat-hf/ --tokenizer=/mnt/disk2/lama-tokenizer --tensor-parallel-size=2 --swap-space=64 --engine-use-ray --worker-use-ray --max-num-batched-tokens=60000
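
The request size is the crux here: the prompt alone is ~6.5k tokens and max_tokens=21000 asks for roughly 27.5k tokens in total, far beyond the model's native context window. A minimal offline sketch of the same request (an assumption-laden reproduction, not the exact benchmark path: it assumes a vLLM 0.2.x-era Python API and reuses the paths from the server command above):

# Offline reproduction sketch (assumption: vLLM 0.2.x-era API; model/tokenizer paths
# are the ones from the server command above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/disk2/llama-2-13b-chat-hf/",
    tokenizer="/mnt/disk2/lama-tokenizer",
    tensor_parallel_size=2,
    swap_space=64,
)

# ignore_eos=True keeps generating until max_tokens is exhausted, so the sequence
# grows toward prompt_len + 21000 tokens unless the engine's length check stops it.
params = SamplingParams(temperature=1.0, max_tokens=21000, ignore_eos=True)

long_prompt = "U1XiBoEelEJeEDfIAGLrf27N9d1 " * 2000  # stand-in for the ~6.5k-token prompt
outputs = llm.generate([long_prompt], params)
print(outputs[0].outputs[0].text[:200])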

INFO 11-17 08:58:33 async_llm_engine.py:371] Received request 93296c1db0b24cfbb2ee20b7208ceced: prompt: ' U1XiBoEelEJeEDfIAGLrf27N9d1********dgbZq8fXYw215vKF2k77Cjb', 
sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, 
frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, 
use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], 
ignore_eos=True, max_tokens=21000, logprobs=None, prompt_logprobs=None, 
skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: None.

Error log:
(RayWorker pid=296668) [2023-11-17 08:38:23,099 E 296668 296668] logging.cc:97: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
(RayWorker pid=296668) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorker pid=296668) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorker pid=296668) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorker pid=296668) 
(RayWorker pid=296668) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
(RayWorker pid=296668) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5ab808e4d7 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5ab805836b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5ab073bb58 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #3: <unknown function> + 0x1c36b (0x7f5ab070c36b in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #4: <unknown function> + 0x2b930 (0x7f5ab071b930 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
(RayWorker pid=296668) frame #5: <unknown function> + 0x4d46c6 (0x7f5a50b766c6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #6: <unknown function> + 0x3ee77 (0x7f5ab8073e77 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f5ab806c69e in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f5ab806c7b9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
(RayWorker pid=296668) frame #9: <unknown function> + 0x759cc8 (0x7f5a50dfbcc8 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7f5a50dfc075 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
(RayWorker pid=296668) frame #11: ray::RayWorker.execute_method() [0x5ecd90]
(RayWorker pid=296668) frame #12: ray::RayWorker.execute_method() [0x5447b8]
(RayWorker pid=296668) frame #13: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #14: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #15: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #16: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #17: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #18: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #19: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #20: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #21: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #22: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #23: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #24: ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) frame #25: <unknown function> + 0x644015 (0x7f5abceb6015 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #26: std::_Function_handler<ray::Status (ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool)>::_M_invoke(std::_Any_data const&, ray::rpc::Address const&, ray::rpc::TaskType&&, std::string&&, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*&&, std::string*&&, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string&&, bool&&, bool&&, bool&&) + 0x157 (0x7f5abcdf2547 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #27: ray::core::CoreWorker::ExecuteTask(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*) + 0xc1e (0x7f5abcfdce5e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #28: std::_Function_handler<ray::Status (ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*), std::_Bind<ray::Status (ray::core::CoreWorker::*(ray::core::CoreWorker*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>, std::_Placeholder<5>, std::_Placeholder<6>, std::_Placeholder<7>, std::_Placeholder<8>))(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*)> >::_M_invoke(std::_Any_data const&, ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*&&, bool*&&, std::string*&&) + 0x58 (0x7f5abcf117d8 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #29: <unknown function> + 0x793684 (0x7f5abd005684 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #30: <unknown function> + 0x79498a (0x7f5abd00698a in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #31: <unknown function> + 0x7ac04e (0x7f5abd01e04e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #32: ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled(ray::TaskID, ray::core::InboundRequest&) + 0x10c (0x7f5abd01f35c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #33: <unknown function> + 0x7b02cb (0x7f5abd0222cb in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #34: ray::core::ActorSchedulingQueue::Add(long, long, std::function<void (std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status const&, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>, std::string const&, std::shared_ptr<ray::FunctionDescriptorInterface> const&, ray::TaskID, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&) + 0x400 (0x7f5abd023da0 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #35: ray::core::CoreWorkerDirectTaskReceiver::HandleTask(ray::rpc::PushTaskRequest const&, ray::rpc::PushTaskReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) + 0x1216 (0x7f5abd005016 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #36: <unknown function> + 0x735e25 (0x7f5abcfa7e25 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #37: <unknown function> + 0xa59886 (0x7f5abd2cb886 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #38: <unknown function> + 0xa4b55e (0x7f5abd2bd55e in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #39: <unknown function> + 0xa4bab6 (0x7f5abd2bdab6 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #40: <unknown function> + 0x102fdbb (0x7f5abd8a1dbb in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #41: <unknown function> + 0x1031d99 (0x7f5abd8a3d99 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #42: <unknown function> + 0x10324a2 (0x7f5abd8a44a2 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #43: ray::core::CoreWorker::RunTaskExecutionLoop() + 0x1c (0x7f5abcfa6a8c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #44: ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() + 0x8c (0x7f5abcfe825c in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #45: ray::core::CoreWorkerProcess::RunTaskExecutionLoop() + 0x1d (0x7f5abcfe840d in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #46: <unknown function> + 0x57b5d7 (0x7f5abcded5d7 in /usr/local/lib/python3.8/dist-packages/ray/_raylet.so)
(RayWorker pid=296668) frame #47: ray::RayWorker.execute_method() [0x504b7b]
(RayWorker pid=296668) frame #48: _PyEval_EvalFrameDefault + 0x851 (0x56bbe1 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #49: _PyFunction_Vectorcall + 0x1b6 (0x5f5ee6 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #50: _PyEval_EvalFrameDefault + 0x851 (0x56bbe1 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #51: _PyEval_EvalCodeWithName + 0x26a (0x569d8a in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #52: PyEval_EvalCode + 0x27 (0x68e267 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #53: ray::RayWorker.execute_method() [0x67d9b1]
(RayWorker pid=296668) frame #54: ray::RayWorker.execute_method() [0x67da2f]
(RayWorker pid=296668) frame #55: ray::RayWorker.execute_method() [0x67dad1]
(RayWorker pid=296668) frame #56: PyRun_SimpleFileExFlags + 0x197 (0x67fbf7 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #57: Py_RunMain + 0x212 (0x6b8082 in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #58: Py_BytesMain + 0x2d (0x6b840d in ray::RayWorker.execute_method)
(RayWorker pid=296668) frame #59: __libc_start_main + 0xf3 (0x7f5abe4b5083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorker pid=296668) frame #60: _start + 0x2e (0x5faa2e in ray::RayWorker.execute_method)
(RayWorker pid=296668) 
(RayWorker pid=296668) [2023-11-17 08:38:23,158 E 296668 296668] logging.cc:104: Stack trace: 
(RayWorker pid=296668)  /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf2e81a) [0x7f5abd7a081a] ray::operator<<()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf30fd8) [0x7f5abd7a2fd8] ray::TerminateHandler()
(RayWorker pid=296668) /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a) [0x7f5abc6865aa] _Unwind_Resume
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so(+0x759cc8) [0x7f5a50dfbcc8] THPVariable_clear()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so(_Z28THPVariable_subclass_deallocP7_object+0x325) [0x7f5a50dfc075] THPVariable_subclass_dealloc()
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x5ecd90]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x5447b8]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x54480a]
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_PSN_ISG_IS11_bESaIS16_EERSO_INS0_17LocalMemoryBufferEEPbPSsRKSN_INS0_16ConcurrencyGroupESaIS1F_EESsbbbEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S19_S1C_S1D_S1E_S1J_SsbbbEE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1T_OS19_S1C_OS1D_OS1E_S1J_S1S_ObS1X_S1X_+0x157) [0x7f5abcdf2547] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PS7_IS8_ISN_bESaISU_EEPN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0xc1e) [0x7f5abcfdce5e] ray::core::CoreWorker::ExecuteTask()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDES5_INS0_9RayObjectEEESaISO_EESR_PS7_IS8_ISL_bESaISS_EEPN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_RKSK_SR_SR_SV_S12_S13_S14_EPS18_St12_PlaceholderILi1EES1E_ILi2EES1E_ILi3EES1E_ILi4EES1E_ILi5EES1E_ILi6EES1E_ILi7EES1E_ILi8EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSR_S1U_OSV_OS12_OS13_OS14_+0x58) [0x7f5abcf117d8] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x79498a) [0x7f5abd00698a] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x7ac04e) [0x7f5abd01e04e] ray::core::InboundRequest::Accept()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x10c) [0x7f5abd01f35c] ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x7b02cb) [0x7f5abd0222cb] ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core20ActorSchedulingQueue3AddEllSt8functionIFvS2_IFvNS_6StatusES2_IFvvEES5_EEEES2_IFvRKS3_S7_EES7_RKSsRKSt10shared_ptrINS_27FunctionDescriptorInterfaceEENS_6TaskIDERKSt6vectorINS_3rpc15ObjectReferenceESaISO_EE+0x400) [0x7f5abd023da0] ray::core::ActorSchedulingQueue::Add()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core28CoreWorkerDirectTaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyESt8functionIFvNS_6StatusES8_IFvvEESB_EE+0x1216) [0x7f5abd005016] ray::core::CoreWorkerDirectTaskReceiver::HandleTask()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x735e25) [0x7f5abcfa7e25] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa59886) [0x7f5abd2cb886] EventTracker::RecordExecution()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa4b55e) [0x7f5abd2bd55e] std::_Function_handler<>::_M_invoke()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xa4bab6) [0x7f5abd2bdab6] boost::asio::detail::completion_handler<>::do_complete()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x102fdbb) [0x7f5abd8a1dbb] boost::asio::detail::scheduler::do_run_one()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x1031d99) [0x7f5abd8a3d99] boost::asio::detail::scheduler::run()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0x10324a2) [0x7f5abd8a44a2] boost::asio::io_context::run()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv+0x1c) [0x7f5abcfa6a8c] ray::core::CoreWorker::RunTaskExecutionLoop()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv+0x8c) [0x7f5abcfe825c] ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayWorker pid=296668) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(_ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv+0x1d) [0x7f5abcfe840d] ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x504b7b]
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalFrameDefault+0x851) [0x56bbe1] _PyEval_EvalFrameDefault
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyFunction_Vectorcall+0x1b6) [0x5f5ee6] _PyFunction_Vectorcall
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalFrameDefault+0x851) [0x56bbe1] _PyEval_EvalFrameDefault
(RayWorker pid=296668) ray::RayWorker.execute_method(_PyEval_EvalCodeWithName+0x26a) [0x569d8a] _PyEval_EvalCodeWithName
(RayWorker pid=296668) ray::RayWorker.execute_method(PyEval_EvalCode+0x27) [0x68e267] PyEval_EvalCode
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67d9b1]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67da2f]
(RayWorker pid=296668) ray::RayWorker.execute_method() [0x67dad1]
(RayWorker pid=296668) ray::RayWorker.execute_method(PyRun_SimpleFileExFlags+0x197) [0x67fbf7] PyRun_SimpleFileExFlags
(RayWorker pid=296668) ray::RayWorker.execute_method(Py_RunMain+0x212) [0x6b8082] Py_RunMain
(RayWorker pid=296668) ray::RayWorker.execute_method(Py_BytesMain+0x2d) [0x6b840d] Py_BytesMain
(RayWorker pid=296668) /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f5abe4b5083] __libc_start_main
(RayWorker pid=296668) ray::RayWorker.execute_method(_start+0x2e) [0x5faa2e] _start
(RayWorker pid=296668) *** SIGABRT received at time=1700210303 on cpu 36 ***
(RayWorker pid=296668) PC: @     0x7f5abe4d400b  (unknown)  raise
(RayWorker pid=296668)     @     0x7f5abe4d4090  (unknown)  (unknown)
(RayWorker pid=296668)     @     0x7f5abc73a38c       1008  (unknown)
(RayWorker pid=296668)     @     0x7ffec2633b50        248  (unknown)
(RayWorker pid=296668)     @                0x1  (unknown)  (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,161 E 296668 296668] logging.cc:361: *** SIGABRT received at time=1700210303 on cpu 36 ***
(RayWorker pid=296668) [2023-11-17 08:38:23,161 E 296668 296668] logging.cc:361: PC: @     0x7f5abe4d400b  (unknown)  raise
(RayWorker pid=296668) [2023-11-17 08:38:23,162 E 296668 296668] logging.cc:361:     @     0x7f5abe4d4090  (unknown)  (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,162 E 296668 296668] logging.cc:361:     @     0x7f5abc73a38c       1008  (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,163 E 296668 296668] logging.cc:361:     @     0x7ffec2633b50        248  (unknown)
(RayWorker pid=296668) [2023-11-17 08:38:23,165 E 296668 296668] logging.cc:361:     @                0x1  (unknown)  (unknown)
(RayWorker pid=296668) Fatal Python error: Aborted
(RayWorker pid=296668) Stack (most recent call first):
(RayWorker pid=296668)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/worker.py", line 782 in main_loop
(RayWorker pid=296668)   File "/usr/local/lib/python3.8/dist-packages/ray/_private/workers/default_worker.py", line 278 in <module>
2023-11-17 08:39:33,162 WARNING worker.py:2058 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffda6fb1ed560b0a9302273c5d01000000 Worker ID: d334f3efb7d78478dec2949c7ed2b0ae2563c2266e188631380e6bab Node ID: 4ebfbe49244d6a6cb436e651244d2576a9246ebe2d63480b0e5f80c1 Worker IP address: 172.16.47.112 Worker port: 33295 Worker PID: 296669 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f0e1b5e95e0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0cf13e5e20>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f0e1b5e95e0>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0cf13e5e20>)>
Traceback (most recent call last):
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 328, in engine_step
    request_outputs = await self.engine.step.remote()
ray.exceptions.RayTaskError: ray::_AsyncLLMEngine.step() (pid=296628, ip=172.16.47.112, actor_id=744b80b9032fa37fd1ee549001000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fa737994310>)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 563, in step
    output = self._run_workers(
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 711, in _run_workers
    all_outputs = ray.get(all_outputs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: da6fb1ed560b0a9302273c5d01000000
    pid: 296669
    namespace: 1e594810-681e-4d7c-878c-65854434ef82
    ip: 172.16.47.112
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 11-17 08:39:33 async_llm_engine.py:134] Aborted request 8fd068bbbc5c4760ac2bc86d3174b33f.
INFO:     ::1:47850 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 328, in engine_step
    request_outputs = await self.engine.step.remote()
ray.exceptions.RayTaskError: ray::_AsyncLLMEngine.step() (pid=296628, ip=172.16.47.112, actor_id=744b80b9032fa37fd1ee549001000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fa737994310>)
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 563, in step
    output = self._run_workers(
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/llm_engine.py", line 711, in _run_workers
    all_outputs = ray.get(all_outputs)
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: da6fb1ed560b0a9302273c5d01000000
    pid: 296669
    namespace: 1e594810-681e-4d7c-878c-65854434ef82
    ip: 172.16.47.112
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.8/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.8/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/entrypoints/api_server.py", line 58, in generate
    async for request_output in results_generator:
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 436, in generate
    raise e
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 430, in generate
    async for request_output in stream:
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/mnt/disk2/test/vllm_latest/vllm/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
2023-11-17 08:39:33,977 WARNING worker.py:2058 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff2e42461d426b10b9fffc43ce01000000 Worker ID: 8689b72cd4f3d992c79a89e6fe9161d936dcc38e1ef9f5e8bb0f14b5 Node ID: 4ebfbe49244d6a6cb436e651244d2576a9246ebe2d63480b0e5f80c1 Worker IP address: 172.16.47.112 Worker port: 36633 Worker PID: 296668 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
shanshanpt commented 1 year ago

The Llama model's max_model_len is 2048, so I forced max_model_len to 60000 in vllm/config.py.
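
For reference, depending on the vLLM version the same override can usually be expressed as an engine argument instead of patching the source (a sketch, assuming a 0.2.x-era build where max_model_len is exposed through the engine arguments; note that lifting the limit far past the model's trained context is precisely the long-sequence regime in which the crash above appears):

# Equivalent override without editing vllm/config.py (assumption: the running vLLM
# version exposes max_model_len as an engine argument, as 0.2.x-era releases do).
from vllm import LLM

llm = LLM(
    model="/mnt/disk2/llama-2-13b-chat-hf/",
    tokenizer="/mnt/disk2/lama-tokenizer",
    tensor_parallel_size=2,
    max_model_len=60000,  # lifts the derived 2048 limit, same effect as the config.py edit
)

# The API server accepts the same setting as a flag, e.g. --max-model-len=60000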

tattrongvu commented 1 year ago

It happened to me too when I tried to apply Dynamic NTK RoPE scaling.

Environment:
- Hardware: single A100-80GB
- Model: Falcon-7B
- vLLM version: 0.2.1.post1, pytorch==2.0.1 (cu117)
- RoPE settings in config.json:
  "max_position_embeddings": 2048,
  "rope_scaling": { "factor": 4.0, "type": "dynamic" }
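
For context, this is roughly what "dynamic" NTK scaling does: instead of extrapolating positions past the trained window, it rescales the rotary base as the sequence grows. A sketch of the widely used formula (this follows the HF Transformers-style DynamicNTK implementation; vLLM's internal version may differ, and the parameter values are only illustrative for Falcon-7B):

# Sketch of Dynamic NTK rotary-base rescaling (assumption: HF Transformers-style
# formula; Falcon-7B-like numbers below are only an illustration).
def dynamic_ntk_base(base: float, head_dim: int, seq_len: int,
                     max_position_embeddings: int, factor: float) -> float:
    # Below the trained context length, nothing changes.
    if seq_len <= max_position_embeddings:
        return base
    # Stretch the frequency spectrum so longer positions stay in-distribution.
    scale = (factor * seq_len / max_position_embeddings) - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))

# Example: head_dim=64, trained on 2048 positions, factor=4.0, 4k-token prompt:
print(dynamic_ntk_base(10000.0, 64, 4096, 2048, 4.0))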

The error is shown below:

.........312, 402, 2101, 272, 248, 2132, 4436, 25, 193].
INFO 11-19 21:25:21 llm_engine.py:624] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f0db2301990>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0da2545ff0>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f0db2301990>, request_tracker=<vllm.engine.async_llm_engine.RequestTracker object at 0x7f0da2545ff0>)>
Traceback (most recent call last):
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 330, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 216, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 323, in execute_model
    output = self.model(
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/models/falcon.py", line 413, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 44, in forward
    hidden_states = _prune_hidden_states(hidden_states, input_metadata)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 129, in _prune_hidden_states
    selected_token_indices = torch.tensor(selected_token_indices,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 11-19 21:25:24 async_llm_engine.py:134] Aborted request cmpl-c009662d0f1d48e7a8ff8fb0cb9f0135.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 28, in _raise_exception_on_finish
    task.result()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 351, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 330, in engine_step
    request_outputs = await self.engine.step_async()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 216, in _run_workers_async
    output = executor(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 323, in execute_model
    output = self.model(
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/models/falcon.py", line 413, in forward
    next_tokens = self.sampler(self.lm_head.weight, hidden_states,
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 44, in forward
    hidden_states = _prune_hidden_states(hidden_states, input_metadata)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 129, in _prune_hidden_states
    selected_token_indices = torch.tensor(selected_token_indices,
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 475, in completion_stream_generator
    async for res in result_generator:
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 436, in generate
    raise e
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 430, in generate
    async for request_output in stream:
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 70, in __anext__
    raise result
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 37, in _raise_exception_on_finish
    raise exc
  File "/home/jupyter/ProxyServer/test_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 32, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
junior-zsy commented 12 months ago

Long text cannot be used. I have encountered the same problem, and it is quite serious. Please help me solve it.

#1725 @WoosukKwon @zhuohan123

tattrongvu commented 12 months ago

Side note: I tried with HF Transformers, and a single A100 80GB is enough to run 12k-token inference with Falcon-7B. With vLLM I am only using a 4k-token prompt, which is much smaller and should fit in 80GB of GPU RAM, so this is not an OOM problem.
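
A rough back-of-envelope check supports this (a sketch under assumed Falcon-7B shapes: 32 layers, multi-query attention with a single 64-dim KV head, fp16 everywhere; vLLM pre-allocates its KV cache up front, so this is only an order-of-magnitude check, not how vLLM actually budgets memory):

# Back-of-envelope memory estimate under the assumptions stated above.
BYTES_FP16 = 2
NUM_LAYERS, KV_HEADS, HEAD_DIM = 32, 1, 64  # assumed Falcon-7B (MQA) shapes

weights_gb = 7e9 * BYTES_FP16 / 1e9                               # ~14 GB of parameters
kv_per_token = 2 * NUM_LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V per token
kv_4k_gb = 4096 * kv_per_token / 1e9                              # ~0.03 GB for a 4k prompt
kv_12k_gb = 12288 * kv_per_token / 1e9                            # ~0.1 GB for 12k tokens

print(f"weights ~{weights_gb:.1f} GB, 4k KV ~{kv_4k_gb:.3f} GB, 12k KV ~{kv_12k_gb:.3f} GB")

Both figures are far below 80 GB, which is consistent with the failure being an illegal memory access on long sequences rather than an out-of-memory condition.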