run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: tensor model parallel group is not initialized #12282

Closed · osafaimal closed this issue 5 months ago

osafaimal commented 5 months ago

Bug Description

When I run the code below twice, I get the error: tensor model parallel group is not initialized. The problem seems to come from vLLM, but I don't understand precisely where it originates.

from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core import Document
from llama_index.core import Settings
from llama_index.llms.vllm import Vllm
from langchain.embeddings import HuggingFaceBgeEmbeddings

# model, emodel, and vllm_kwargs are defined elsewhere in the notebook.
llm = Vllm(
    model=model,
    temperature=0,
    download_dir="./models",
    vllm_kwargs=vllm_kwargs,
)

embed_model = HuggingFaceBgeEmbeddings(
    model_name=emodel,
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True},
)

Settings.llm = llm
Settings.embed_model = embed_model

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=5)

# assume documents are defined -> extract nodes
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        text_splitter,
        title_extractor,
        qa_extractor,
        # embed_model,
    ]
)

nodes = pipeline.run(
    # run the pipeline
    documents=[Document.example()],
    in_place=True,
    show_progress=True,
)
print([node for node in nodes])

Version

0.10.23

Steps to Reproduce

Run the code above twice in the same Python process (for example, re-run the notebook cell).
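
A tighter reproduction is probably just instantiating the wrapper twice in the same process (a minimal sketch, assuming the failure comes from building the vLLM engine a second time rather than from the ingestion pipeline itself; model and vllm_kwargs stand for whatever you normally pass):

from llama_index.llms.vllm import Vllm

llm_1 = Vllm(model=model, download_dir="./models", vllm_kwargs=vllm_kwargs)
print(llm_1.complete("Hello"))  # the first engine works

# Re-running the cell builds a second engine in the same process; the second
# generate call is where the assertion in the logs below is raised.
llm_2 = Vllm(model=model, download_dir="./models", vllm_kwargs=vllm_kwargs)
print(llm_2.complete("Hello"))  # AssertionError: tensor model parallel group is not initialized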

Relevant Logs/Tracebacks

{
    "name": "AssertionError",
    "message": "tensor model parallel group is not initialized",
    "stack": "---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[4], line 26
     15 from llama_index.core.ingestion import IngestionPipeline
     17 pipeline = IngestionPipeline(
     18     transformations=[
     19                     text_splitter,
   (...)
     23                     ]
     24 )
---> 26 nodes = pipeline.run(
     27     # run the pipeline
     28     documents=[Document.example()],
     29     in_place=True,
     30     show_progress=True,
     31 )
     32 print([node for node in nodes])

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py:729, in IngestionPipeline.run(self, show_progress, documents, nodes, cache_collection, in_place, store_doc_text, num_workers, **kwargs)
    727         nodes = reduce(lambda x, y: x + y, nodes_parallel, [])
    728 else:
--> 729     nodes = run_transformations(
    730         nodes_to_run,
    731         self.transformations,
    732         show_progress=show_progress,
    733         cache=self.cache if not self.disable_cache else None,
    734         cache_collection=cache_collection,
    735         in_place=in_place,
    736         **kwargs,
    737     )
    739 if self.vector_store is not None:
    740     self.vector_store.add([n for n in nodes if n.embedding is not None])

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py:124, in run_transformations(nodes, transformations, in_place, cache, cache_collection, **kwargs)
    122         nodes = cached_nodes
    123     else:
--> 124         nodes = transform(nodes, **kwargs)
    125         cache.put(hash, nodes, collection=cache_collection)
    126 else:

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/interface.py:159, in BaseExtractor.__call__(self, nodes, **kwargs)
    151 def __call__(self, nodes: List[BaseNode], **kwargs: Any) -> List[BaseNode]:
    152     \"\"\"Post process nodes parsed from documents.
    153 
    154     Allows extractors to be chained.
   (...)
    157         nodes (List[BaseNode]): nodes to post-process
    158     \"\"\"
--> 159     return self.process_nodes(nodes, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/interface.py:142, in BaseExtractor.process_nodes(self, nodes, excluded_embed_metadata_keys, excluded_llm_metadata_keys, **kwargs)
    135 def process_nodes(
    136     self,
    137     nodes: List[BaseNode],
   (...)
    140     **kwargs: Any,
    141 ) -> List[BaseNode]:
--> 142     return asyncio.run(
    143         self.aprocess_nodes(
    144             nodes,
    145             excluded_embed_metadata_keys=excluded_embed_metadata_keys,
    146             excluded_llm_metadata_keys=excluded_llm_metadata_keys,
    147             **kwargs,
    148         )
    149     )

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/nest_asyncio.py:30, in _patch_asyncio.<locals>.run(main, debug)
     28 task = asyncio.ensure_future(main)
     29 try:
---> 30     return loop.run_until_complete(task)
     31 finally:
     32     if not task.done():

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/nest_asyncio.py:98, in _patch_loop.<locals>.run_until_complete(self, future)
     95 if not f.done():
     96     raise RuntimeError(
     97         'Event loop stopped before Future completed.')
---> 98 return f.result()

File /usr/lib/python3.10/asyncio/futures.py:201, in Future.result(self)
    199 self.__log_traceback = False
    200 if self._exception is not None:
--> 201     raise self._exception.with_traceback(self._exception_tb)
    202 return self._result

File /usr/lib/python3.10/asyncio/tasks.py:232, in Task.__step(***failed resolving arguments***)
    228 try:
    229     if exc is None:
    230         # We use the `send` method directly, because coroutines
    231         # don't have `__iter__` and `__next__` methods.
--> 232         result = coro.send(None)
    233     else:
    234         result = coro.throw(exc)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/interface.py:120, in BaseExtractor.aprocess_nodes(self, nodes, excluded_embed_metadata_keys, excluded_llm_metadata_keys, **kwargs)
    117 else:
    118     new_nodes = [deepcopy(node) for node in nodes]
--> 120 cur_metadata_list = await self.aextract(new_nodes)
    121 for idx, node in enumerate(new_nodes):
    122     node.metadata.update(cur_metadata_list[idx])

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/metadata_extractors.py:104, in TitleExtractor.aextract(self, nodes)
    102 async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]:
    103     nodes_by_doc_id = self.separate_nodes_by_ref_id(nodes)
--> 104     titles_by_doc_id = await self.extract_titles(nodes_by_doc_id)
    105     return [{\"document_title\": titles_by_doc_id[node.ref_doc_id]} for node in nodes]

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/metadata_extractors.py:131, in TitleExtractor.extract_titles(self, nodes_by_doc_id)
    129 titles_by_doc_id = {}
    130 for key, nodes in nodes_by_doc_id.items():
--> 131     title_candidates = await self.get_title_candidates(nodes)
    132     combined_titles = \", \".join(title_candidates)
    133     titles_by_doc_id[key] = await self.llm.apredict(
    134         PromptTemplate(template=self.combine_template),
    135         context_str=combined_titles,
    136     )

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/extractors/metadata_extractors.py:147, in TitleExtractor.get_title_candidates(self, nodes)
    139 async def get_title_candidates(self, nodes: List[BaseNode]) -> List[str]:
    140     title_jobs = [
    141         self.llm.apredict(
    142             PromptTemplate(template=self.node_template),
   (...)
    145         for node in nodes
    146     ]
--> 147     return await run_jobs(
    148         title_jobs, show_progress=self.show_progress, workers=self.num_workers
    149     )

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/async_utils.py:113, in run_jobs(jobs, show_progress, workers)
    109         return await job
    111 pool_jobs = [worker(job) for job in jobs]
--> 113 return await asyncio_mod.gather(*pool_jobs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/tqdm/asyncio.py:79, in tqdm_asyncio.gather(cls, loop, timeout, total, *fs, **tqdm_kwargs)
     76     return i, await f
     78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
     80                                          total=total, **tqdm_kwargs)]
     81 return [i for _, i in sorted(res)]

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/tqdm/asyncio.py:79, in <listcomp>(.0)
     76     return i, await f
     78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
     80                                          total=total, **tqdm_kwargs)]
     81 return [i for _, i in sorted(res)]

File /usr/lib/python3.10/asyncio/tasks.py:571, in as_completed.<locals>._wait_for_one()
    568 if f is None:
    569     # Dummy value from _on_timeout().
    570     raise exceptions.TimeoutError
--> 571 return f.result()

File /usr/lib/python3.10/asyncio/futures.py:201, in Future.result(self)
    199 self.__log_traceback = False
    200 if self._exception is not None:
--> 201     raise self._exception.with_traceback(self._exception_tb)
    202 return self._result

File /usr/lib/python3.10/asyncio/tasks.py:232, in Task.__step(***failed resolving arguments***)
    228 try:
    229     if exc is None:
    230         # We use the `send` method directly, because coroutines
    231         # don't have `__iter__` and `__next__` methods.
--> 232         result = coro.send(None)
    233     else:
    234         result = coro.throw(exc)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/tqdm/asyncio.py:76, in tqdm_asyncio.gather.<locals>.wrap_awaitable(i, f)
     75 async def wrap_awaitable(i, f):
---> 76     return i, await f

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/async_utils.py:109, in run_jobs.<locals>.worker(job)
    107 async def worker(job: Coroutine) -> Any:
    108     async with semaphore:
--> 109         return await job

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py:114, in Dispatcher.span.<locals>.async_wrapper(*args, **kwargs)
    112     result = await func(*args, **kwargs)
    113 except Exception as e:
--> 114     self.span_drop(*args, id=id, err=e, **kwargs)
    115 else:
    116     self.span_exit(*args, id=id, result=result, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py:77, in Dispatcher.span_drop(self, id, err, *args, **kwargs)
     75 while c:
     76     for h in c.span_handlers:
---> 77         h.span_drop(*args, id=id, err=err, **kwargs)
     78     if not c.propagate:
     79         c = None

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/instrumentation/span_handlers/base.py:47, in BaseSpanHandler.span_drop(self, id, err, *args, **kwargs)
     45 def span_drop(self, *args, id: str, err: Optional[Exception], **kwargs) -> None:
     46     \"\"\"Logic for dropping a span i.e. early exit.\"\"\"
---> 47     self.prepare_to_drop_span(*args, id=id, err=err, **kwargs)
     48     if self.current_span_id == id:
     49         self.current_span_id = self.open_spans[id].parent_id

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/instrumentation/span_handlers/null.py:35, in NullSpanHandler.prepare_to_drop_span(self, id, err, *args, **kwargs)
     33 \"\"\"Logic for droppping a span.\"\"\"
     34 if err:
---> 35     raise err

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/instrumentation/dispatcher.py:112, in Dispatcher.span.<locals>.async_wrapper(*args, **kwargs)
    110 self.span_enter(*args, id=id, **kwargs)
    111 try:
--> 112     result = await func(*args, **kwargs)
    113 except Exception as e:
    114     self.span_drop(*args, id=id, err=e, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/llms/llm.py:498, in LLM.apredict(self, prompt, **prompt_args)
    496 else:
    497     formatted_prompt = self._get_prompt(prompt, **prompt_args)
--> 498     response = await self.acomplete(formatted_prompt, formatted=True)
    499     output = response.text
    501 dispatcher.event(LLMPredictEndEvent())

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/llms/callbacks.py:232, in llm_completion_callback.<locals>.wrap.<locals>.wrapped_async_llm_predict(_self, *args, **kwargs)
    216 dispatcher.event(
    217     LLMCompletionStartEvent(
    218         model_dict=_self.to_dict(),
   (...)
    221     )
    222 )
    223 event_id = callback_manager.on_event_start(
    224     CBEventType.LLM,
    225     payload={
   (...)
    229     },
    230 )
--> 232 f_return_val = await f(_self, *args, **kwargs)
    234 if isinstance(f_return_val, AsyncGenerator):
    235     # intercept the generator and add a callback to the end
    236     async def wrapped_gen() -> CompletionResponseAsyncGen:

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/llms/vllm/base.py:279, in Vllm.acomplete(self, prompt, formatted, **kwargs)
    274 @llm_completion_callback()
    275 async def acomplete(
    276     self, prompt: str, formatted: bool = False, **kwargs: Any
    277 ) -> CompletionResponse:
    278     kwargs = kwargs if kwargs else {}
--> 279     return self.complete(prompt, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/core/llms/callbacks.py:294, in llm_completion_callback.<locals>.wrap.<locals>.wrapped_llm_predict(_self, *args, **kwargs)
    278 dispatcher.event(
    279     LLMCompletionStartEvent(
    280         model_dict=_self.to_dict(),
   (...)
    283     )
    284 )
    285 event_id = callback_manager.on_event_start(
    286     CBEventType.LLM,
    287     payload={
   (...)
    291     },
    292 )
--> 294 f_return_val = f(_self, *args, **kwargs)
    295 if isinstance(f_return_val, Generator):
    296     # intercept the generator and add a callback to the end
    297     def wrapped_gen() -> CompletionResponseGen:

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/llama_index/llms/vllm/base.py:252, in Vllm.complete(self, prompt, formatted, **kwargs)
    250 # build sampling parameters
    251 sampling_params = SamplingParams(**params)
--> 252 outputs = self._client.generate([prompt], sampling_params)
    253 return CompletionResponse(text=outputs[0].outputs[0].text)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:182, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, prefix_pos, use_tqdm, lora_request)
    175     token_ids = None if prompt_token_ids is None else prompt_token_ids[
    176         i]
    177     self._add_request(prompt,
    178                       sampling_params,
    179                       token_ids,
    180                       lora_request=lora_request,
    181                       prefix_pos=prefix_pos_i)
--> 182 return self._run_engine(use_tqdm)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py:208, in LLM._run_engine(self, use_tqdm)
    206 outputs: List[RequestOutput] = []
    207 while self.llm_engine.has_unfinished_requests():
--> 208     step_outputs = self.llm_engine.step()
    209     for output in step_outputs:
    210         if output.finished:

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:838, in LLMEngine.step(self)
    834 seq_group_metadata_list, scheduler_outputs = self.scheduler.schedule()
    836 if not scheduler_outputs.is_empty():
    837     # Execute the model.
--> 838     all_outputs = self._run_workers(
    839         \"execute_model\",
    840         driver_kwargs={
    841             \"seq_group_metadata_list\": seq_group_metadata_list,
    842             \"blocks_to_swap_in\": scheduler_outputs.blocks_to_swap_in,
    843             \"blocks_to_swap_out\": scheduler_outputs.blocks_to_swap_out,
    844             \"blocks_to_copy\": scheduler_outputs.blocks_to_copy,
    845         },
    846         use_ray_compiled_dag=USE_RAY_COMPILED_DAG)
    848     # Only the driver worker returns the sampling results.
    849     output = all_outputs[0]

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py:1041, in LLMEngine._run_workers(self, method, driver_args, driver_kwargs, max_concurrent_workers, use_ray_compiled_dag, *args, **kwargs)
   1038     driver_kwargs = kwargs
   1040 # Start the driver worker after all the ray workers.
-> 1041 driver_worker_output = getattr(self.driver_worker,
   1042                                method)(*driver_args, **driver_kwargs)
   1044 # Get the results of the ray workers.
   1045 if self.workers:

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/worker/worker.py:223, in Worker.execute_model(self, seq_group_metadata_list, blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
    220 if num_seq_groups == 0:
    221     return {}
--> 223 output = self.model_runner.execute_model(seq_group_metadata_list,
    224                                          self.gpu_cache)
    225 return output

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py:582, in ModelRunner.execute_model(self, seq_group_metadata_list, kv_caches)
    580 else:
    581     model_executable = self.model
--> 582 hidden_states = model_executable(
    583     input_ids=input_tokens,
    584     positions=input_positions,
    585     kv_caches=kv_caches,
    586     input_metadata=input_metadata,
    587 )
    589 # Sample the next token.
    590 output = self.model.sample(
    591     hidden_states=hidden_states,
    592     sampling_metadata=sampling_metadata,
    593 )

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:337, in LlamaForCausalLM.forward(self, input_ids, positions, kv_caches, input_metadata)
    330 def forward(
    331     self,
    332     input_ids: torch.Tensor,
   (...)
    335     input_metadata: InputMetadata,
    336 ) -> torch.Tensor:
--> 337     hidden_states = self.model(input_ids, positions, kv_caches,
    338                                input_metadata)
    339     return hidden_states

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:263, in LlamaModel.forward(self, input_ids, positions, kv_caches, input_metadata)
    256 def forward(
    257     self,
    258     input_ids: torch.Tensor,
   (...)
    261     input_metadata: InputMetadata,
    262 ) -> torch.Tensor:
--> 263     hidden_states = self.embed_tokens(input_ids)
    264     residual = None
    265     for i in range(len(self.layers)):

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py:109, in VocabParallelEmbedding.forward(self, input_)
    107     output_parallel[input_mask, :] = 0.0
    108 # Reduce across all the model parallel GPUs.
--> 109 output = tensor_model_parallel_all_reduce(output_parallel)
    110 return output

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/communication_op.py:30, in tensor_model_parallel_all_reduce(input_)
     18 \"\"\"All-reduce the input tensor across model parallel group.
     19 
     20 NOTE: This operation will be applied in-place on the input tensor if
   (...)
     27 value as the output. 
     28 \"\"\"
     29 # Bypass the function if we are using only 1 GPU.
---> 30 if get_tensor_model_parallel_world_size() == 1:
     31     return input_
     32 out = custom_all_reduce(input_)

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/parallel_state.py:137, in get_tensor_model_parallel_world_size()
    134 def get_tensor_model_parallel_world_size():
    135     \"\"\"Return world size for the tensor model parallel group.\"\"\"
    136     return torch.distributed.get_world_size(
--> 137         group=get_tensor_model_parallel_group())

File ~/Documents/LLM_from_base/.venv/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/parallel_state.py:122, in get_tensor_model_parallel_group()
    120 def get_tensor_model_parallel_group():
    121     \"\"\"Get the tensor model parallel group the caller rank belongs to.\"\"\"
--> 122     assert _TENSOR_MODEL_PARALLEL_GROUP is not None, (
    123         \"tensor model parallel group is not initialized\")
    124     return _TENSOR_MODEL_PARALLEL_GROUP

AssertionError: tensor model parallel group is not initialized"
}
dosubot[bot] commented 5 months ago

Hey @osafaimal, great to see you back! 😊 Hope everything's going well on your end. I'm diving into this intriguing issue you've encountered and will get back to you with a detailed response shortly. Hang tight!

logan-markewich commented 5 months ago

Shouldn't this issue be opened on vllm?

osafaimal commented 5 months ago

I don't know if it's fully a vLLM problem or if it's how LlamaIndex uses vLLM.

dosubot[bot] commented 5 months ago

Based on the information you've provided and the context from the LlamaIndex repository, it seems like the AssertionError "tensor model parallel group is not initialized" you're encountering when running your LLM model using LlamaIndex v0.10.23 is likely due to the model's tensor parallelism (TP) configuration not being properly initialized or matched with the runtime environment's configuration. This error typically arises when there's a mismatch between the expected and actual configurations for distributed model parallelism, particularly in environments utilizing NVIDIA's TensorRT for optimized inference.

Here are some steps you can take to resolve this issue:

  1. Correct Installation and Configuration: Verify that tensorrt_llm and its dependencies are correctly installed and configured in your environment. The error message suggests ensuring the installation with a specific command, which might be necessary if the environment has changed or if the initial setup was incomplete.

  2. Matching World Sizes: Ensure that the world size defined by your model's configuration (the product of tp_size and pp_size) matches the runtime environment's world size (tensorrt_llm.mpi_world_size()). This might involve adjusting your model's configuration or the environment setup to ensure consistency.

  3. Environment Support: The assertion pp_size == 1 indicates that pipeline parallelism is not supported in the Python runtime you're using. If your model or environment is configured to use pipeline parallelism, this could be a source of conflict. Ensure that your setup is compatible with the limitations of the Python runtime regarding pipeline parallelism.

  4. GPU Availability: The assertion torch.cuda.is_available() checks for CUDA-enabled GPU availability. Ensure that your runtime environment has access to a compatible NVIDIA GPU, as the TensorRT-LLM model requires it for operation.

  5. MPI Initialization: The error might also stem from an improperly initialized MPI (Message Passing Interface) environment, which is crucial for distributed computing scenarios like model parallelism. Ensure that your MPI environment is correctly initialized before running the model.

If after checking these points the issue persists, it might be helpful to consult the documentation or support resources for tensorrt_llm, LlamaIndex, or the specific hardware and software environment you're using to identify any additional configuration or compatibility requirements.
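
Points 4 and 5 above can be sanity-checked with a few lines (a minimal sketch that only inspects the local environment and does not touch vLLM's internal parallel state):

import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
# vLLM sets up a torch.distributed process group even for a single GPU;
# if this prints False after a previous run, the group has been torn down.
print("torch.distributed initialized:", dist.is_initialized())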

Regarding the functions you mentioned, unfortunately, I couldn't find specific information about get_tensor_model_parallel_group(), tensor_model_parallel_all_reduce(), and execute_model() within the LlamaIndex repository. However, the acomplete() function in the base.py file in the LlamaIndex codebase is used to asynchronously complete a given prompt. This function takes a prompt as input, optionally formats it if the formatted argument is set to False, and then uses the superclass's acomplete() method to asynchronously generate a completion response based on the given prompt. This is particularly useful for integrating with asynchronous codebases or handling operations that benefit from non-blocking execution. You can find more details about this function in the source code.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

logan-markewich commented 5 months ago

@osafaimal llama-index is just doing

[screenshot: the Vllm.complete implementation in llama_index/llms/vllm/base.py]
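
Per the traceback, Vllm.complete() reduces to a plain vllm.LLM.generate call. A stripped-down sketch of that path (the model name here is a placeholder, and this is not the full llama_index implementation):

from vllm import LLM, SamplingParams

# roughly what the wrapper does internally, per the traceback
client = LLM(model="your-model-name", download_dir="./models")  # placeholder model
sampling_params = SamplingParams(temperature=0)
outputs = client.generate(["Hello"], sampling_params)
print(outputs[0].outputs[0].text)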

I don't know enough about vLLM to know if that's missing anything. It might be model/env specific?

osafaimal commented 5 months ago

> I don't know enough about vLLM to know if that's missing anything. It might be model/env specific?

It's not model specific, because I tested with multiple models and got the same result.

osafaimal commented 5 months ago

I can't reproduce my problem anymore. I don't know what solved it.
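
For anyone who lands here with the same assertion after re-creating a Vllm instance in the same process (for example by re-running a notebook cell), one workaround that is sometimes suggested is to explicitly tear down vLLM's parallel state and release the old engine before building a new one. The sketch below assumes destroy_model_parallel exists in your vLLM release (its module path has moved between versions) and that the wrapper stores the engine in the private _client attribute, which is an implementation detail:

import gc

import torch

try:
    # older vLLM releases, matching the paths in the traceback above
    from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
except ImportError:
    # newer releases moved the helper
    from vllm.distributed.parallel_state import destroy_model_parallel


def release_vllm(llm) -> None:
    """Best-effort teardown so a fresh Vllm(...) can be constructed afterwards."""
    destroy_model_parallel()      # clears the tensor model parallel group
    llm._client = None            # drop the underlying vllm.LLM engine
    gc.collect()
    torch.cuda.empty_cache()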