vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: how to terminate a vLLM model and free or release GPU memory #5211

Open wellcasa opened 2 months ago

wellcasa commented 2 months ago

Your current environment

    def destroy(self):
        import contextlib
        import gc
        import os

        import ray
        import torch

        logger.info("vllm destroy")

        def cleanup():
            # In vllm>=0.4 destroy_model_parallel lives under
            # vllm.distributed.parallel_state (older releases used
            # vllm.model_executor.parallel_utils.parallel_state).
            from vllm.distributed.parallel_state import destroy_model_parallel
            os.environ["TOKENIZERS_PARALLELISM"] = "false"
            destroy_model_parallel()
            # destroy_process_group() asserts if no group was ever initialized.
            with contextlib.suppress(AssertionError):
                torch.distributed.destroy_process_group()
            gc.collect()
            torch.cuda.empty_cache()
            ray.shutdown()

        cleanup()
        # Drop the references that keep the model weights alive, then
        # release the cached CUDA blocks back to the driver.
        del self.model.llm_engine.model_executor.driver_worker
        del self.model
        gc.collect()
        torch.cuda.empty_cache()

vllm==0.4.2. I tried this method, but it didn't work.
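
One way to confirm whether a teardown like this actually frees anything is to compare CUDA memory counters before and after calling it. A minimal sketch using standard torch.cuda APIs; the wrapper object name is hypothetical:

    import torch

    def report_gpu_memory(tag: str, device: int = 0) -> None:
        # Allocator view: tensors held by this process, plus cached blocks.
        allocated = torch.cuda.memory_allocated(device) / 1024**3
        reserved = torch.cuda.memory_reserved(device) / 1024**3
        # Driver view: includes memory held by other processes (e.g. Ray workers).
        free, total = torch.cuda.mem_get_info(device)
        print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, "
              f"free={free / 1024**3:.2f}/{total / 1024**3:.2f} GiB")

    report_gpu_memory("before destroy")
    wrapper.destroy()  # hypothetical object holding the destroy() method above
    report_gpu_memory("after destroy")

Memory owned by Ray worker processes only shows up in the driver-level numbers, so the allocator counters can look clean even while the GPU is still occupied.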

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

wellcasa commented 2 months ago

?

SuperBruceJia commented 2 months ago

Please try vllm==0.4.3 or vllm==0.4.1. The GPU memory release problem has now been fixed.

Best regards,

Shuyue June 11th, 2024
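
For reference, on those versions the teardown can reportedly be as small as the following. A sketch assuming a plain vllm.LLM instance; the llm_engine.model_executor attribute path is the one used in the snippets above and may differ in other releases:

    import gc

    import torch
    from vllm import LLM
    from vllm.distributed.parallel_state import destroy_model_parallel

    llm = LLM(model="facebook/opt-125m")  # any small model, for illustration
    # ... run inference ...

    # Tear down the parallel groups, drop the engine references, then
    # return the cached CUDA blocks to the driver.
    destroy_model_parallel()
    del llm.llm_engine.model_executor.driver_worker
    del llm
    gc.collect()
    torch.cuda.empty_cache()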

wellcasa commented 2 months ago

    def destroy(self):
        import gc
        import os

        import ray
        import torch

        # Older releases exposed destroy_model_parallel under
        # model_executor.parallel_utils; try that path first.
        try:
            from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
            os.environ["TOKENIZERS_PARALLELISM"] = "false"
            destroy_model_parallel()
        except Exception as e:
            logger.error(f"Del destroy_model_parallel Failed {e}")
        # Newer releases (>=0.4) moved it to vllm.distributed.parallel_state.
        try:
            from vllm.distributed.parallel_state import destroy_model_parallel
            os.environ["TOKENIZERS_PARALLELISM"] = "false"
            destroy_model_parallel()
            del self.model.llm_engine.model_executor.driver_worker
        except Exception as e:
            logger.error(f"Del destroy_model_parallel Failed {e}")
        try:
            del self.model.llm_engine.driver_worker
        except Exception as e:
            logger.error(f"Del driver_worker Failed {e}")
        try:
            del self.model.llm_engine.model_executor
        except Exception as e:
            logger.error(f"Del model_executor Failed {e}")

        # Drop the remaining engine references, then free caches and
        # tear down the process group and Ray.
        del self.model.llm_engine
        del self.model
        gc.collect()
        torch.cuda.empty_cache()
        torch.distributed.destroy_process_group()
        ray.shutdown()

I have tried all of the solutions above, and they do work. However, this no longer works well in version 0.5.0.
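
On recent 0.5.x releases, vllm.distributed.parallel_state also exposes destroy_distributed_environment(), and calling it after destroy_model_parallel() is what newer cleanup examples do. A sketch under that assumption; attribute paths may still shift between releases:

    import gc

    import torch
    from vllm.distributed.parallel_state import (
        destroy_distributed_environment,
        destroy_model_parallel,
    )

    def release_vllm(llm) -> None:
        # Best-effort teardown for a vllm.LLM instance on 0.5.x.
        destroy_model_parallel()
        destroy_distributed_environment()
        # Drop the executor that owns the driver worker and model weights.
        del llm.llm_engine.model_executor
        gc.collect()
        torch.cuda.empty_cache()

The caller still has to delete its own reference to llm afterwards; otherwise the weights stay reachable and the allocator cannot return them.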