huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/
Other
842 stars 175 forks

When the vega training service launches the vgg19 service, a thrown exception is not caught, causing Cluster shutdown #154

Closed hdulinjian closed 2 years ago

hdulinjian commented 3 years ago

[Image attachment not preserved]

zhangjiajin commented 2 years ago
        try:
            # Register _shutdown_cluster as the handler for SIGINT and SIGTERM.
            signal.signal(signal.SIGINT, _shutdown_cluster)
            signal.signal(signal.SIGTERM, _shutdown_cluster)
            # Run each configured pipeline step in order.
            for step_name in PipelineConfig.steps:
                step_cfg = UserConfig().data.get(step_name)
                General.step_name = step_name
                PipeStepConfig.renew()
                PipeStepConfig.from_dict(step_cfg, skip_check=False)
                self._set_evaluator_config(step_cfg)
                logging.info("-" * 48)
                logging.info("  Step: {}".format(step_name))
                logging.info("-" * 48)
                logger.debug("Pipe step config: {}".format(PipeStepConfig()))
                # Pick the parallel setting that matches the step type.
                if PipeStepConfig.type == "SearchPipeStep":
                    General._parallel = General.parallel_search
                if PipeStepConfig.type == "TrainPipeStep":
                    General._parallel = General.parallel_fully_train

                pipestep = PipeStep(name=step_name)
                self.steps.append(pipestep)
                pipestep.do()
        except Exception as e:
            # Log the failure and, if a step object was created, mark it as errored.
            logger.error("Failed to run pipeline.")
            logger.error(traceback.format_exc())
            error_occured = True
            if "pipestep" in locals():
                pipestep.update_status(Status.error, str(e))

        # Reached whether or not a step raised an Exception.
        shutdown_cluster()
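
For readers who want to try the pattern in isolation, here is a minimal, self-contained sketch of the same idea: catch exceptions raised by a step, record the error on that step, and still shut down afterwards. It is not Vega's code. `DemoStep`, `run_pipeline`, and the `shutdown` callback are hypothetical names made up for this example, and the plain status strings stand in for Vega's `Status` values.

```python
import logging
import traceback

logger = logging.getLogger(__name__)


class DemoStep:
    """Hypothetical stand-in for a pipeline step; not Vega's PipeStep."""

    def __init__(self, name, should_fail=False):
        self.name = name
        self.should_fail = should_fail
        self.status = "running"
        self.message = ""

    def do(self):
        # Simulate a step that may raise while it runs.
        if self.should_fail:
            raise RuntimeError("step {} failed".format(self.name))
        self.status = "finished"

    def update_status(self, status, message):
        self.status = status
        self.message = message


def run_pipeline(steps, shutdown):
    """Run steps in order; record the first failure and always call shutdown()."""
    current = None
    error_occurred = False
    try:
        for current in steps:
            current.do()
    except Exception as e:
        logger.error("Failed to run pipeline.")
        logger.error(traceback.format_exc())
        error_occurred = True
        # Mark the step that was running when the exception was raised.
        if current is not None:
            current.update_status("error", str(e))
    finally:
        # Assumption of this sketch: shutdown is placed in `finally`, so it
        # also runs for exceptions that `except Exception` does not catch.
        shutdown()
    return error_occurred
```

One difference from the snippet above: the sketch tracks the active step in a `current` variable instead of checking `locals()`, and it uses `finally` for the shutdown call. Whether `finally` is appropriate in Vega itself depends on how its signal handlers and cluster teardown interact, which this example does not model.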