mindspore-lab / mindyolo

A toolbox of yolo models and algorithms based on MindSpore
Apache License 2.0

How can I print how much device memory is still free during training? #352

Open Living190711 opened 2 weeks ago

Living190711 commented 2 weeks ago

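Repository: https://github.com/mindspore-lab/mindyolo/tree/v0.3.0

Environment: ModelArts, mindspore_2.2.12-cann_7.0.1.1-py_3.9-euler_2.10.7-aarch64-snt3p [image]

Problem description: With 1000+ images, a batch size of 64, and num_parallel_workers=8, one epoch takes about 2.24 minutes to train. How can I print how much device memory I have left? Is a little over 2 minutes per epoch normal? If not, what can I do about it?

Please reply, thanks!

The terminal log is shown below: [image]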
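To judge whether the reported epoch time is reasonable, it helps to convert it into per-step and per-image figures. The sketch below uses only the numbers from the problem description and assumes exactly 1000 images (the report says "1000+", so the real step count may be slightly higher) and that the last partial batch is kept.

```python
import math

# Figures taken from the problem description above.
num_images = 1000      # reported as "1000+", so treat this as a lower bound
batch_size = 64
epoch_minutes = 2.24

epoch_seconds = epoch_minutes * 60
# Assumption: the final partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(num_images / batch_size)
seconds_per_step = epoch_seconds / steps_per_epoch
images_per_second = num_images / epoch_seconds

print(f"steps per epoch:   {steps_per_epoch}")
print(f"seconds per step:  {seconds_per_step:.1f}")
print(f"images per second: {images_per_second:.1f}")
```

Under these assumptions an epoch is about 16 steps of roughly 8.4 seconds each. This figure alone does not show whether the time goes to computation, data loading, or graph compilation in the first epoch.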

Living190711 commented 2 weeks ago

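The reason is that I want to print how much device memory is left, so that I can increase the batch size, improve memory utilization, and thereby speed up training.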
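One way to check device memory from outside the training process is to call the `npu-smi info` command that ships with the Ascend driver and read its report. The sketch below is a minimal wrapper, not part of mindyolo: the helper name `query_device_report` is hypothetical, and it assumes `npu-smi` is on the `PATH` of the ModelArts container. It returns the raw text rather than parsing it, since the report layout can differ between driver versions.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def query_device_report(tool: str = "npu-smi",
                        args: Sequence[str] = ("info",)) -> Optional[str]:
    """Run the device status tool and return its text report.

    Returns None if the tool is not installed or exits with an error.
    """
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, *args], capture_output=True,
                            text=True, check=False)
    return result.stdout if result.returncode == 0 else None


report = query_device_report()
print(report if report is not None else "npu-smi is not available on this machine")
```

Running this in a second terminal while training is in progress shows memory usage at that moment. Note that the framework may reserve a memory pool up front, so the reported usage can be higher than what the model actually needs.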

Living190711 commented 2 weeks ago

How do I resume training from a checkpoint? I could not find a related parameter here. I would appreciate an answer, thanks.

def get_parser_train(parents=None):
    parser = argparse.ArgumentParser(description="Train", parents=[parents] if parents else [])
    parser.add_argument("--task", type=str, default="detect", choices=["detect", "segment"])
    parser.add_argument("--device_target", type=str, default="Ascend", help="device target, Ascend/GPU/CPU")
    parser.add_argument("--save_dir", type=str, default="./runs", help="save dir")
    parser.add_argument("--device_per_servers", type=int, default=8, help="device number on a server")
    parser.add_argument("--log_level", type=str, default="INFO", help="log level to print")
    parser.add_argument("--is_parallel", type=ast.literal_eval, default=False, help="Distribute train or not")
    parser.add_argument("--ms_mode", type=int, default=0,
                        help="Running in GRAPH_MODE(0) or PYNATIVE_MODE(1) (default=0)")
    parser.add_argument("--ms_amp_level", type=str, default="O0", help="amp level, O0/O1/O2/O3")
    parser.add_argument("--keep_loss_fp32", type=ast.literal_eval, default=True,
                        help="Whether to maintain loss using fp32/O0-level calculation")
    parser.add_argument("--ms_loss_scaler", type=str, default="static", help="train loss scaler, static/dynamic/none")
    parser.add_argument("--ms_loss_scaler_value", type=float, default=1024.0, help="static loss scale value")
    parser.add_argument("--ms_jit", type=ast.literal_eval, default=True, help="use jit or not")
    parser.add_argument("--ms_enable_graph_kernel", type=ast.literal_eval, default=False,
                        help="use enable_graph_kernel or not")
    parser.add_argument("--ms_datasink", type=ast.literal_eval, default=False, help="Train with datasink.")
    parser.add_argument("--overflow_still_update", type=ast.literal_eval, default=True, help="overflow still update")
    parser.add_argument("--clip_grad", type=ast.literal_eval, default=False)
    parser.add_argument("--clip_grad_value", type=float, default=10.0)
    parser.add_argument("--ema", type=ast.literal_eval, default=True, help="ema")
    parser.add_argument("--weight", type=str, default="", help="initial weight path")
    parser.add_argument("--ema_weight", type=str, default="", help="initial ema weight path")
    parser.add_argument("--freeze", type=list, default=[], help="Freeze layers: backbone of yolov7=50, first3=0 1 2")
    parser.add_argument("--epochs", type=int, default=300, help="total train epochs")
    parser.add_argument("--per_batch_size", type=int, default=32, help="per batch size for each device")
    parser.add_argument("--img_size", type=list, default=640, help="train image sizes")
    parser.add_argument("--nbs", type=list, default=64, help="nbs")
    parser.add_argument("--accumulate", type=int, default=1,
                        help="grad accumulate step, recommended when batch-size is less than 64")
    parser.add_argument("--auto_accumulate", type=ast.literal_eval, default=False, help="auto accumulate")
    parser.add_argument("--log_interval", type=int, default=100, help="log interval")
    parser.add_argument("--single_cls", type=ast.literal_eval, default=False,
                        help="train multi-class data as single-class")
    parser.add_argument("--sync_bn", type=ast.literal_eval, default=False,
                        help="use SyncBatchNorm, only available in DDP mode")
    parser.add_argument("--keep_checkpoint_max", type=int, default=100)
    parser.add_argument("--run_eval", type=ast.literal_eval, default=False, help="Whether to run eval during training")
    parser.add_argument("--conf_thres", type=float, default=0.001, help="object confidence threshold for run_eval")
    parser.add_argument("--iou_thres", type=float, default=0.65, help="IOU threshold for NMS for run_eval")
    parser.add_argument("--conf_free", type=ast.literal_eval, default=False,
                        help="Whether the prediction result include conf")
    parser.add_argument("--rect", type=ast.literal_eval, default=False, help="rectangular training")
    parser.add_argument("--nms_time_limit", type=float, default=20.0, help="time limit for NMS")
    parser.add_argument("--recompute", type=ast.literal_eval, default=False, help="Recompute")
    parser.add_argument("--recompute_layers", type=int, default=0)
    parser.add_argument("--seed", type=int, default=2, help="set global seed")
    parser.add_argument("--summary", type=ast.literal_eval, default=True, help="collect train loss scaler or not")
    parser.add_argument("--profiler", type=ast.literal_eval, default=False, help="collect profiling data or not")
    parser.add_argument("--profiler_step_num", type=int, default=1, help="collect profiler data for how many steps.")
    parser.add_argument("--opencv_threads_num", type=int, default=2, help="set the number of threads for opencv")
    parser.add_argument("--strict_load", type=ast.literal_eval, default=True, help="strictly load the pretrain model")

    # args for ModelArts
    parser.add_argument("--enable_modelarts", type=ast.literal_eval, default=False, help="enable modelarts")
    parser.add_argument("--data_url", type=str, default="", help="ModelArts: obs path to dataset folder")
    parser.add_argument("--ckpt_url", type=str, default="", help="ModelArts: obs path to pretrain model checkpoint file")
    parser.add_argument("--multi_data_url", type=str, default="", help="ModelArts: list of obs paths to multi-dataset folders")
    parser.add_argument("--pretrain_url", type=str, default="", help="ModelArts: list of obs paths to multi-pretrain model files")
    parser.add_argument("--train_url", type=str, default="", help="ModelArts: obs path to output folder")
    parser.add_argument("--data_dir", type=str, default="/cache/data/",
                        help="ModelArts: local device path to dataset folder")
    parser.add_argument("--ckpt_dir", type=str, default="/cache/pretrain_ckpt/",
                        help="ModelArts: local device path to checkpoint folder")
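The quoted parser has no flag named for resuming, but it does define `--weight` and `--ema_weight`, both described as initial weight paths. One possible workaround, assuming these flags accept a checkpoint saved by an earlier run, is to point them at the most recent files. The sketch below mirrors only those flags in a stand-alone parser to show how such a command line would be parsed. The checkpoint paths are hypothetical, and nothing here confirms that optimizer state, the learning-rate schedule position, or the epoch counter would be restored, so this is closer to a warm start than a true resume.

```python
import argparse

# Stand-alone parser mirroring three flags from the quoted get_parser_train.
parser = argparse.ArgumentParser(description="Train")
parser.add_argument("--weight", type=str, default="", help="initial weight path")
parser.add_argument("--ema_weight", type=str, default="", help="initial ema weight path")
parser.add_argument("--epochs", type=int, default=300, help="total train epochs")

# Hypothetical checkpoint paths from an interrupted run; replace with real files.
argv = [
    "--weight", "./runs/previous_run/weights/last.ckpt",
    "--ema_weight", "./runs/previous_run/weights/EMA_last.ckpt",
    "--epochs", "300",
]
args = parser.parse_args(argv)

print("weights loaded from:    ", args.weight)
print("EMA weights loaded from:", args.ema_weight)
print("total epochs requested: ", args.epochs)
```

Because the epoch counter may start again from zero, one option is to lower `--epochs` to the number of epochs still remaining. Whether that matches the intended learning-rate schedule depends on the training configuration.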