ttxskk / AiOS

[CVPR 2024] Official Code for "AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation"
https://ttxskk.github.io/AiOS/

Error running the demo #11

Open sebastianopazo1 opened 1 month ago

sebastianopazo1 commented 1 month ago

Hi! I'm having some trouble with the demo. I installed all the required libraries; the only difference is that my CUDA version is 11.1. I'm getting the following error:

Before torch.distributed.barrier()
End torch.distributed.barrier()
Loading config file from config/aios_smplx_inference.py

sha: be1ea5a357503c2502dd4d3fe38f826bff4f4e52, status: has uncommited changes, branch: main

05/22 14:34:28.837: Command: main.py -c config/aios_smplx_inference.py --options batch_size=8 epochs=100 lr_drop=55 num_body_points=17 backbone=resnet50 --resume data/checkpoint/aios_checkpoint.pth --eval --inference --to_vid --inference_input demo/short_video.mp4 --output_dir demo/demo [05/22 14:34:28.839]: Full config saved to demo/demo/config_args_all.json [05/22 14:34:28.839]: world size: 1 [05/22 14:34:28.839]: rank: 0 [05/22 14:34:28.839]: local_rank: 0 [05/22 14:34:28.839]: args: Namespace(agora_benchmark='na', amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=8, bbox_loss_coef=5.0, bbox_ratio=1.2, body_3d_size=2, body_bbox_loss_coef=5.0, body_giou_loss_coef=2.0, body_model_test={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_model_train={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_only=True, camera_3d_size=2.5, clip_max_norm=0.1, cls_loss_coef=2.0, cls_no_bias=False, code_dir=None, config_file='config/aios_smplx_inference.py', config_path='config/aios_smplx.py', continue_train=True, cur_dir='/home/seba/Documents/AiOS/config', data_dir='/home/seba/Documents/AiOS/config/../dataset', data_strategy='balance', dataset_list=['AGORA_MM', 'BEDLAM', 'COCO_NA'], ddetr_lr_param=False, debug=False, dec_layer_number=None, dec_layers=6, dec_n_points=4, dec_pred_bbox_embed_share=False, dec_pred_class_embed_share=False, dec_pred_pose_embed_share=False, decoder_module_seq=['sa', 'ca', 'ffn'], decoder_sa_type='sa', device='cuda', dilation=False, dim_feedforward=2048, distributed=True, dln_hw_noise=0.2, dln_xy_noise=0.2, dn_attn_mask_type_list=['match2dn', 'dn2dn', 'group2group'], dn_batch_gt_fuse=False, dn_bbox_coef=0.5, dn_box_noise_scale=0.4, dn_label_coef=0.3, dn_label_noise_ratio=0.5, dn_labelbook_size=100, dn_number=100, dropout=0.0, ema_decay=0.9997, ema_epoch=0, embed_init_tgt=False, enc_layers=6, enc_loss_coef=1.0, enc_n_points=4, end_epoch=150, epochs=100, eval=True, exp_name='output/exp52/dataset_debug', face_3d_size=0.3, face_bbox_loss_coef=5.0, face_giou_loss_coef=2.0, face_keypoints_loss_coef=10.0, face_oks_loss_coef=4.0, find_unused_params=False, finetune_ignore=None, fix_refpoints_hw=-1, focal=(5000, 5000), focal_alpha=0.25, frozen_weights=None, gamma=0.1, giou_loss_coef=2.0, gpu=0, hand_3d_size=0.3, hidden_dim=256, human_model_path='data/body_models', indices_idx_list=[1, 2, 3, 4, 5, 6, 7], inference=True, inference_input='demo/short_video.mp4', input_body_shape=(256, 192), input_face_shape=(192, 192), input_hand_shape=(256, 256), interm_loss_coef=1.0, keypoints_loss_coef=10.0, lhand_bbox_loss_coef=5.0, lhand_giou_loss_coef=2.0, lhand_keypoints_loss_coef=10.0, lhand_oks_loss_coef=0.5, local_rank=0, log_dir=None, losses=['smpl_pose', 'smpl_beta', 'smpl_expr', 'smpl_kp2d', 'smpl_kp3d', 'smpl_kp3d_ra', 'labels', 'boxes', 'keypoints'], lr=1.414e-05, lr_backbone=1.414e-06, lr_backbone_names=['backbone.0'], lr_drop=55, lr_drop_list=[30, 60], lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], make_same_len=False, masks=False, match_unstable_error=False, matcher_type='HungarianMatcher', model_dir=None, modelname='aios_smplx', multi_step_lr=True, nheads=8, 
nms_iou_threshold=-1, no_aug=False, no_interm_box_loss=False, no_mmpose_keypoint_evaluator=True, num_body_points=17, num_box_decoder_layers=2, num_classes=2, num_face_points=6, num_feature_levels=4, num_group=100, num_hand_face_decoder_layers=4, num_hand_points=6, num_patterns=0, num_queries=900, num_select=50, num_workers=0, oks_loss_coef=4.0, onecyclelr=False, options={'batch_size': 8, 'epochs': 100, 'lr_drop': 55, 'num_body_points': 17, 'backbone': 'resnet50'}, output_dir='demo/demo', output_face_hm_shape=(8, 8, 8), output_hand_hm_shape=(16, 16, 16), output_hm_shape=(16, 16, 12), param_dict_type='default', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, pretrained_model_path='../output/train_gta_synbody_ft_20230410_132110/model_dump/snapshot_2.pth.tar', princpt=(96.0, 128.0), query_dim=4, random_refpoints_xy=False, rank=0, result_dir='/home/seba/Documents/AiOS/config/../exps62/result', resume='data/checkpoint/aios_checkpoint.pth', return_interm_indices=[1, 2, 3], rhand_bbox_loss_coef=5.0, rhand_giou_loss_coef=2.0, rhand_keypoints_loss_coef=10.0, rhand_oks_loss_coef=0.5, rm_detach=None, rm_self_attn_layers=None, root_dir='/home/seba/Documents/AiOS/config/..', save_checkpoint_interval=1, save_log=False, scheduler='step', seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, set_cost_keypoints=10.0, set_cost_kpvis=0.0, set_cost_oks=4.0, smpl_beta_loss_coef=0.01, smpl_body_kp2d_ba_loss_coef=0.0, smpl_body_kp2d_loss_coef=1.0, smpl_body_kp3d_loss_coef=1.0, smpl_body_kp3d_ra_loss_coef=1.0, smpl_expr_loss_coef=0.01, smpl_face_kp2d_ba_loss_coef=0.0, smpl_face_kp2d_loss_coef=0.1, smpl_face_kp3d_loss_coef=0.1, smpl_face_kp3d_ra_loss_coef=0.1, smpl_lhand_kp2d_ba_loss_coef=0.0, smpl_lhand_kp2d_loss_coef=0.5, smpl_lhand_kp3d_loss_coef=0.1, smpl_lhand_kp3d_ra_loss_coef=0.1, smpl_pose_loss_body_coef=0.1, smpl_pose_loss_jaw_coef=0.1, smpl_pose_loss_lhand_coef=0.1, smpl_pose_loss_rhand_coef=0.1, smpl_pose_loss_root_coef=1.0, smpl_rhand_kp2d_ba_loss_coef=0.0, smpl_rhand_kp2d_loss_coef=0.5, smpl_rhand_kp3d_loss_coef=0.1, smpl_rhand_kp3d_ra_loss_coef=0.1, start_epoch=0, step_size=20, strong_aug=False, test=False, test_max_size=1333, test_sample_interval=100, test_sizes=[800], testset='INFERENCE', to_vid=True, total_data_len='auto', train_batch_size=32, train_max_size=1333, train_sample_interval=10, train_sizes=[480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800], trainset_2d=[], trainset_3d=['AGORA_MM', 'BEDLAM', 'COCO_NA'], trainset_humandata=[], trainset_partition={'AGORA_MM': 0.4, 'BEDLAM': 0.7, 'COCO_NA': 1}, transformer_activation='relu', two_stage_bbox_embed_share=False, two_stage_class_embed_share=False, two_stage_default_hw=0.05, two_stage_keep_all_tokens=False, two_stage_learn_wh=False, two_stage_type='standard', use_cache=True, use_checkpoint=False, use_dn=True, use_ema=True, vis_dir=None, weight_decay=0.0001, world_size=1)

aios_smplx
Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 173, in main
    model, criterion, postprocessors, postprocessors_aios = build_model_main(
  File "main.py", line 86, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/seba/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/aios_smplx.py", line 17, in <module>
    from .transformer import build_transformer
  File "/home/seba/Documents/AiOS/models/aios/transformer.py", line 10, in <module>
    from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
  File "/home/seba/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in <module>
    from .ops.modules import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/__init__.py", line 9, in <module>
    from .ms_deform_attn import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in <module>
    from ..functions import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/__init__.py", line 9, in <module>
    from .ms_deform_attn_func import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in <module>
    import MultiScaleDeformableAttention as MSDA
ImportError: /home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4686) of binary: /home/seba/anaconda3/envs/aios/bin/python
Traceback (most recent call last):
  File "/home/seba/anaconda3/envs/aios/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-05-22_14:34:30
  host       : seba-GE66-Raider-10UH
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 4686)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thanks for your help!
WYJSJTU commented 1 month ago

Did you build deformable detr correctly with the following commands?


# build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
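
Even if the build finishes without errors, the compiled op can still be linked against the wrong libraries. As a quick sanity check (using the module name from the error message above), you can try importing it in the same environment used to launch the demo:

# print the torch / CUDA versions and try to import the compiled op;
# this reproduces the "undefined symbol" error if the build does not match the installed torch
python -c "import torch; print(torch.__version__, torch.version.cuda); import MultiScaleDeformableAttention"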
sebastianopazo1 commented 1 month ago

Thanks for your reply @WYJSJTU. I did build deformable detr correctly. I'm thinking that maybe the error is caused by the CUDA library version. Any other suggestions?

iamthephd commented 3 weeks ago

@WYJSJTU I am also getting a similar error, in the dataloader.

Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 337, in main
    inference(model,
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/engine.py", line 368, in inference
    for data_batch in metric_logger.log_every(
  File "/workspace/util/misc.py", line 246, in log_every
    for obj in iterable:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 980) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1001) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 337, in main
    inference(model,
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/engine.py", line 368, in inference
    for data_batch in metric_logger.log_every(
  File "/workspace/util/misc.py", line 246, in log_every
    for obj in iterable:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1001) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 891) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
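
The bus error message points at shared memory, so my current guesses (untested on my side) are to raise the container's shm limit or to run the dataloader without worker subprocesses:

# guesses based on the shm hint in the error above, not verified yet:
# 1) if this runs inside Docker, start the container with more shared memory
docker run --shm-size=8g ...
# 2) or avoid worker subprocesses entirely by setting num_workers to 0
#    (num_workers is already an option in the config dump earlier in this thread)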
snitchjinx commented 1 week ago

I'm also getting a similar error. It's hard to locate where the issue comes from. Any suggestions/help about this? Thanks!

aios_smplx
data/body_models
Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 173, in main
    model, criterion, postprocessors, postprocessors_aios = build_model_main(
  File "main.py", line 86, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/liujy/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/liujy/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/liujy/Documents/AiOS/models/aios/aios_smplx.py", line 19, in <module>
    from .postprocesses import PostProcess_SMPLX, PostProcess_aios
  File "/home/liujy/Documents/AiOS/models/aios/postprocesses.py", line 21, in <module>
    from util.human_models import smpl_x
  File "/home/liujy/Documents/AiOS/util/human_models.py", line 258, in <module>
    smpl_x = SMPLX()
  File "/home/liujy/Documents/AiOS/util/human_models.py", line 26, in __init__
    smplx.create(cfg.human_model_path,
  File "/home/liujy/Documents/AiOS/util/smplx/smplx/body_models.py", line 2333, in create
    raise ValueError(f'Unknown model type {model_type}, exiting!')
ValueError: Unknown model type body, exiting!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 53657) of binary: /home/liujy/Documents/AiOS/venv/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-06-26_14:36:06
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 53657)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
WYJSJTU commented 1 week ago

> Thanks for your reply @WYJSJTU. I did build deformable detr correctly. I'm thinking that maybe the error is caused by the CUDA library version. Any other suggestions?

This problem seems to be caused by mismatched PyTorch and CUDA versions, as discussed in henghuiding/MeViS issue #9.
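
If that is the case here (CUDA 11.1 with a torch wheel built for a different CUDA), one possible fix is to install a PyTorch build that matches the local CUDA toolkit and then rebuild the op from a clean tree. A rough sketch (package and directory names taken from the error message and the build commands above):

# rebuild the deformable attention op after fixing the torch/CUDA mismatch
cd models/aios/ops
rm -rf build dist *.egg-info                      # clear stale build artifacts
pip uninstall -y MultiScaleDeformableAttention    # remove the old egg if pip lists it
python setup.py build install
cd ../../..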

WYJSJTU commented 1 week ago

> human_model_path

It seems like the model_path for the smplx model does not exist. Please check the human_model_path in config/aios_smplx_inference.py, or check your body model file structure.
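
For reference, with human_model_path set to data/body_models the loader expects the SMPL-X files in an smplx subfolder; a quick check (the exact file names depend on which SMPL-X release you downloaded):

# the smplx loader joins human_model_path with the model type, so this folder must exist:
ls data/body_models/smplx
# it typically contains files like SMPLX_NEUTRAL.npz, SMPLX_MALE.npz, SMPLX_FEMALE.npz;
# if data/body_models is missing, smplx.create() falls back to parsing the path name
# ("body_models" -> "body"), which produces exactly the "Unknown model type body" error above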

WYJSJTU commented 1 week ago

> It seems like the model_path for the smplx model does not exist. Please check the human_model_path in config/aios_smplx_inference.py, or check your body model file structure.

It might also be an issue with the video path; make sure to put the video you want to run under the demo/short_video_out directory.