sebastianopazo1 opened this issue 1 month ago
Did you correctly build deformable detr with the following command:

```bash
# build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
```
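If the build finished without errors, a quick sanity check (just a sketch, run in the same Python environment that will run the demo) is to import the compiled op directly; this is the same import that models/aios/ops/functions/ms_deform_attn_func.py performs:

```python
# Minimal sanity check: import the op that setup.py builds.
# If this raises ImportError (e.g. an undefined-symbol error), the extension
# was compiled against a different PyTorch/CUDA than the one installed now.
import MultiScaleDeformableAttention as MSDA
print("MSDeformAttn extension loaded from:", MSDA.__file__)
```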
Thanks for your reply @WYJSJTU. I did build deformable detr correctly. I'm thinking that maybe the error comes from the version of the CUDA library. Any other suggestions?
@WYJSJTU I am also getting a similar error in the dataloader.
Traceback (most recent call last):
File "main.py", line 437, in <module>
main(args)
File "main.py", line 337, in main
inference(model,
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/workspace/engine.py", line 368, in inference
for data_batch in metric_logger.log_every(
File "/workspace/util/misc.py", line 246, in log_every
for obj in iterable:
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 980) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1001) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 437, in <module>
main(args)
File "main.py", line 337, in main
inference(model,
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/workspace/engine.py", line 368, in inference
for data_batch in metric_logger.log_every(
File "/workspace/util/misc.py", line 246, in log_every
for obj in iterable:
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1001) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 891) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Failures:

I'm also getting a similar error. It's hard to locate where the issue comes from. Any suggestions or help with this? Thanks!
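For the DataLoader crash above, the error message itself points at shared memory: worker processes pass batches through /dev/shm, and inside a Docker container the default 64 MB is easily exhausted. Raising the container's shared memory (e.g. docker run --shm-size=8g, or --ipc=host) usually fixes it; as a quick workaround you can also disable worker processes. A minimal sketch of the workaround, using a placeholder dataset rather than the repo's own loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(32, 3, 224, 224))  # placeholder dataset

# num_workers=0 keeps data loading in the main process, so no shared-memory
# segments are needed; slower, but it avoids the "Bus error" from a small /dev/shm.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for (batch,) in loader:
    print(batch.shape)
    break
```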
This problem seems to be caused by mismatched PyTorch and CUDA versions, as discussed in henghuiding/MeViS issue #9.
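A quick way to confirm the mismatch before rebuilding the op (just a check script, not part of the repo) is to compare the CUDA version PyTorch was built with against the nvcc on your PATH:

```python
import subprocess
import torch

# CUDA version PyTorch was compiled against; the nvcc used to build
# MultiScaleDeformableAttention should match it (at least in major version).
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)

try:
    out = subprocess.check_output(["nvcc", "--version"], text=True)
    print([line for line in out.splitlines() if "release" in line])
except (OSError, subprocess.CalledProcessError):
    print("nvcc not found on PATH")
```

If the two disagree, rebuilding the op inside the final environment (the setup.py commands above) typically resolves the undefined-symbol ImportError.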
It seems like your model_path for the smplx model does not exist. Please check your human_model_path in config/aios_smplx_inference.py, or check your body model file structure.
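A quick way to verify the layout before rerunning (paths as they appear in the config dump below; the snippet itself is just an illustration):

```python
import os

# Paths as configured in config/aios_smplx_inference.py
# (human_model_path and the smplx model_path underneath it).
human_model_path = "data/body_models"
smplx_dir = os.path.join(human_model_path, "smplx")

print("human_model_path exists:", os.path.isdir(human_model_path))
print("smplx model dir exists: ", os.path.isdir(smplx_dir))
if os.path.isdir(smplx_dir):
    # The smplx loader expects the model files (e.g. SMPLX_NEUTRAL.npz) here.
    print("contents:", sorted(os.listdir(smplx_dir)))
```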
It might be an issue with the video path; make sure to put the video you want to run under the demo/short_video_out directory.
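For reference, a one-liner to confirm the video is where the demo expects it (directory name taken from the reply above):

```python
import glob

# List whatever is currently under demo/short_video_out/.
print(sorted(glob.glob("demo/short_video_out/*")))
```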
Hi! I'm having some trouble with the demo. I installed all the required libraries, except that my CUDA version is 11.1. I'm getting the following error.
Before torch.distributed.barrier()
End torch.distributed.barrier()
Loading config file from config/aios_smplx_inference.py
sha: be1ea5a357503c2502dd4d3fe38f826bff4f4e52, status: has uncommited changes, branch: main
05/22 14:34:28.837: Command: main.py -c config/aios_smplx_inference.py --options batch_size=8 epochs=100 lr_drop=55 num_body_points=17 backbone=resnet50 --resume data/checkpoint/aios_checkpoint.pth --eval --inference --to_vid --inference_input demo/short_video.mp4 --output_dir demo/demo [05/22 14:34:28.839]: Full config saved to demo/demo/config_args_all.json [05/22 14:34:28.839]: world size: 1 [05/22 14:34:28.839]: rank: 0 [05/22 14:34:28.839]: local_rank: 0 [05/22 14:34:28.839]: args: Namespace(agora_benchmark='na', amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=8, bbox_loss_coef=5.0, bbox_ratio=1.2, body_3d_size=2, body_bbox_loss_coef=5.0, body_giou_loss_coef=2.0, body_model_test={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_model_train={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_only=True, camera_3d_size=2.5, clip_max_norm=0.1, cls_loss_coef=2.0, cls_no_bias=False, code_dir=None, config_file='config/aios_smplx_inference.py', config_path='config/aios_smplx.py', continue_train=True, cur_dir='/home/seba/Documents/AiOS/config', data_dir='/home/seba/Documents/AiOS/config/../dataset', data_strategy='balance', dataset_list=['AGORA_MM', 'BEDLAM', 'COCO_NA'], ddetr_lr_param=False, debug=False, dec_layer_number=None, dec_layers=6, dec_n_points=4, dec_pred_bbox_embed_share=False, dec_pred_class_embed_share=False, dec_pred_pose_embed_share=False, decoder_module_seq=['sa', 'ca', 'ffn'], decoder_sa_type='sa', device='cuda', dilation=False, dim_feedforward=2048, distributed=True, dln_hw_noise=0.2, dln_xy_noise=0.2, dn_attn_mask_type_list=['match2dn', 'dn2dn', 'group2group'], dn_batch_gt_fuse=False, dn_bbox_coef=0.5, dn_box_noise_scale=0.4, dn_label_coef=0.3, dn_label_noise_ratio=0.5, dn_labelbook_size=100, dn_number=100, dropout=0.0, ema_decay=0.9997, ema_epoch=0, embed_init_tgt=False, enc_layers=6, enc_loss_coef=1.0, enc_n_points=4, end_epoch=150, epochs=100, eval=True, exp_name='output/exp52/dataset_debug', face_3d_size=0.3, face_bbox_loss_coef=5.0, face_giou_loss_coef=2.0, face_keypoints_loss_coef=10.0, face_oks_loss_coef=4.0, find_unused_params=False, finetune_ignore=None, fix_refpoints_hw=-1, focal=(5000, 5000), focal_alpha=0.25, frozen_weights=None, gamma=0.1, giou_loss_coef=2.0, gpu=0, hand_3d_size=0.3, hidden_dim=256, human_model_path='data/body_models', indices_idx_list=[1, 2, 3, 4, 5, 6, 7], inference=True, inference_input='demo/short_video.mp4', input_body_shape=(256, 192), input_face_shape=(192, 192), input_hand_shape=(256, 256), interm_loss_coef=1.0, keypoints_loss_coef=10.0, lhand_bbox_loss_coef=5.0, lhand_giou_loss_coef=2.0, lhand_keypoints_loss_coef=10.0, lhand_oks_loss_coef=0.5, local_rank=0, log_dir=None, losses=['smpl_pose', 'smpl_beta', 'smpl_expr', 'smpl_kp2d', 'smpl_kp3d', 'smpl_kp3d_ra', 'labels', 'boxes', 'keypoints'], lr=1.414e-05, lr_backbone=1.414e-06, lr_backbone_names=['backbone.0'], lr_drop=55, lr_drop_list=[30, 60], lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], make_same_len=False, masks=False, match_unstable_error=False, matcher_type='HungarianMatcher', model_dir=None, modelname='aios_smplx', multi_step_lr=True, nheads=8, 
nms_iou_threshold=-1, no_aug=False, no_interm_box_loss=False, no_mmpose_keypoint_evaluator=True, num_body_points=17, num_box_decoder_layers=2, num_classes=2, num_face_points=6, num_feature_levels=4, num_group=100, num_hand_face_decoder_layers=4, num_hand_points=6, num_patterns=0, num_queries=900, num_select=50, num_workers=0, oks_loss_coef=4.0, onecyclelr=False, options={'batch_size': 8, 'epochs': 100, 'lr_drop': 55, 'num_body_points': 17, 'backbone': 'resnet50'}, output_dir='demo/demo', output_face_hm_shape=(8, 8, 8), output_hand_hm_shape=(16, 16, 16), output_hm_shape=(16, 16, 12), param_dict_type='default', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, pretrained_model_path='../output/train_gta_synbody_ft_20230410_132110/model_dump/snapshot_2.pth.tar', princpt=(96.0, 128.0), query_dim=4, random_refpoints_xy=False, rank=0, result_dir='/home/seba/Documents/AiOS/config/../exps62/result', resume='data/checkpoint/aios_checkpoint.pth', return_interm_indices=[1, 2, 3], rhand_bbox_loss_coef=5.0, rhand_giou_loss_coef=2.0, rhand_keypoints_loss_coef=10.0, rhand_oks_loss_coef=0.5, rm_detach=None, rm_self_attn_layers=None, root_dir='/home/seba/Documents/AiOS/config/..', save_checkpoint_interval=1, save_log=False, scheduler='step', seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, set_cost_keypoints=10.0, set_cost_kpvis=0.0, set_cost_oks=4.0, smpl_beta_loss_coef=0.01, smpl_body_kp2d_ba_loss_coef=0.0, smpl_body_kp2d_loss_coef=1.0, smpl_body_kp3d_loss_coef=1.0, smpl_body_kp3d_ra_loss_coef=1.0, smpl_expr_loss_coef=0.01, smpl_face_kp2d_ba_loss_coef=0.0, smpl_face_kp2d_loss_coef=0.1, smpl_face_kp3d_loss_coef=0.1, smpl_face_kp3d_ra_loss_coef=0.1, smpl_lhand_kp2d_ba_loss_coef=0.0, smpl_lhand_kp2d_loss_coef=0.5, smpl_lhand_kp3d_loss_coef=0.1, smpl_lhand_kp3d_ra_loss_coef=0.1, smpl_pose_loss_body_coef=0.1, smpl_pose_loss_jaw_coef=0.1, smpl_pose_loss_lhand_coef=0.1, smpl_pose_loss_rhand_coef=0.1, smpl_pose_loss_root_coef=1.0, smpl_rhand_kp2d_ba_loss_coef=0.0, smpl_rhand_kp2d_loss_coef=0.5, smpl_rhand_kp3d_loss_coef=0.1, smpl_rhand_kp3d_ra_loss_coef=0.1, start_epoch=0, step_size=20, strong_aug=False, test=False, test_max_size=1333, test_sample_interval=100, test_sizes=[800], testset='INFERENCE', to_vid=True, total_data_len='auto', train_batch_size=32, train_max_size=1333, train_sample_interval=10, train_sizes=[480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800], trainset_2d=[], trainset_3d=['AGORA_MM', 'BEDLAM', 'COCO_NA'], trainset_humandata=[], trainset_partition={'AGORA_MM': 0.4, 'BEDLAM': 0.7, 'COCO_NA': 1}, transformer_activation='relu', two_stage_bbox_embed_share=False, two_stage_class_embed_share=False, two_stage_default_hw=0.05, two_stage_keep_all_tokens=False, two_stage_learn_wh=False, two_stage_type='standard', use_cache=True, use_checkpoint=False, use_dn=True, use_ema=True, vis_dir=None, weight_decay=0.0001, world_size=1)
aios_smplx
Traceback (most recent call last):
File "main.py", line 437, in <module>
main(args)
File "main.py", line 173, in main
model, criterion, postprocessors, postprocessors_aios = build_model_main(
File "main.py", line 86, in build_model_main
from models.registry import MODULE_BUILD_FUNCS
File "/home/seba/Documents/AiOS/models/init.py", line 1, in
from .aios import build_aios_smplx
File "/home/seba/Documents/AiOS/models/aios/init.py", line 1, in
from .aios_smplx import build_aios_smplx
File "/home/seba/Documents/AiOS/models/aios/aios_smplx.py", line 17, in
from .transformer import build_transformer
File "/home/seba/Documents/AiOS/models/aios/transformer.py", line 10, in
from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
File "/home/seba/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in
from .ops.modules import MSDeformAttn
File "/home/seba/Documents/AiOS/models/aios/ops/modules/init.py", line 9, in
from .ms_deform_attn import MSDeformAttn
File "/home/seba/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in
from ..functions import MSDeformAttnFunction
File "/home/seba/Documents/AiOS/models/aios/ops/functions/init.py", line 9, in
from .ms_deform_attn_func import MSDeformAttnFunction
File "/home/seba/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in
import MultiScaleDeformableAttention as MSDA
ImportError: /home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4686) of binary: /home/seba/anaconda3/envs/aios/bin/python
Traceback (most recent call last):
File "/home/seba/anaconda3/envs/aios/bin/torchrun", line 8, in
sys.exit(main())
File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
main.py FAILED
Failures: