Project hangs after running python app.py --transport webrtc

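After setting up the environment on a CentOS 7 server, I run the project with python app.py --transport webrtc. It never prints start websocket server and stays stuck at the output below. Does anyone know what is causing this?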

Environment: CentOS 7, CUDA 11.8, PyTorch 2.1.2, PyTorch3D 0.7.5

(nerfstream) [root@centos metahuman-stream]# python app.py --transport webrtc
Namespace(pose='data/data_kf.json', au='data/au.csv', torso_imgs='', O=False, data_range=[0, -1], workspace='data/video', seed=0, ckpt='data/pretrained/ngp_kf.pth', num_rays=65536, cuda_ray=True, max_steps=16, num_steps=16, upsample_steps=0, update_extra_interval=16, max_ray_batch=4096, warmup_step=10000, amb_aud_loss=1, amb_eye_loss=1, unc_loss=1, lambda_amb=0.0001, fp16=True, bg_img='white', fbg=False, exp_eye=True, fix_eye=-1, smooth_eye=True, torso_shrink=0.8, color_space='srgb', preload=0, bound=1, scale=4, offset=[0, 0, 0], dt_gamma=0.00390625, min_near=0.05, density_thresh=10, density_thresh_torso=0.01, patch_size=1, init_lips=False, finetune_lips=False, smooth_lips=True, torso=True, head_ckpt='', gui=False, W=450, H=450, radius=3.35, fovy=21.24, max_spp=1, att=2, aud='', emb=False, ind_dim=4, ind_num=10000, ind_dim_torso=8, amb_dim=2, part=False, part2=False, train_camera=False, smooth_path=True, smooth_path_window=7, asr=True, asr_wav='', asr_play=False, asr_model='cpierse/wav2vec2-large-xlsr-53-esperanto', asr_save_feats=False, fps=50, l=10, m=8, r=10, fullbody=False, fullbody_img='data/fullbody/img', fullbody_width=580, fullbody_height=1080, fullbody_offset_x=0, fullbody_offset_y=0, avatar_id='avator_1', bbox_shift=5, batch_size=16, customvideo=False, customvideo_img='data/customvideo/img', customvideo_imgnum=1, tts='edgetts', REF_FILE=None, REF_TEXT=None, TTS_SERVER='http://127.0.0.1:9880', model='ernerf', transport='webrtc', push_url='http://localhost:1985/rtc/v1/whip/?app=live&stream=livestream', listenport=8010, test=True, test_train=False)
NeRFNetwork(
  (audio_net): AudioNet(
    (encoder_conv): Sequential(
      (0): Conv1d(44, 32, kernel_size=(3,), stride=(2,), padding=(1,))
      (1): LeakyReLU(negative_slope=0.02, inplace=True)
      (2): Conv1d(32, 32, kernel_size=(3,), stride=(2,), padding=(1,))
      (3): LeakyReLU(negative_slope=0.02, inplace=True)
      (4): Conv1d(32, 64, kernel_size=(3,), stride=(2,), padding=(1,))
      (5): LeakyReLU(negative_slope=0.02, inplace=True)
      (6): Conv1d(64, 64, kernel_size=(3,), stride=(2,), padding=(1,))
      (7): LeakyReLU(negative_slope=0.02, inplace=True)
    )
    (encoder_fc1): Sequential(
      (0): Linear(in_features=64, out_features=64, bias=True)
      (1): LeakyReLU(negative_slope=0.02, inplace=True)
      (2): Linear(in_features=64, out_features=32, bias=True)
    )
  )
  (audio_att_net): AudioAttNet(
    (attentionConvNet): Sequential(
      (0): Conv1d(32, 16, kernel_size=(3,), stride=(1,), padding=(1,))
      (1): LeakyReLU(negative_slope=0.02, inplace=True)
      (2): Conv1d(16, 8, kernel_size=(3,), stride=(1,), padding=(1,))
      (3): LeakyReLU(negative_slope=0.02, inplace=True)
      (4): Conv1d(8, 4, kernel_size=(3,), stride=(1,), padding=(1,))
      (5): LeakyReLU(negative_slope=0.02, inplace=True)
      (6): Conv1d(4, 2, kernel_size=(3,), stride=(1,), padding=(1,))
      (7): LeakyReLU(negative_slope=0.02, inplace=True)
      (8): Conv1d(2, 1, kernel_size=(3,), stride=(1,), padding=(1,))
      (9): LeakyReLU(negative_slope=0.02, inplace=True)
    )
    (attentionNet): Sequential(
      (0): Linear(in_features=8, out_features=8, bias=True)
      (1): Softmax(dim=1)
    )
  )
  (encoder_xy): GridEncoder: input_dim=2 num_levels=12 level_dim=1 resolution=64 -> 512 per_level_scale=1.2081 params=(163584, 1) gridtype=hash align_corners=False
  (encoder_yz): GridEncoder: input_dim=2 num_levels=12 level_dim=1 resolution=64 -> 512 per_level_scale=1.2081 params=(163584, 1) gridtype=hash align_corners=False
  (encoder_xz): GridEncoder: input_dim=2 num_levels=12 level_dim=1 resolution=64 -> 512 per_level_scale=1.2081 params=(163584, 1) gridtype=hash align_corners=False
  (eye_att_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=36, out_features=16, bias=False)
      (1): Linear(in_features=16, out_features=1, bias=False)
    )
  )
  (sigma_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=69, out_features=64, bias=False)
      (1): Linear(in_features=64, out_features=64, bias=False)
      (2): Linear(in_features=64, out_features=65, bias=False)
    )
  )
  (encoder_dir): SHEncoder: input_dim=3 degree=4
  (color_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=84, out_features=64, bias=False)
      (1): Linear(in_features=64, out_features=3, bias=False)
    )
  )
  (unc_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=36, out_features=32, bias=False)
      (1): Linear(in_features=32, out_features=1, bias=False)
    )
  )
  (aud_ch_att_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=36, out_features=64, bias=False)
      (1): Linear(in_features=64, out_features=32, bias=False)
    )
  )
  (torso_deform_encoder): FreqEncoder: input_dim=2 degree=8 output_dim=34
  (anchor_encoder): FreqEncoder: input_dim=6 degree=3 output_dim=42
  (torso_deform_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=84, out_features=32, bias=False)
      (1): Linear(in_features=32, out_features=32, bias=False)
      (2): Linear(in_features=32, out_features=2, bias=False)
    )
  )
  (torso_encoder): GridEncoder: input_dim=2 num_levels=16 level_dim=2 resolution=16 -> 2048 per_level_scale=1.3819 params=(555520, 2) gridtype=tiled align_corners=False
  (torso_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=116, out_features=32, bias=False)
      (1): Linear(in_features=32, out_features=32, bias=False)
      (2): Linear(in_features=32, out_features=4, bias=False)
    )
  )
)
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/usr/local/bin/miniconda3/envs/nerfstream/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/bin/miniconda3/envs/nerfstream/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=AlexNet_Weights.IMAGENET1K_V1. You can also use weights=AlexNet_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /usr/local/bin/miniconda3/envs/nerfstream/lib/python3.10/site-packages/lpips/weights/v0.1/alex.pth
[INFO] Trainer: ngp | 2024-07-15_09-48-25 | cuda | fp16 | data/video
[INFO] #parameters: 1789121
[INFO] Loading data/pretrained/ngp_kf.pth ...
[INFO] loaded model.
[INFO] load at epoch 28, global step 203616
[WARN] Failed to load optimizer.
[INFO] loaded scheduler.
[INFO] loaded scaler.
[INFO] load 7272 frames.
Loading data: 100%|████████████████████████████████████████| 7272/7272 [00:00<00:00, 63322.74it/s]
[INFO] eye_area: 0.0 - 1.0
[INFO] loading ASR model cpierse/wav2vec2-large-xlsr-53-esperanto...
/usr/local/bin/miniconda3/envs/nerfstream/lib/python3.10/site-packages/transformers/configuration_utils.py:364: UserWarning: Passing gradient_checkpointing to a config initialization is deprecated and will be removed in v5 Transformers. Using model.gradient_checkpointing_enable() instead, or if you are using the Trainer API, pass gradient_checkpointing=True in your TrainingArguments.
  warnings.warn(
Some weights of the model checkpoint at cpierse/wav2vec2-large-xlsr-53-esperanto were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at cpierse/wav2vec2-large-xlsr-53-esperanto and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO] warm up ASR live model, expected latency = 0.560000s
[INFO] warm-up done, actual latency = 0.292924s
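The last line printed is the ASR warm-up message, so the stall happens somewhere after that point. To see which call the process is sitting in, a stack dump may help. Below is a minimal diagnostic sketch using Python's standard faulthandler module; it is not a fix. The helper names arm_hang_watchdog and disarm_hang_watchdog are hypothetical, and where to call them inside app.py is an assumption, since the startup code is not shown here.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, stream=sys.stderr):
    """Dump the traceback of every thread to `stream` if the process is
    still running `timeout_s` seconds from now.

    `stream` must be a real file object (it needs a file descriptor).
    Hypothetical helper name, for illustration only.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream)


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. once the server has started listening."""
    faulthandler.cancel_dump_traceback_later()
```

Assumed usage: call arm_hang_watchdog(60) near the top of app.py, and disarm_hang_watchdog() right after the point where start websocket server would be printed. If startup stalls, the dump printed after 60 seconds shows the file and line each thread is blocked on, which can be pasted into this issue.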

lipku / metahuman-stream, issue #149