Fail to reproduce the result in the paper

Thanks for sharing the code.

I ran the training code on a single RTX 4090 and reduced the number of epochs to one fourth of the default, from 40 to 10 (line 17 of remodiffuse_t2m.py):

runner = dict(type='EpochBasedRunner', max_epochs=10)

However, the resulting FID does not match the value in the paper. The paper reports an FID of 0.103 and a top-3 R-precision of 0.795 on HumanML3D. When I tested the resulting checkpoints (with 1 replication to save time), the FID only reached about 0.59, which is far from the paper's value (this may not be caused by the number of replications). What went wrong?

################### result of checkpoint on epoch_10 ##########################

R_precision Top 1 (mean) : 0.4670

R_precision Top 1 (conf) : 0.0000

R_precision Top 2 (mean) : 0.6550

R_precision Top 2 (conf) : 0.0000

R_precision Top 3 (mean) : 0.7511

R_precision Top 3 (conf) : 0.0000

Matching Score (mean) : 3.2621

Matching Score (conf) : 0.0000

FID (mean) : 0.5952

FID (conf) : 0.0000

Diversity (mean) : 8.4494

Diversity (conf) : 0.0000

MultiModality (mean) : 2.7630

MultiModality (conf) : 0.0000

################### result of checkpoint on epoch_9 ##########################

R_precision Top 1 (mean) : 0.4700

R_precision Top 1 (conf) : 0.0000

R_precision Top 2 (mean) : 0.6567

R_precision Top 2 (conf) : 0.0000

R_precision Top 3 (mean) : 0.7577

R_precision Top 3 (conf) : 0.0000

Matching Score (mean) : 3.2448

Matching Score (conf) : 0.0000

FID (mean) : 0.6071

FID (conf) : 0.0000

Diversity (mean) : 8.9575

Diversity (conf) : 0.0000

MultiModality (mean) : 2.4209

MultiModality (conf) : 0.0000

################### result of checkpoint on epoch_8 ##########################

R_precision Top 1 (mean) : 0.4658

R_precision Top 1 (conf) : 0.0000

R_precision Top 2 (mean) : 0.6533

R_precision Top 2 (conf) : 0.0000

R_precision Top 3 (mean) : 0.7532

R_precision Top 3 (conf) : 0.0000

Matching Score (mean) : 3.2461

Matching Score (conf) : 0.0000

FID (mean) : 0.5870

FID (conf) : 0.0000

Diversity (mean) : 9.0003

Diversity (conf) : 0.0000

MultiModality (mean) : 3.1988

MultiModality (conf) : 0.0000
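
A note on the `(conf) : 0.0000` lines above: with `replication_times=1` there is only one sample per metric, so any spread-based interval collapses to zero. A minimal sketch, assuming the evaluator reports the mean and a 95% confidence half-width of 1.96 * std / sqrt(n) over replications (a common convention in text-to-motion evaluation code; the helper name `summarize` is hypothetical and not taken from the repository):

```python
import math
import statistics


def summarize(values):
    """Return (mean, 95% confidence half-width) over replications."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population standard deviation; it is exactly 0 when n == 1.
    std = statistics.pstdev(values)
    conf = 1.96 * std / math.sqrt(n)
    return mean, conf


# One replication, as in the runs above: the interval is zero by construction.
print(summarize([0.5952]))
```

With more replications the interval becomes non-zero, which makes it easier to judge whether a gap to the paper is within run-to-run noise.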

The training log is as follows:


2023-09-12 17:43:16,674 - mogen - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.18 | packaged by conda-forge | (main, Aug 30 2023, 03:49:32) [GCC 12.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.13.1
OpenCV: 4.8.0
MMCV: 1.7.1
MMCV Compiler: GCC 9.4
MMCV CUDA Compiler: 11.3
mogen: 0.0.1+e7458d7
------------------------------------------------------------

2023-09-12 17:43:16,674 - mogen - INFO - Distributed training: False
2023-09-12 17:43:16,714 - mogen - INFO - Config:
data_keys = ['motion', 'motion_mask', 'motion_length', 'clip_feat']
meta_keys = ['text', 'token']
train_pipeline = [
    dict(
        type='Normalize',
        mean_path='data/datasets/human_ml3d/mean.npy',
        std_path='data/datasets/human_ml3d/std.npy'),
    dict(type='Crop', crop_size=196),
    dict(
        type='ToTensor',
        keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
    dict(
        type='Collect',
        keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
        meta_keys=['text', 'token'])
]
data = dict(
    samples_per_gpu=224,
    workers_per_gpu=1,
    train=dict(
        type='RepeatDataset',
        dataset=dict(
            type='TextMotionDataset',
            dataset_name='human_ml3d',
            data_prefix='data',
            pipeline=[
                dict(
                    type='Normalize',
                    mean_path='data/datasets/human_ml3d/mean.npy',
                    std_path='data/datasets/human_ml3d/std.npy'),
                dict(type='Crop', crop_size=196),
                dict(
                    type='ToTensor',
                    keys=[
                        'motion', 'motion_mask', 'motion_length', 'clip_feat'
                    ]),
                dict(
                    type='Collect',
                    keys=[
                        'motion', 'motion_mask', 'motion_length', 'clip_feat'
                    ],
                    meta_keys=['text', 'token'])
            ],
            ann_file='train.txt',
            motion_dir='motions',
            text_dir='texts',
            token_dir='tokens',
            clip_feat_dir='clip_feats'),
        times=200),
    test=dict(
        type='TextMotionDataset',
        dataset_name='human_ml3d',
        data_prefix='data',
        pipeline=[
            dict(
                type='Normalize',
                mean_path='data/datasets/human_ml3d/mean.npy',
                std_path='data/datasets/human_ml3d/std.npy'),
            dict(type='Crop', crop_size=196),
            dict(
                type='ToTensor',
                keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
            dict(
                type='Collect',
                keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
                meta_keys=['text', 'token'])
        ],
        ann_file='test.txt',
        motion_dir='motions',
        text_dir='texts',
        token_dir='tokens',
        clip_feat_dir='clip_feats',
        eval_cfg=dict(
            shuffle_indexes=True,
            replication_times=1,
            replication_reduction='statistics',
            text_encoder_name='human_ml3d',
            text_encoder_path='data/evaluators/human_ml3d/finest.tar',
            motion_encoder_name='human_ml3d',
            motion_encoder_path='data/evaluators/human_ml3d/finest.tar',
            metrics=[
                dict(type='R Precision', batch_size=32, top_k=3),
                dict(type='Matching Score', batch_size=32),
                dict(type='FID'),
                dict(type='Diversity', num_samples=300),
                dict(
                    type='MultiModality',
                    num_samples=100,
                    num_repeats=30,
                    num_picks=10)
            ]),
        test_mode=True))
checkpoint_config = dict(interval=1)
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = 'logs/test/latest.pth'
workflow = [('train', 1)]
optimizer = dict(type='Adam', lr=0.0002)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='CosineAnnealing', min_lr_ratio=2e-05, by_epoch=False)
runner = dict(type='EpochBasedRunner', max_epochs=10)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
input_feats = 263
max_seq_len = 196
latent_dim = 512
time_embed_dim = 2048
text_latent_dim = 256
ff_size = 1024
num_heads = 8
dropout = 0
model = dict(
    type='MotionDiffusion',
    model=dict(
        type='ReMoDiffuseTransformer',
        input_feats=263,
        max_seq_len=196,
        latent_dim=512,
        time_embed_dim=2048,
        num_layers=4,
        ca_block_cfg=dict(
            type='SemanticsModulatedAttention',
            latent_dim=512,
            text_latent_dim=256,
            num_heads=8,
            dropout=0,
            time_embed_dim=2048),
        ffn_cfg=dict(
            latent_dim=512, ffn_dim=1024, dropout=0, time_embed_dim=2048),
        text_encoder=dict(
            pretrained_model='clip',
            latent_dim=256,
            num_layers=2,
            ff_size=2048,
            dropout=0,
            use_text_proj=False),
        retrieval_cfg=dict(
            num_retrieval=2,
            stride=4,
            num_layers=2,
            num_motion_layers=2,
            kinematic_coef=0.1,
            topk=2,
            retrieval_file='data/database/t2m_text_train.npz',
            latent_dim=512,
            output_dim=512,
            max_seq_len=196,
            num_heads=8,
            ff_size=1024,
            dropout=0,
            ffn_cfg=dict(latent_dim=512, ffn_dim=1024, dropout=0),
            sa_block_cfg=dict(
                type='EfficientSelfAttention',
                latent_dim=512,
                num_heads=8,
                dropout=0)),
        scale_func_cfg=dict(
            coarse_scale=6.5,
            both_coef=0.52351,
            text_coef=-0.28419,
            retr_coef=2.39872)),
    loss_recon=dict(type='MSELoss', loss_weight=1, reduction='none'),
    diffusion_train=dict(
        beta_scheduler='linear',
        diffusion_steps=1000,
        model_mean_type='start_x',
        model_var_type='fixed_large'),
    diffusion_test=dict(
        beta_scheduler='linear',
        diffusion_steps=1000,
        model_mean_type='start_x',
        model_var_type='fixed_large',
        respace='15,15,8,6,6'),
    inference_type='ddim')
work_dir = 'mylogs'
gpu_ids = range(0, 1)

2023-09-12 17:43:57,044 - mogen - INFO - load checkpoint from local path: logs/test/latest.pth
2023-09-12 17:43:57,540 - mogen - INFO - resumed epoch 7, iter 150724
2023-09-12 17:43:57,540 - mogen - INFO - Start running, host: panxiaoyu@panxiaoyu-System-Product-Name, work_dir: /home/panxiaoyu/Code/ReMoDiffuse/mylogs
2023-09-12 17:43:57,541 - mogen - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH   ) CosineAnnealingLrUpdaterHook       
(NORMAL      ) CheckpointHook                     
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_epoch:
(VERY_HIGH   ) CosineAnnealingLrUpdaterHook       
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_train_iter:
(VERY_HIGH   ) CosineAnnealingLrUpdaterHook       
(LOW         ) IterTimerHook                      
 -------------------- 
after_train_iter:
(ABOVE_NORMAL) OptimizerHook                      
(NORMAL      ) CheckpointHook                     
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_train_epoch:
(NORMAL      ) CheckpointHook                     
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_epoch:
(LOW         ) IterTimerHook                      
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
before_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_iter:
(LOW         ) IterTimerHook                      
 -------------------- 
after_val_epoch:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
after_run:
(VERY_LOW    ) TextLoggerHook                     
 -------------------- 
2023-09-12 17:43:57,541 - mogen - INFO - workflow: [('train', 1)], max: 10 epochs
2023-09-12 17:43:57,541 - mogen - INFO - Checkpoints will be saved to /home/panxiaoyu/Code/ReMoDiffuse/mylogs by HardDiskBackend.
2023-09-12 17:48:32,504 - mogen - INFO - Epoch [8][50/21917]    lr: 4.433e-05, eta: 4 days, 8:28:45, time: 5.499, data_time: 0.051, memory: 19100, recon_loss: 0.5646, loss: 0.5646
.....
2023-09-13 01:50:36,371 - mogen - INFO - Epoch [10][21500/21917]    lr: 1.035e-07, eta: 0:23:10, time: 0.430, data_time: 0.006, memory: 19100, recon_loss: 0.4819, loss: 0.4819
2023-09-13 01:50:57,819 - mogen - INFO - Epoch [10][21550/21917]    lr: 1.004e-07, eta: 0:22:47, time: 0.429, data_time: 0.006, memory: 19100, recon_loss: 0.5282, loss: 0.5282
2023-09-13 01:51:19,415 - mogen - INFO - Epoch [10][21600/21917]    lr: 9.725e-08, eta: 0:22:25, time: 0.432, data_time: 0.006, memory: 19100, recon_loss: 0.5114, loss: 0.5114
2023-09-13 01:51:40,978 - mogen - INFO - Epoch [10][21650/21917]    lr: 9.418e-08, eta: 0:22:02, time: 0.431, data_time: 0.006, memory: 19100, recon_loss: 0.5182, loss: 0.5182
2023-09-13 01:52:02,602 - mogen - INFO - Epoch [10][21700/21917]    lr: 9.116e-08, eta: 0:21:40, time: 0.432, data_time: 0.006, memory: 19100, recon_loss: 0.5233, loss: 0.5233
2023-09-13 01:52:24,239 - mogen - INFO - Epoch [10][21750/21917]    lr: 8.819e-08, eta: 0:21:18, time: 0.433, data_time: 0.006, memory: 19100, recon_loss: 0.5197, loss: 0.5197
2023-09-13 01:52:45,771 - mogen - INFO - Epoch [10][21800/21917]    lr: 8.528e-08, eta: 0:20:55, time: 0.431, data_time: 0.006, memory: 19100, recon_loss: 0.5126, loss: 0.5126
2023-09-13 01:53:07,273 - mogen - INFO - Epoch [10][21850/21917]    lr: 8.242e-08, eta: 0:20:33, time: 0.430, data_time: 0.006, memory: 19100, recon_loss: 0.5422, loss: 0.5422
2023-09-13 01:53:28,876 - mogen - INFO - Epoch [10][21900/21917]    lr: 7.960e-08, eta: 0:20:11, time: 0.432, data_time: 0.006, memory: 19100, recon_loss: 0.5117, loss: 0.5117
2023-09-13 01:53:35,959 - mogen - INFO - Saving checkpoint at 10 epochs

Hi, I noticed that you doubled the batch size, which reduces the number of iterations. From my log (I trained with 8 GPUs, a total batch size of 1024, and 40 epochs), the loss in the last iterations should be below 0.300. It seems that your model is under-fitted. Could you change the batch size back to the original value, or increase the number of epochs to 20?
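
To make the batch size and epoch trade-off concrete, here is a rough sketch of the arithmetic. It assumes about 24,546 training samples, a figure inferred from the log line `Epoch [8][50/21917]` together with `samples_per_gpu=224` and `times=200`, not taken from the dataset itself; the helper name `training_budget` is hypothetical, and partial last batches are handled only approximately.

```python
import math


def training_budget(num_samples, repeat_times, samples_per_gpu, num_gpus, max_epochs):
    """Estimate optimizer steps and samples processed for one training run."""
    global_batch = samples_per_gpu * num_gpus
    # RepeatDataset makes one "epoch" cover the dataset repeat_times times.
    samples_per_epoch = num_samples * repeat_times
    iters_per_epoch = math.ceil(samples_per_epoch / global_batch)
    return {
        "global_batch": global_batch,
        "iters_per_epoch": iters_per_epoch,
        "total_iters": iters_per_epoch * max_epochs,
        "samples_seen": samples_per_epoch * max_epochs,
    }


NUM_SAMPLES = 24546  # assumption: inferred from the log, see the note above

single_gpu = training_budget(NUM_SAMPLES, 200, 224, 1, 10)   # run from the log
default_bs = training_budget(NUM_SAMPLES, 200, 128, 1, 10)   # same GPU, bs 128
eight_gpu = training_budget(NUM_SAMPLES, 200, 128, 8, 40)    # 8 GPUs, 40 epochs

for name, run in [("1x224, 10 ep", single_gpu),
                  ("1x128, 10 ep", default_bs),
                  ("8x128, 40 ep", eight_gpu)]:
    print(name, run)
```

Under these assumptions, the 10-epoch single-GPU run processes one quarter of the samples that the 8-GPU, 40-epoch run does, regardless of the per-GPU batch size.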

Thanks for the reply. Training 10 epochs takes about 30 hours on one 4090, which is long for me.

I wonder whether the training can be accelerated with multiple GPUs; however, I do not have Slurm. I also tried to train using

PYTHONPATH=".":$PYTHONPATH python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --no-validate --gpus 2

But it failed with:

AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.

Could you please point out how to use multiple GPUs without Slurm?

Hi, I added a new script at tools/dist_train.sh. You can use the PyTorch launcher to run a 2-GPU training with the command below:

sh tools/dist_train.sh configs/remodiffuse/remodiffuse_t2m.py logs/test2 2 --no-validate --gpu-ids 0 1

Thanks. I trained the network on 8 GPUs with a batch size of 128 per GPU for 40 epochs. The recon_loss reaches around 0.27, the FID is about 0.2, and R-precision (top 3) reaches 0.78. I have two questions:

How can I reach the results in your paper? Do I have to use a larger batch size like yours? How do you fit a batch size of 1024? Your GPUs must have a lot of memory!
Does DDPM yield better results than DDIM? I noticed that you use 50-step DDIM for inference.

Sorry for the late reply. I have had a fever recently and haven't checked my email.

My batch size is similar to yours: 8 GPUs with a batch size of 128 on each. I'm not sure why your FID is around 0.2. Could you please share your config file, log, and checkpoint?

DDPM performs slightly worse than DDIM when both use the same 50-step strategy. The 50-step strategy speeds up the whole generation process without a noticeable performance drop (so ReMoDiffuse is 20x faster than other diffusion-based motion generation pipelines).
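
For reference, the `respace='15,15,8,6,6'` entry in `diffusion_test` appears to be where the 50 steps come from: the five counts sum to 50. A minimal sketch of how such a string could be expanded into timestep indices, assuming the repository follows the section-based respacing convention popularized by the improved-DDPM codebase (split the 1000 training steps into equal sections and take evenly spaced steps within each). The function below is an illustrative reimplementation, not the repository's code.

```python
def space_timesteps(num_timesteps, section_counts):
    """Pick a subset of timesteps: split [0, num_timesteps) into equal
    sections and take the requested number of evenly spaced steps in each."""
    counts = [int(x) for x in section_counts.split(",")]
    size_per = num_timesteps // len(counts)
    extra = num_timesteps % len(counts)
    start = 0
    steps = []
    for i, count in enumerate(counts):
        # Earlier sections absorb the remainder when the split is uneven.
        size = size_per + (1 if i < extra else 0)
        if size < count:
            raise ValueError(f"cannot take {count} steps from a section of {size}")
        stride = 1.0 if count <= 1 else (size - 1) / (count - 1)
        position = 0.0
        for _ in range(count):
            steps.append(start + round(position))
            position += stride
        start += size
    return sorted(set(steps))


steps = space_timesteps(1000, "15,15,8,6,6")
print(len(steps), steps[:5], steps[-5:])
```

Under this convention the low-noise end of the schedule (small timestep indices) is sampled more densely than the high-noise end.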

Thanks for your reply. I retrained the network and the result is close to the paper's.

I have another question. I noticed that you apply strong smoothing in visualize.py:

joint = motion_temporal_filter(joint, sigma=2.5)

If I remove this line, the resulting motion jitters a lot. Why? Is it related to the network structure, or is there another reason?
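
To illustrate what this kind of post-processing does, here is a minimal sketch of Gaussian smoothing along the time axis. It assumes `motion_temporal_filter` filters each joint coordinate independently over frames, which is not confirmed in this thread; the helpers `gaussian_kernel`, `smooth_track`, and `jitter` are hypothetical and written with the standard library only.

```python
import math


def gaussian_kernel(sigma, truncate=4.0):
    """Normalized 1-D Gaussian weights covering +/- truncate * sigma frames."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_track(track, sigma=2.5):
    """Smooth one coordinate over time, repeating the edge frames as padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(track) - 1
    smoothed = []
    for t in range(len(track)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(t + k - radius, 0), last)  # clamp to valid frames
            acc += w * track[idx]
        smoothed.append(acc)
    return smoothed


def jitter(track):
    """Mean absolute frame-to-frame change, a crude jitter measure."""
    return sum(abs(b - a) for a, b in zip(track, track[1:])) / (len(track) - 1)


# A slow drift plus frame-to-frame noise, standing in for one joint coordinate.
noisy = [0.01 * t + (0.05 if t % 2 else -0.05) for t in range(60)]
smoothed = smooth_track(noisy, sigma=2.5)
print(jitter(noisy), jitter(smoothed))
```

A sigma of 2.5 frames averages over roughly ten frames on each side, so high-frequency residual noise is suppressed while the slower motion trend is kept.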

It is related to the diffusion setting. If you choose the cosine scheduler, the noise scale drops dramatically during the last several iterations, and the whole pipeline can generate smooth animation without post-processing. However, the cosine scheduler has two drawbacks (according to my experiments): 1) the quantitative results are worse than with the linear scheduler; 2) if you still use the 50-step strategy to speed up sampling, both the quantitative and the qualitative results are unsatisfactory.

I'm currently working on this problem but have made little progress so far.
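
The difference between the two schedulers can be inspected numerically. The sketch below assumes that `'linear'` and `'cosine'` follow the definitions used in the improved-DDPM codebase (linear betas from 1e-4 to 0.02 for 1000 steps; cosine schedule with offset 0.008 and betas clipped at 0.999). Whether ReMoDiffuse uses exactly these constants is an assumption, and the function names are illustrative.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing betas, rescaled so other step counts stay comparable."""
    scale = 1000 / num_steps
    lo, hi = scale * beta_start, scale * beta_end
    return [lo + (hi - lo) * i / (num_steps - 1) for i in range(num_steps)]


def cosine_betas(num_steps, offset=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative signal level."""
    def alpha_bar(u):
        return math.cos((u + offset) / (1 + offset) * math.pi / 2) ** 2

    return [
        min(1 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]


linear = linear_betas(1000)
cosine = cosine_betas(1000)

# Small indices are the last steps visited during sampling (lowest noise).
for t in (0, 1, 2, 10, 999):
    print(t, f"linear={linear[t]:.3e}", f"cosine={cosine[t]:.3e}")
```

Under these assumed definitions, the cosine schedule has smaller betas than the linear one at the very first timesteps, which are the final steps of sampling, and much larger betas near the last timestep.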

I tried the cosine scheduler, and the resulting FID is about 8.1 (dramatically larger than with the linear scheduler). The other metrics are also much worse than with the linear scheduler.

Is this expected, or did I make a mistake?

Hi, could you please share your config?

I only changed beta_scheduler to "cosine"; the other settings are the same.


data_keys = ['motion', 'motion_mask', 'motion_length', 'clip_feat']
meta_keys = ['text', 'token']
train_pipeline = [
    dict(
        type='Normalize',
        mean_path=
        'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
        std_path=
        'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
    ),
    dict(type='Crop', crop_size=196),
    dict(
        type='ToTensor',
        keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
    dict(
        type='Collect',
        keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
        meta_keys=['text', 'token'])
]
data = dict(
    samples_per_gpu=128,
    workers_per_gpu=1,
    train=dict(
        type='RepeatDataset',
        dataset=dict(
            type='TextMotionDataset',
            dataset_name='human_ml3d',
            data_prefix='ReMoDiffuse_data/data',
            pipeline=[
                dict(
                    type='Normalize',
                    mean_path=
                    'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
                    std_path=
                    'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
                ),
                dict(type='Crop', crop_size=196),
                dict(
                    type='ToTensor',
                    keys=[
                        'motion', 'motion_mask', 'motion_length', 'clip_feat'
                    ]),
                dict(
                    type='Collect',
                    keys=[
                        'motion', 'motion_mask', 'motion_length', 'clip_feat'
                    ],
                    meta_keys=['text', 'token'])
            ],
            ann_file='train.txt',
            motion_dir='motions',
            text_dir='texts',
            token_dir='tokens',
            clip_feat_dir='clip_feats'),
        times=200),
    test=dict(
        type='TextMotionDataset',
        dataset_name='human_ml3d',
        data_prefix='ReMoDiffuse_data/data',
        pipeline=[
            dict(
                type='Normalize',
                mean_path=
                'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
                std_path=
                'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
            ),
            dict(type='Crop', crop_size=196),
            dict(
                type='ToTensor',
                keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
            dict(
                type='Collect',
                keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
                meta_keys=['text', 'token'])
        ],
        ann_file='test.txt',
        motion_dir='motions',
        text_dir='texts',
        token_dir='tokens',
        clip_feat_dir='clip_feats',
        eval_cfg=dict(
            shuffle_indexes=True,
            replication_times=1,
            replication_reduction='statistics',
            text_encoder_name='human_ml3d',
            text_encoder_path=
            'ReMoDiffuse_data/data/evaluators/human_ml3d/finest.tar',
            motion_encoder_name='human_ml3d',
            motion_encoder_path=
            'ReMoDiffuse_data/data/evaluators/human_ml3d/finest.tar',
            metrics=[
                dict(type='R Precision', batch_size=32, top_k=3),
                dict(type='Matching Score', batch_size=32),
                dict(type='FID'),
                dict(type='Diversity', num_samples=300),
                dict(
                    type='MultiModality',
                    num_samples=100,
                    num_repeats=30,
                    num_picks=10)
            ]),
        test_mode=True))
checkpoint_config = dict(interval=1)
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
optimizer = dict(type='Adam', lr=0.0002)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='CosineAnnealing', min_lr_ratio=2e-05, by_epoch=False)
runner = dict(type='EpochBasedRunner', max_epochs=40)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
input_feats = 263
max_seq_len = 196
latent_dim = 512
time_embed_dim = 2048
text_latent_dim = 256
ff_size = 1024
num_heads = 8
dropout = 0
model = dict(
    type='MotionDiffusion',
    model=dict(
        type='ReMoDiffuseTransformer',
        input_feats=263,
        max_seq_len=196,
        latent_dim=512,
        time_embed_dim=2048,
        num_layers=4,
        ca_block_cfg=dict(
            type='SemanticsModulatedAttention',
            latent_dim=512,
            text_latent_dim=256,
            num_heads=8,
            dropout=0,
            time_embed_dim=2048),
        ffn_cfg=dict(
            latent_dim=512, ffn_dim=1024, dropout=0, time_embed_dim=2048),
        text_encoder=dict(
            pretrained_model='clip',
            latent_dim=256,
            num_layers=2,
            ff_size=2048,
            dropout=0,
            use_text_proj=False),
        retrieval_cfg=dict(
            num_retrieval=2,
            stride=4,
            num_layers=2,
            num_motion_layers=2,
            kinematic_coef=0.1,
            topk=2,
            retrieval_file=
            'ReMoDiffuse_data/data/database/t2m_text_train.npz',
            latent_dim=512,
            output_dim=512,
            max_seq_len=196,
            num_heads=8,
            ff_size=1024,
            dropout=0,
            ffn_cfg=dict(latent_dim=512, ffn_dim=1024, dropout=0),
            sa_block_cfg=dict(
                type='EfficientSelfAttention',
                latent_dim=512,
                num_heads=8,
                dropout=0)),
        scale_func_cfg=dict(
            coarse_scale=6.5,
            both_coef=0.52351,
            text_coef=-0.28419,
            retr_coef=2.39872)),
    loss_recon=dict(type='MSELoss', loss_weight=1, reduction='none'),
    diffusion_train=dict(
        beta_scheduler='cosine',
        diffusion_steps=1000,
        model_mean_type='start_x',
        model_var_type='fixed_large'),
    diffusion_test=dict(
        beta_scheduler='cosine',
        diffusion_steps=1000,
        model_mean_type='start_x',
        model_var_type='fixed_large',
        respace='15,15,8,6,6'),
    inference_type='ddim')
work_dir = 'ReMoDiffuse_logs/20230926_linear'
gpu_ids = range(0, 8)

You may remove the option respace='15,15,8,6,6' and evaluate again. This option causes a noticeable performance drop when you use the cosine scheduler.
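
As a sketch of the change suggested above, the snippet below takes the `diffusion_test` values from the posted config and drops the `respace` key. It assumes that omitting `respace` makes evaluation run over all 1000 diffusion steps, which is not confirmed in this thread; in practice you would edit the `diffusion_test` entry in the config file directly.

```python
import copy

# Values copied from the config posted above.
diffusion_test = dict(
    beta_scheduler='cosine',
    diffusion_steps=1000,
    model_mean_type='start_x',
    model_var_type='fixed_large',
    respace='15,15,8,6,6')

# Drop the respacing entry; the training-time settings stay untouched.
diffusion_test_full = copy.deepcopy(diffusion_test)
diffusion_test_full.pop('respace', None)

print(diffusion_test_full)
```

Expect evaluation to take considerably longer without respacing, since many more denoising steps are executed per sample.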

Hello, we removed the option, and the results are as follows:

Is this correct?


R_precision Top 1 (mean) : 0.4886

R_precision Top 1 (conf) : 0.0000

R_precision Top 2 (mean) : 0.6799

R_precision Top 2 (conf) : 0.0000

R_precision Top 3 (mean) : 0.7788

R_precision Top 3 (conf) : 0.0000

Matching Score (mean) : 3.1570

Matching Score (conf) : 0.0000

FID (mean) : 0.4478

FID (conf) : 0.0000

Diversity (mean) : 8.9949

Diversity (conf) : 0.0000

Yes, it is similar to my results.

Hello, may I ask whether it is necessary to use 8 GPUs to get recon_loss down to 0.3?

Where is it mentioned that DDIM uses 50 steps? I can only find DDIM using the fixed 1000 steps in gaussian_diffusion.py...

mingyuan-zhang / ReMoDiffuse

Fail to reproduce the result in the paper #5