Hi, I see that you doubled the batch size, which reduces the number of iterations. From my log (trained with 8 GPUs, a total batch size of 1024, and 40 epochs), the loss in the last iterations should be below 0.300. It seems the model you trained is under-fitting. Could you change the batch size back to the original value, or change the number of epochs to 20?
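For reference, the effective batch size is gpus × samples_per_gpu, so with a fixed number of epochs the optimizer-step count is inversely proportional to the per-GPU batch size. A minimal sketch of that relation (the dataset size below is a placeholder, not the actual HumanML3D count):

# Effective batch size vs. number of optimizer steps.
# NOTE: num_samples is a placeholder, not the real HumanML3D size.
def total_steps(num_samples, repeat_times, gpus, samples_per_gpu, epochs):
    effective_batch = gpus * samples_per_gpu           # e.g. 8 * 128 = 1024
    iters_per_epoch = (num_samples * repeat_times) // effective_batch
    return iters_per_epoch * epochs

# Doubling samples_per_gpu at a fixed epoch count halves the number of steps:
print(total_steps(20000, repeat_times=200, gpus=1, samples_per_gpu=128, epochs=10))
print(total_steps(20000, repeat_times=200, gpus=1, samples_per_gpu=256, epochs=10))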
Thanks for the reply. Training for 10 epochs takes about 30 hours on a single 4090, which is too long for me.
I wonder whether training can be accelerated with multiple GPUs; however, I do not have Slurm. I also tried to train with:
PYTHONPATH=".":$PYTHONPATH python tools/train.py ${CONFIG_FILE} ${WORK_DIR} --no-validate --gpus 2
but it failed with:
AssertionError: MMDataParallel only supports single GPU training, if you need to train with multiple GPUs, please use MMDistributedDataParallel instead.
Could you please point out how to use multiple GPUs without Slurm?
Hi, I added a new script at tools/dist_train.sh. You can use the PyTorch launcher to run 2-GPU training with the command below:
sh tools/dist_train.sh configs/remodiffuse/remodiffuse_t2m.py logs/test2 2 --no-validate --gpu-ids 0 1
Thanks. I trained the network on 8 GPUs with a batch size of 128 for 40 epochs; recon_loss reaches around 0.27, FID reaches about 0.2, and R-Precision (Top 3) reaches 0.78. I have two questions:
- How can I reach the results in your paper? Do I have to make the batch size larger, like yours? How do you use a batch size of 1024? Your GPUs' memory must be huge!
- Does DDPM yield better results than DDIM? I noticed that you use 50-step DDIM at inference.
Sorry for the late reply. I've had a fever recently and haven't checked my email.
Thanks for your reply. I retrained the network and the result is close to the paper's.
I have another question. I noticed that you apply strong smoothing in visualize.py:
joint = motion_temporal_filter(joint, sigma=2.5)
If I remove this line, the resulting motion jitters a lot. Why? Is it related to the network structure, or to something else?
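For context, motion_temporal_filter in this kind of pipeline is typically a per-coordinate Gaussian blur along the time axis. A minimal sketch of that idea (an approximation, not necessarily the repo's exact implementation):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def temporal_filter_sketch(joints, sigma=2.5):
    # joints: (T, J, 3) array of T frames, J joints, xyz coordinates.
    # Each coordinate track is smoothed independently along time, which
    # suppresses frame-to-frame jitter but also damps fast motion.
    T = joints.shape[0]
    flat = joints.reshape(T, -1)
    smoothed = gaussian_filter1d(flat, sigma=sigma, axis=0, mode='nearest')
    return smoothed.reshape(joints.shape)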
It is related to the diffusion settings. If you choose the cosine scheduler, the noise scale shrinks dramatically during the last several iterations, so the whole pipeline can generate smooth animation without post-processing. However, the cosine scheduler has two drawbacks (according to my experiments): 1) the quantitative results are worse than with the linear scheduler; 2) if you still use the 50-step strategy to speed up sampling, both the quantitative and qualitative results are unsatisfactory.
I'm currently working on this problem but have made little progress so far.
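For anyone comparing the two schedules: the linear schedule below is the standard DDPM one and the cosine schedule follows Nichol & Dhariwal; this is only a sketch, and the exact beta_start/beta_end values used in the repo may differ.

import numpy as np

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    # Standard DDPM linear schedule.
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    # Cosine schedule: define alpha_bar(t), then derive and clip the betas.
    def alpha_bar(t):
        return np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return np.array([min(1 - alpha_bar(i + 1) / alpha_bar(i), max_beta)
                     for i in range(T)])

# The last denoising steps in sampling use the smallest timesteps, where the
# injected noise std is roughly sqrt(beta_t); the cosine schedule keeps these
# betas smaller, which is why its outputs look smoother without filtering.
print(np.sqrt(linear_betas()[:5]))
print(np.sqrt(cosine_betas()[:5]))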
I tried the cosine scheduler; the resulting FID is about 8.1, which is dramatically larger than with the linear scheduler. The other metrics are also much worse than with the linear scheduler.
Is this expected, or did I make a mistake?
Hi, could you please share your config?
I only changed beta_scheduler to "cosine"; the other settings are the same:
data_keys = ['motion', 'motion_mask', 'motion_length', 'clip_feat']
meta_keys = ['text', 'token']
train_pipeline = [
dict(
type='Normalize',
mean_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
std_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
),
dict(type='Crop', crop_size=196),
dict(
type='ToTensor',
keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
dict(
type='Collect',
keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
meta_keys=['text', 'token'])
]
data = dict(
samples_per_gpu=128,
workers_per_gpu=1,
train=dict(
type='RepeatDataset',
dataset=dict(
type='TextMotionDataset',
dataset_name='human_ml3d',
data_prefix='ReMoDiffuse_data/data',
pipeline=[
dict(
type='Normalize',
mean_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
std_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
),
dict(type='Crop', crop_size=196),
dict(
type='ToTensor',
keys=[
'motion', 'motion_mask', 'motion_length', 'clip_feat'
]),
dict(
type='Collect',
keys=[
'motion', 'motion_mask', 'motion_length', 'clip_feat'
],
meta_keys=['text', 'token'])
],
ann_file='train.txt',
motion_dir='motions',
text_dir='texts',
token_dir='tokens',
clip_feat_dir='clip_feats'),
times=200),
test=dict(
type='TextMotionDataset',
dataset_name='human_ml3d',
data_prefix='ReMoDiffuse_data/data',
pipeline=[
dict(
type='Normalize',
mean_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/mean.npy',
std_path=
'ReMoDiffuse_data/data/datasets/human_ml3d/std.npy'
),
dict(type='Crop', crop_size=196),
dict(
type='ToTensor',
keys=['motion', 'motion_mask', 'motion_length', 'clip_feat']),
dict(
type='Collect',
keys=['motion', 'motion_mask', 'motion_length', 'clip_feat'],
meta_keys=['text', 'token'])
],
ann_file='test.txt',
motion_dir='motions',
text_dir='texts',
token_dir='tokens',
clip_feat_dir='clip_feats',
eval_cfg=dict(
shuffle_indexes=True,
replication_times=1,
replication_reduction='statistics',
text_encoder_name='human_ml3d',
text_encoder_path=
'ReMoDiffuse_data/data/evaluators/human_ml3d/finest.tar',
motion_encoder_name='human_ml3d',
motion_encoder_path=
'ReMoDiffuse_data/data/evaluators/human_ml3d/finest.tar',
metrics=[
dict(type='R Precision', batch_size=32, top_k=3),
dict(type='Matching Score', batch_size=32),
dict(type='FID'),
dict(type='Diversity', num_samples=300),
dict(
type='MultiModality',
num_samples=100,
num_repeats=30,
num_picks=10)
]),
test_mode=True))
checkpoint_config = dict(interval=1)
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
optimizer = dict(type='Adam', lr=0.0002)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='CosineAnnealing', min_lr_ratio=2e-05, by_epoch=False)
runner = dict(type='EpochBasedRunner', max_epochs=40)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
input_feats = 263
max_seq_len = 196
latent_dim = 512
time_embed_dim = 2048
text_latent_dim = 256
ff_size = 1024
num_heads = 8
dropout = 0
model = dict(
type='MotionDiffusion',
model=dict(
type='ReMoDiffuseTransformer',
input_feats=263,
max_seq_len=196,
latent_dim=512,
time_embed_dim=2048,
num_layers=4,
ca_block_cfg=dict(
type='SemanticsModulatedAttention',
latent_dim=512,
text_latent_dim=256,
num_heads=8,
dropout=0,
time_embed_dim=2048),
ffn_cfg=dict(
latent_dim=512, ffn_dim=1024, dropout=0, time_embed_dim=2048),
text_encoder=dict(
pretrained_model='clip',
latent_dim=256,
num_layers=2,
ff_size=2048,
dropout=0,
use_text_proj=False),
retrieval_cfg=dict(
num_retrieval=2,
stride=4,
num_layers=2,
num_motion_layers=2,
kinematic_coef=0.1,
topk=2,
retrieval_file=
'ReMoDiffuse_data/data/database/t2m_text_train.npz',
latent_dim=512,
output_dim=512,
max_seq_len=196,
num_heads=8,
ff_size=1024,
dropout=0,
ffn_cfg=dict(latent_dim=512, ffn_dim=1024, dropout=0),
sa_block_cfg=dict(
type='EfficientSelfAttention',
latent_dim=512,
num_heads=8,
dropout=0)),
scale_func_cfg=dict(
coarse_scale=6.5,
both_coef=0.52351,
text_coef=-0.28419,
retr_coef=2.39872)),
loss_recon=dict(type='MSELoss', loss_weight=1, reduction='none'),
diffusion_train=dict(
beta_scheduler='cosine',
diffusion_steps=1000,
model_mean_type='start_x',
model_var_type='fixed_large'),
diffusion_test=dict(
beta_scheduler='cosine',
diffusion_steps=1000,
model_mean_type='start_x',
model_var_type='fixed_large',
respace='15,15,8,6,6'),
inference_type='ddim')
work_dir = 'ReMoDiffuse_logs/20230926_linear'
gpu_ids = range(0, 8)
You may remove the option respace='15,15,8,6,6' and evaluate again. This option causes a noticeable performance drop when you use the cosine scheduler.
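For reference, respace='15,15,8,6,6' sums to 50 steps, which is where the 50-step sampling comes from; removing it makes evaluation use the full 1000-step schedule. If the respacing follows the improved-diffusion convention (a sketch of that convention, not necessarily the exact helper in this repo), the comma-separated counts split the original timesteps into equal portions and keep that many evenly spaced steps from each:

def space_timesteps_sketch(num_timesteps, section_counts):
    # E.g. num_timesteps=1000, section_counts=[15, 15, 8, 6, 6] keeps 50
    # timesteps in total, denser near t=0 (the final denoising steps)
    # and sparser near t=num_timesteps.
    size_per = num_timesteps // len(section_counts)
    kept = []
    for i, count in enumerate(section_counts):
        start = i * size_per
        stride = (size_per - 1) / (count - 1) if count > 1 else 0
        kept += [start + round(stride * j) for j in range(count)]
    return sorted(set(kept))

steps = space_timesteps_sketch(1000, [15, 15, 8, 6, 6])
print(len(steps))   # 50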
Hello, we tried removing the option; the results are as follows. Are they correct?
R_precision Top 1 (mean) : 0.4886
R_precision Top 1 (conf) : 0.0000
R_precision Top 2 (mean) : 0.6799
R_precision Top 2 (conf) : 0.0000
R_precision Top 3 (mean) : 0.7788
R_precision Top 3 (conf) : 0.0000
Matching Score (mean) : 3.1570
Matching Score (conf) : 0.0000
FID (mean) : 0.4478
FID (conf) : 0.0000
Diversity (mean) : 8.9949
Diversity (conf) : 0.0000
Yes, it is similar to my results.
Hello, may I ask whether it is necessary to use 8 GPUs to make recon_loss reach 0.3?
Sorry for the late reply. I've had a fever recently and haven't checked my email.
- My batch size is similar to yours: 8 GPUs with a batch size of 128 on each. I'm not sure why your FID is around 0.2. Could you please share your config file, log, and checkpoint?
- DDPM performs a little worse than DDIM under the same 50-step strategy. The 50-step strategy is used to speed up the whole generation process without a noticeable performance drop (which is why ReMoDiffuse is about 20x faster than other diffusion-based motion generation pipelines).
Where is the 50-step DDIM mentioned? I can only find DDIM using the fixed 1000 steps in gaussian_diffusion.py...
Thanks for sharing the code.
I tried to run the training code on a single 4090 and changed the number of epochs to one fourth (line 17 of remodiffuse_t2m.py):
runner = dict(type='EpochBasedRunner', max_epochs=10)
However, the resulting FID does not match the value in the paper. In the paper, the FID on HumanML3D is 0.103 and the Top-3 R-Precision is 0.795, yet when I test the resulting checkpoints (with 1 replication to save time), the FID only reaches 0.59, which is far from the paper's value (this is probably not an issue of the number of replications). What went wrong?
The training log is as follows: