hbertiche / NeuralClothSim


Model Struggling to Learn Cloth Dynamics #21

Open · NathanielB123 opened 1 month ago

NathanielB123 commented 1 month ago

I am training the model with the T-Shirt garment on around 700 animations downloaded from Mixamo (via mixamo_anims_downloader).

In the config, I have increased batch_size to 300 (any more and I exhaust the 12 GB of VRAM on my RTX 4070) but have left everything else at its default (i.e. temporal_window_size = 0.5 and reflect_probability = motion_augmentation = 0.0).
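For reference, here is roughly what I'm running with, written out as a plain Python dict (the field names are just how I refer to them above; the repo's actual config file may be structured differently):

# Sketch of the training settings described above (values from this issue,
# field names assumed, not the repo's actual config schema).
config = {
  "garment": "T-Shirt",
  "batch_size": 300,            # any higher exhausts the 12 GB of VRAM on my RTX 4070
  "temporal_window_size": 0.5,  # default
  "reflect_probability": 0.0,   # default
  "motion_augmentation": 0.0,   # default
}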

Unfortunately, after training this model for 100 epochs (which does not sound like much, but each epoch consists of 209 batches of 300 sequences), I am struggling to see much evidence of cloth dynamics. For example (technically these GIFs are from an earlier epoch, but the output barely changed with further training):

[GIF: RunWithTorch]

Compared to setting motion=0.0:

[GIF: RunWithTorchNoMotion]

Maybe it is just really subtle, but I can barely see a difference.

I think this is the same animation as the one at 1:26 in https://youtu.be/6HxXLBzRXFg?t=86, so there is clearly a pretty huge gap between the results I'm getting and what was shown in that video.

At the 100th epoch, I got the following metrics:

m/Loss: 1.0557 - m/Stretch: 6.5842e-06 - m/Shear: 0.1285 - m/Bending: 0.1992 - m/Collision: 0.0041 - m/Gravity: 1.0049 - m/Inertia: 0.0049

FYI after just 5 batches (not epochs!) of training, the metrics were:

m/Loss: 1.6685 - m/Stretch: 2.6247e-05 - m/Shear: 0.2460 - m/Bending: 0.0590 - m/Collision: 0.0359 - m/Gravity: 1.0188 - m/Inertia: 0.0046

So the Stretch, Shear, Collision and, to a lesser extent, Gravity losses all improved quite a bit (Bending actually went up), but the Inertia loss barely changed and, if anything, got slightly worse. Perhaps the inertia loss starting off so tiny is why the model does not seem to be learning dynamics?
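To make that comparison concrete, here is a small standalone snippet (plain Python, with the numbers copied from the two log lines above) that computes the relative change of each loss term:

# Loss terms after 5 batches vs. after 100 epochs (values copied from the logs above).
early = {"Stretch": 2.6247e-05, "Shear": 0.2460, "Bending": 0.0590,
         "Collision": 0.0359, "Gravity": 1.0188, "Inertia": 0.0046}
late = {"Stretch": 6.5842e-06, "Shear": 0.1285, "Bending": 0.1992,
        "Collision": 0.0041, "Gravity": 1.0049, "Inertia": 0.0049}

for term in early:
  change = (late[term] - early[term]) / early[term]
  print(f"{term}: {change:+.1%}")

# Inertia barely moves (about +7%) and starts out roughly 200x smaller than
# Gravity, so it contributes very little to the total loss; Bending is the
# other term that increases.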

I also tried a very unscientific test: printing the mean and max magnitude of the values in the dynamic and static encodings (just before running the decoder):

import tensorflow as tf  # added for completeness; this runs inside the model code

# Per-frame mean and max magnitudes of the static and dynamic encodings,
# printed just before the decoder is run.
x_static_abs = tf.abs(x_static)
x_dynamic_abs = tf.abs(x_dynamic)
for i in range(x_static.shape[1]):  # iterate over axis 1 (the frames)
  x_static_slice = x_static_abs[:, i, :]
  x_dynamic_slice = x_dynamic_abs[:, i, :]
  x_static_mean = tf.math.reduce_mean(x_static_slice)
  x_static_max = tf.math.reduce_max(x_static_slice)
  x_dynamic_mean = tf.math.reduce_mean(x_dynamic_slice)
  x_dynamic_max = tf.math.reduce_max(x_dynamic_slice)
  print(f"Means: Static - {x_static_mean}, Dynamic - {x_dynamic_mean}\nMaximums: Static - {x_static_max}, Dynamic - {x_dynamic_max}")

and got (for the last frame of the above animation):

Means: Static - 0.015779726207256317, Dynamic - 0.013485459610819817
Maximums: Static - 0.14864373207092285, Dynamic - 0.062263891100883484

Obviously, the average/maximum magnitude of the encoding values won't necessarily correlate with the size of the output deformations, but at least it looks like the dynamic encoder is having some influence on the final output, just not anything that resembles coherent cloth dynamics.
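A more direct version of this test would be to decode twice, once with the dynamic encoding zeroed out, and measure how far the predicted vertices actually move. Sketching it with a placeholder decoder call (the real static/dynamic branches may be combined differently than simple concatenation):

import tensorflow as tf

def dynamic_contribution(decoder, x_static, x_dynamic):
  # 'decoder' stands in for whatever maps the encodings to per-vertex
  # deformations in the actual model.
  full = decoder(tf.concat([x_static, x_dynamic], axis=-1))
  static_only = decoder(tf.concat([x_static, tf.zeros_like(x_dynamic)], axis=-1))
  delta = tf.norm(full - static_only, axis=-1)  # per-vertex displacement magnitude
  return tf.reduce_mean(delta), tf.reduce_max(delta)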

My main question, then: am I doing anything obviously wrong? The most obvious thing left to try is training for more epochs, but the paper did mention that simple garments should only take an hour to train (a day at most), whereas training for 100 epochs took about a day on a 4070 and the losses are decreasing very slowly. Regardless, I will keep training over the next few days and update this issue if I get a better result. Any other ideas for what I might be doing wrong (do I need to train with a larger batch size?), or other assistance in general (perhaps sharing the exact config/set of Mixamo animations used to train the model shown in the paper?), would be very much appreciated. Thanks!!

NathanielB123 commented 3 weeks ago

Follow-up on this: I have now trained a model for 1000 epochs, and it has learnt to ignore dynamics completely:

Means: Static-0.010450625792145729 Dynamic-0.0
Maximums: Static-0.3061593174934387 Dynamic-0.0

I also tried printing out the number of zeroes in both the static and dynamic latent codes:

# Count how many entries of each (512-element) latent code slice are exactly zero.
print(
  f"Zeroes: Static-{tf.reduce_sum(tf.cast(tf.equal(x_static_slice, 0), tf.int32))} Dynamic-{tf.reduce_sum(tf.cast(tf.equal(x_dynamic_slice, 0), tf.int32))}"
)

and got:

Zeroes: Static-457 Dynamic-512

So even the static latent code is mostly zeroes (both latent codes contain just 512 elements). In other words, perhaps the problem is not limited to the dynamic encoder: it looks like the model is ending up with a lot of dead ReLUs somehow (and the dynamic branch is hit hardest because the inertia loss is more unstable)?
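To check whether these really are dead units rather than a code that just happens to be sparse for this one frame, the same count could be done across a whole batch of frames: a dimension that is exactly zero for every frame is almost certainly dead. A rough sketch (the stacked-code variable names here are hypothetical; shapes assumed to be [num_frames, 512]):

import tensorflow as tf

# A latent dimension that is zero for every frame in the batch is treated as dead;
# one that is zero for only some frames is merely sparse.
always_zero_static = tf.reduce_all(tf.equal(x_static_codes, 0.0), axis=0)
always_zero_dynamic = tf.reduce_all(tf.equal(x_dynamic_codes, 0.0), axis=0)
print("Dead static dims:", int(tf.reduce_sum(tf.cast(always_zero_static, tf.int32))))
print("Dead dynamic dims:", int(tf.reduce_sum(tf.cast(always_zero_dynamic, tf.int32))))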

I've experimented with various tweaks: freezing the weights of the dynamic/static encoder to train each individually, scaling up the inertia loss, scaling up the inertia loss over time, training with a larger batch size (by training on the CPU I managed to run with a batch size of 1024), training with a smaller batch size, and trying different temporal window sizes and motion-augmentation percentages. None of these got me results anywhere close to those shown in the paper and video...
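"Scaling up the inertia loss over time" was just a linear ramp on a per-term weight, along these lines (illustrative sketch only; the term names are mine and the actual loss assembly in the repo is structured differently):

def inertia_weight(epoch, ramp_epochs=100, target_scale=100.0):
  # Linearly ramp the inertia weight from 1x up to target_scale over ramp_epochs epochs.
  t = min(epoch / ramp_epochs, 1.0)
  return 1.0 + t * (target_scale - 1.0)

# total_loss = stretch + shear + bending + collision + gravity \
#            + inertia_weight(epoch) * inertia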

As before, any ideas on what I might be doing wrong would be hugely appreciated!