lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch
MIT License

gateloop_use_heinsen=True on MeshTransformer results in NaN loss #47

Closed. Kurokabe closed this issue 8 months ago.

Kurokabe commented 8 months ago

The loss quickly became NaN when training on ShapeNet (filtered to meshes with < 800 faces after decimation, resulting in ~15k different 3D models) with condition_on_text=True. I thought it was similar to #44, but even after training without text, adjusting the learning rate, adding warmup, and using a larger batch size, I still had this problem.

Using detect_anomaly, I found that the NaNs come from heinsen_associative_scan in gateloop_transformer, during the backward pass of the log.
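
For reference, this is roughly how to localize it; a minimal sketch using PyTorch's built-in anomaly detection (the forward call is schematic, data loading and text conditioning are omitted):

import torch

# anomaly detection makes the backward pass report which op produced the NaN/Inf
# (it slows training down considerably, so only enable it while debugging)
with torch.autograd.set_detect_anomaly(True):
    loss = transformer(vertices = vertices, faces = faces)  # schematic forward pass
    loss.backward()  # raises an error whose traceback points at the offending op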

Previously I was able to train successfully on 5 ShapeNet categories, with 10 meshes x 256 transformations each = 12,800 3D models, but that was with meshgpt-pytorch version 0.4.2. Maybe updating to version 0.5.5 also updated gateloop-transformer, which introduced this bug. In any case, setting gateloop_use_heinsen=False seems to have solved the problem for me.
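
For anyone hitting the same thing, a minimal sketch of the workaround at construction time (the autoencoder and the other constructor arguments are placeholders, adjust them to your own setup):

from meshgpt_pytorch import MeshAutoencoder, MeshTransformer

autoencoder = MeshAutoencoder(num_discrete_coors = 128)  # placeholder autoencoder config

transformer = MeshTransformer(
    autoencoder,
    dim = 512,                      # illustrative values
    max_seq_len = 8192,
    condition_on_text = True,
    gateloop_use_heinsen = False    # avoid the NaN coming from heinsen_associative_scan's backward
)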

lucidrains commented 8 months ago

ah, thanks for reporting

I'll default it off from now on

lucidrains commented 8 months ago

@Kurokabe does the use of gateloop layers help with convergence?

Kurokabe commented 8 months ago

With the default values it's converging, but I didn't try with coarse_pre_gateloop_depth=0 and fine_pre_gateloop_depth=0 to compare. If I find the time to run without GateLoop, I'll report back here.

lucidrains commented 8 months ago

@Kurokabe please do, i'm curious to know

Kurokabe commented 8 months ago

> @Kurokabe does the use of gateloop layers help with convergence?

So I tried with coarse_pre_gateloop_depth=0 and fine_pre_gateloop_depth=0. I didn't train it as long as my previous model, but using the GateLoop layers seems to improve performance a bit:

[image: comparison of the training runs with and without the GateLoop layers]
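
For the comparison, disabling the GateLoop layers only requires zeroing the two depth arguments; a sketch with otherwise illustrative values:

transformer_no_gateloop = MeshTransformer(
    autoencoder,
    dim = 512,                       # illustrative values, matching the baseline run
    max_seq_len = 8192,
    coarse_pre_gateloop_depth = 0,   # coarse GateLoop block becomes an nn.Identity internally
    fine_pre_gateloop_depth = 0      # same for the fine GateLoop block
)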

Also, in your coarse gateloop and fine gateloop calls, you pass the cache parameter, but with coarse_pre_gateloop_depth=0 and fine_pre_gateloop_depth=0 this won't work, since the nn.Identity() doesn't accept a cache argument: https://github.com/lucidrains/meshgpt-pytorch/blob/5dc77d9b3915558bdb5f3edacbb22468fe27848c/meshgpt_pytorch/meshgpt_pytorch.py#L1483

I temporarily fixed it as follows:

# only pass the cache when the block is an actual GateLoopBlock;
# with coarse_pre_gateloop_depth = 0 it is an nn.Identity and takes no cache argument
if isinstance(self.coarse_gateloop_block, GateLoopBlock):
    face_codes, coarse_gateloop_cache = self.coarse_gateloop_block(face_codes, cache = coarse_gateloop_cache)
else:
    coarse_gateloop_cache = None
    face_codes = self.coarse_gateloop_block(face_codes)
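
The fine gateloop call presumably needs the same guard; here is a sketch under the assumption that the attribute and cache variable mirror the coarse naming (fine_gateloop_block, fine_gateloop_cache) and operate on the fine vertex codes:

# assumed names, mirroring the coarse path above
if isinstance(self.fine_gateloop_block, GateLoopBlock):
    fine_vertex_codes, fine_gateloop_cache = self.fine_gateloop_block(fine_vertex_codes, cache = fine_gateloop_cache)
else:
    fine_gateloop_cache = None
    fine_vertex_codes = self.fine_gateloop_block(fine_vertex_codes)
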
lucidrains commented 8 months ago

@Kurokabe ah nice, thank you for sharing that!

also fixed the issue so that running without the gateloop layers works now