isi-vista / adam

Abduction to Demonstrate an Articulate Machine

GNN training runs out of GPU memory when dealing with many-stroke scenes #1201

Open spigo900 opened 1 year ago

spigo900 commented 1 year ago

GNN training runs out of GPU memory when dealing with many-stroke scenes. For example, this happens when training on the M5 objects curriculum using STEGO segmentations with stroke merging enabled and no color segmentation refinement. In that setup we end up with 599 train inputs, a significant minority of which are "many-stroke" scenes; the largest has s_max = 42 strokes, the most of any input. GNN training can't fit the activations for a dataset like this into memory, so it crashes during the forward pass.

Specifically, we run into trouble with the message-passing part of the GNN (aka the MPNN). The number of edge outputs scales quadratically with s_max, and the edge outputs alone end up using about 8 GiB of memory. This tensor is one of the very first ones computed, and it doesn't leave much room for any further tensors. I haven't tested yet whether this also happens at decode time; it's possible we can scrape by there, since we don't have to hold on to tensors for the backward pass. It's something I plan to test.
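
To make the quadratic blow-up concrete, here's a rough back-of-the-envelope sketch. This is not ADAM's actual code; the hidden size of 2048, and the assumption that all 599 inputs are padded to s_max and held in memory together, are hypothetical numbers picked to roughly reproduce the ~8 GiB figure:

```python
# Hypothetical sketch of MPNN edge-activation memory (not ADAM's real code).
# In a fully connected stroke graph, the number of edges grows as s_max**2,
# so one layer's edge messages form a (num_inputs, s_max**2, hidden_dim) tensor.
def edge_activation_bytes(num_inputs: int, s_max: int, hidden_dim: int,
                          bytes_per_float: int = 4) -> int:
    return num_inputs * s_max**2 * hidden_dim * bytes_per_float

# With the 599 train inputs and s_max = 42 from this issue, and an assumed
# hidden_dim of 2048, a single layer's edge messages alone take ~8 GiB:
gib = edge_activation_bytes(599, 42, 2048) / 2**30
print(f"{gib:.2f} GiB")  # ≈ 8.06 GiB
```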

This isn't a terrible problem, for two reasons. First, it only affects these two variants; it's unlikely to affect either the segmentation experiments or the spatial relations experiments. Second, I've found a good-enough hack/workaround: run the GNN training for the two affected variants on ephemeral-lg. With 48 GiB of GPU memory, we don't even have to worry about adjusting the code. :)

spigo900 commented 1 year ago

Okay, from a quick test it seems we don't need this much memory at decode time. The GNN is happy with a plain old ephemeral GPU. Phew. :)

ETA: I spoke too soon. Decoding the eval/test data requires ephemeral-lg. Oh well. 🙃