bengioe opened 9 months ago
I'm of a mind to merge this, actually. It's not the cleanest implementation possible, but there are significant gains here (as mentioned, a 30% speedup with the default settings on `seh_frag.py`). Will test across tasks and report back.
Made significant simplifications to the method by subclassing `Pickler`/`Unpickler`, and found some very tricky bugs (I was misusing pinned CUDA buffers and ended up with rare race conditions). Speedups remain (might even be a bit faster).
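For readers unfamiliar with the trick, here is a minimal sketch of the `Pickler`/`Unpickler`-subclassing idea (my own numpy illustration, not this PR's actual code): large arrays are copied into a pre-allocated buffer that stands in for the shared/pinned tensor, and only lightweight metadata (offset, shape) travels through the pickle stream.

```python
import io
import pickle
import numpy as np


class BufferPickler(pickle.Pickler):
    """Pickles arrays as (offset, shape) metadata, writing data to a buffer."""

    def __init__(self, file, buffer: np.ndarray):
        super().__init__(file)
        self.buf = buffer  # stands in for a shared or pinned tensor
        self.offset = 0

    def persistent_id(self, obj):
        if isinstance(obj, np.ndarray):
            flat = obj.ravel()
            end = self.offset + flat.size  # assumes the buffer is large enough
            self.buf[self.offset:end] = flat
            pid = ("nd", self.offset, obj.shape)
            self.offset = end
            return pid
        return None  # everything else is pickled normally


class BufferUnpickler(pickle.Unpickler):
    """Rebuilds arrays from the shared buffer using the pickled metadata."""

    def __init__(self, file, buffer: np.ndarray):
        super().__init__(file)
        self.buf = buffer

    def persistent_load(self, pid):
        _tag, offset, shape = pid
        n = int(np.prod(shape))
        # Copy out so the buffer slot can be reused for the next message
        return self.buf[offset:offset + n].reshape(shape).copy()


shared = np.zeros(1 << 16, dtype=np.float32)  # allocated once, reused forever
msg = {"logits": np.arange(6, dtype=np.float32).reshape(2, 3)}

f = io.BytesIO()
BufferPickler(f, shared).dump(msg)
f.seek(0)
out = BufferUnpickler(f, shared).load()
assert np.array_equal(out["logits"], msg["logits"])
```

The payoff is that the pickle stream itself stays tiny; the bulk data moves through memory that both processes already map, instead of being serialized and copied through a pipe.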
Merged with trunk + made a few fixes. Pretty happy with this now!
This PR implements a better way of sharing torch tensors between processes: (large enough) shared tensors are created once and reused as a transfer mechanism. Doing this on the fragment environment (`seh_frag.py`), I'm getting a 30% wall-time improvement for simple settings with batch size 64 (I'm sure we could have fun maxing that out and seeing how far we can take GPU utilization).

Some notes:
- Sending `Batch` and `GraphActionCategorical` objects through shared buffers improves wall time.

Other changes:
- `GFNAlgorithm`: `global_cfg` is now set for all algorithms.
- `cond_info` is now folded into the batch object rather than being passed as an argument everywhere.
- Fixed `GraphActionCategorical.entropy`: when masks are used, gradients w.r.t. the logits would be NaN.
- Note, `EnvelopeQL` is still in a broken state; will fix in #127.
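The entropy fix mentioned above can be illustrated with a small numpy sketch (my own, not the repo's torch code): a masked action has probability 0, so the naive `p * log(p)` entropy term evaluates `0 * (-inf) = NaN`, and in an autograd framework the same pattern also poisons the gradients w.r.t. the logits. The fix is to exclude masked terms explicitly.

```python
import numpy as np

logits = np.array([2.0, 1.0, -np.inf])  # -inf marks a masked action
p = np.exp(logits - logits.max())
p /= p.sum()  # p[2] == 0 exactly

with np.errstate(divide="ignore", invalid="ignore"):
    naive = -(p * np.log(p)).sum()  # 0 * log(0) -> NaN

# Safe form: take the log only where p > 0 and zero out masked terms
safe = -np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0).sum()

assert np.isnan(naive) and not np.isnan(safe)
```

The inner `np.where` matters: even a term that is later discarded by the outer `where` would still produce a NaN gradient in autograd if `log(0)` appears anywhere in the graph, which is why masked entries must be replaced before the `log`, not after.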