tkipf / c-swm

Contrastive Learning of Structured World Models

What do you use for baseline implementation? #6


balloch commented 3 years ago

Do you use your own code for the baseline implementations of World Models (with AE and VAE), or a publicly available library? Can you point me toward the library/code you used? It would be useful for reproducing the results. (@abaheti95 you'll be interested too)

tkipf commented 3 years ago

Thank you for your question! We use our own implementation for this baseline, following the architecture described in Appendix D.5 of our paper.

Essentially, the CNN encoder architecture is the same as in the C-SWM model, with two differences: the last CNN layer produces 32 feature maps (instead of num_objects), and we flatten this representation over width and height into a single [width*height*32] feature vector (as opposed to [num_objects, width*height] feature vectors for the C-SWM model). We then apply the same encoder MLP as in C-SWM to this flattened representation to arrive at the final 32-dim embedding of the image (a 32-dim mean plus a 32-dim variance vector for a VAE).

As for the decoders, please see: https://github.com/tkipf/c-swm/blob/master/modules.py
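For concreteness, here is a minimal PyTorch sketch of such a baseline encoder. The `BaselineEncoder` class name, the layer sizes, and the input shape are illustrative assumptions, not code from the repo; only the points above (last CNN layer with 32 feature maps, flattening into a single [width*height*32] vector, the shared encoder MLP, and the 32-dim output doubled into mean/variance for the VAE) come from this comment.

```python
import torch
import torch.nn as nn


class BaselineEncoder(nn.Module):
    """Hedged sketch of the World Model baseline encoder.

    Layer sizes and names are assumptions; see Appendix D.5 of the
    paper and modules.py in the repo for the actual architecture.
    """

    def __init__(self, width=5, height=5, input_channels=3,
                 hidden_dim=512, embedding_dim=32, vae=False):
        super().__init__()
        self.vae = vae
        # CNN backbone (illustrative): the final layer outputs 32
        # feature maps instead of num_objects, as described above.
        self.cnn = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        flat_dim = width * height * 32
        # Same encoder-MLP structure as in C-SWM, but applied to one
        # flattened [width*height*32] vector rather than per-object slots.
        self.mlp = nn.Sequential(
            nn.Linear(flat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim,
                      2 * embedding_dim if vae else embedding_dim),
        )

    def forward(self, obs):
        h = self.cnn(obs)
        h = h.flatten(start_dim=1)  # -> [batch, width*height*32]
        out = self.mlp(h)
        if self.vae:
            # 32-dim mean and 32-dim (log-)variance vector for the VAE.
            mu, logvar = out.chunk(2, dim=-1)
            return mu, logvar
        return out


# Example usage (input shape is illustrative):
# enc = BaselineEncoder(vae=True)
# mu, logvar = enc(torch.randn(8, 3, 5, 5))
```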