I reimplemented this using the CLIP models from HuggingFace; the reimplementation is available at https://github.com/tanganke/fusion_bench.
The results of the experiments in the paper are available in results.
Some of the code in this repository is based on the following repositories: