I noticed in your code that the feature-metric loss only uses the first output of the feature extractor (ResNet), which means it only adopts the output of the 7x7 convolution with stride 2. Why not use the last output of the encoder, which has a much larger receptive field?

In line 193 of mono_fm/net.py:
src_f = self.extractor(img)[0]
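To make the question concrete, here is a minimal sketch of what indexing the extractor's output list means. This is an illustrative stand-in, not the actual mono_fm/net.py implementation: the class name `FeatureExtractor`, the stage layout, and the channel counts are assumptions; only the pattern of returning multi-scale features and selecting index `[0]` (the stride-2 stem output) versus index `[-1]` (the deepest features) mirrors the code being discussed.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Hypothetical ResNet-style extractor returning features at every scale."""

    def __init__(self):
        super().__init__()
        # First stage: 7x7 conv with stride 2, as in a standard ResNet stem.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        # Deeper stages, each halving spatial resolution
        # (stand-ins for ResNet layer1..layer4).
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True)
        )
        self.stage3 = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)
        )

    def forward(self, x):
        f1 = self.stem(x)     # stride-2 features: high resolution, small receptive field
        f2 = self.stage2(f1)  # stride-4 features
        f3 = self.stage3(f2)  # stride-8 features: largest receptive field
        return [f1, f2, f3]

img = torch.randn(1, 3, 64, 64)
feats = FeatureExtractor()(img)
src_f = feats[0]    # what the quoted line selects: the 7x7 stride-2 conv output
deep_f = feats[-1]  # the alternative the question asks about
print(src_f.shape)   # torch.Size([1, 64, 32, 32])
print(deep_f.shape)  # torch.Size([1, 256, 8, 8])
```

Selecting `[0]` keeps the feature map at half the input resolution, which preserves fine spatial detail for the photometric-style warping in the feature-metric loss, at the cost of the receptive field the deeper outputs would provide.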