What I meant for n-shot training was to pass a batch of 1-query & n-support images and (perhaps) average the n mask results to get a single mask for loss.
My apology. I don't understand what you mean by "to get five support images ... for 5-shot training".
For example, our model provides 5 mask predicions where each pixel is in {0,1} given 5 support images, and these predictions are element-wise summed (you think of this as a voting process for each pixel position) to provide a single mask in {0, ..., 5}. Dividing by the maximum voting score means all the voting scores are normalized so that they are in [0, 1].
Hello, Thanks for your great work on FSS! I have some confusion about Implementation details of 5-shot training.