Closed by imzhangyd 1 year ago
Evaluating models on sampled key frames with labels is a compromise, because existing datasets cannot provide per-frame labels for evaluation due to the high annotation cost. You are correct that it is more reasonable to evaluate on every frame if per-frame labels are available. In addition, some recent works propose evaluating temporal consistency (TC), which measures how consistent the segmentation results are across frames. In our paper, we report TC scores in Table 9.
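To make the two kinds of evaluation concrete, here is a rough NumPy sketch. It is not the evaluation code from the paper. It assumes that key-frame mIoU is accumulated only over frames that have a label, and that TC is approximated as the mIoU between the current prediction and the previous prediction aligned to the current frame. The function names and the `warp` callable (for example, optical-flow warping) are hypothetical placeholders, and the TC numbers in Table 9 may be computed differently.

```python
import numpy as np


def confusion_matrix(pred, label, num_classes, ignore_index=255):
    """Confusion matrix (rows: label, cols: prediction), skipping ignored pixels."""
    mask = label != ignore_index
    idx = num_classes * label[mask].astype(np.int64) + pred[mask].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean IoU over the classes that appear in the label or the prediction."""
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())


def keyframe_miou(preds, labels, num_classes):
    """mIoU over labeled key frames only.

    preds:  dict frame_index -> HxW predicted class map (every frame)
    labels: dict frame_index -> HxW label map (labeled key frames only)
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, label in labels.items():
        conf += confusion_matrix(preds[t], label, num_classes)
    return mean_iou(conf)


def temporal_consistency(preds, warp, num_classes):
    """Simplified TC: average mIoU between each prediction and the warped previous one.

    preds: list of HxW predicted class maps for consecutive frames
    warp:  hypothetical callable (prev_pred, t) -> prev_pred aligned to frame t
    """
    scores = []
    for t in range(1, len(preds)):
        warped_prev = warp(preds[t - 1], t)
        conf = confusion_matrix(preds[t], warped_prev, num_classes)
        scores.append(mean_iou(conf))
    return float(np.mean(scores))
```

Note that `keyframe_miou` needs labels but only touches the labeled frames, while `temporal_consistency` needs no labels and uses every frame, which is why the two scores complement each other.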
Thanks for your patience. I was wondering whether the main difference between video semantic segmentation and semi-supervised video semantic segmentation is whether unlabeled video is used for training.
Yes, you are correct.
Thank you very much! Some VSS methods aggregate features from neighboring unlabeled frames to segment the current frame, so I think these methods also use unlabeled frames for training and could be considered semi-supervised. Did I misunderstand something here?
Here is another point I'm confused about. The task of video semantic segmentation is to segment every frame of a video. But only a few frames are labeled in the test set, so the test performance in experiments is measured on a few images rather than on whole videos. I think this cannot represent the performance of video semantic segmentation methods. Did I misunderstand something here?