Models just predict a mean blurry solution on the grasping task

Hey! I have a question about the models on the grasping task. I recently re-ran the scripts for it and visualized the outputs because the metrics looked very noisy to me: they varied a lot across the 5 validation folds and didn't match the paper (~0.7 top-1/p99 vs. the 0.9 reported in the paper). Maybe it would be more stable with a batch size higher than the default 64 in the repo? I have already tried 64 and 128, with the same resulting behaviour.

I see that the v-cond model (loaded by default) just learns to predict a blurry solution where everything in the middle of the image is graspable and everything outside is the negative class. In that case I don't think the model can be said to have learnt anything, since it doesn't seem to be making any object-wise predictions, and it would not make sense to compare representations on this task if they all lead to predictions of this kind. In some of the cross-validation folds, I also see that it just resorts to predicting "not graspable" everywhere (I assume because it's the majority class). Could this be because of a newly introduced bug? I was wondering whether you had stored outputs from the model previously, so that I can compare against those?
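One way to put numbers on the two failure modes described above (the same blurry map for every input, or "not graspable" everywhere) is sketched below. This is only an illustration: `collapse_diagnostics` is a hypothetical helper that is not part of the repo, and it assumes the positive-class probabilities have been collected into a tensor of shape (N, H, W).

```python
import torch


def collapse_diagnostics(probs: torch.Tensor, threshold: float = 0.5) -> dict:
    """Hypothetical helper, not from the repo.

    probs: (N, H, W) predicted probability of the "graspable" class per pixel
           (this layout is an assumption).
    """
    # Per-pixel std across the N inputs, averaged over the map. A value near
    # zero means every input receives essentially the same prediction map.
    across_input_std = probs.std(dim=0).mean().item()
    # Fraction of pixels predicted positive. A value of 0.0 means the model
    # predicts "not graspable" everywhere.
    positive_fraction = (probs > threshold).float().mean().item()
    return {
        "across_input_std": across_input_std,
        "positive_fraction": positive_fraction,
    }


# Synthetic stand-in for the collapsed case: one center blob reused for all inputs.
blob = torch.zeros(8, 8)
blob[2:6, 2:6] = 0.9
same_for_all = blob.expand(5, 8, 8)
blob_stats = collapse_diagnostics(same_for_all)

# Synthetic stand-in for the "all negative" case.
all_negative_stats = collapse_diagnostics(torch.zeros(5, 8, 8))

print(blob_stats)
print(all_negative_stats)
```

If the across-input std on real validation outputs is close to zero, that would support the reading that the predictions do not depend on the objects in the image.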

One bug that I did fix, in validation_step(), was: probs = torch.softmax(logits, dim=2) -> probs = torch.softmax(logits, dim=1). The original call was taking the softmax over the height dimension of the predicted outputs. I haven't seen any other issues so far.
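A minimal sketch of why the softmax dimension matters, assuming the logits are laid out as (batch, classes, height, width), which is what the fix above implies. The shapes and the random logits here are made up for illustration:

```python
import torch

# Assumed layout: (batch, classes, height, width). Shapes are hypothetical.
torch.manual_seed(0)
logits = torch.randn(2, 2, 4, 4)

probs_over_height = torch.softmax(logits, dim=2)   # original call: normalizes over height
probs_over_classes = torch.softmax(logits, dim=1)  # fixed call: normalizes over classes

ones = torch.ones(2, 4, 4)
# Per-pixel class probabilities sum to 1 only when normalizing over dim=1.
class_sum_ok = torch.allclose(probs_over_classes.sum(dim=1), ones)
class_sum_wrong = torch.allclose(probs_over_height.sum(dim=1), ones)
# With dim=2, each column of the map sums to 1 over the height axis instead.
height_sum = probs_over_height.sum(dim=2)

print(class_sum_ok, class_sum_wrong)
```

With dim=2 the values in each image column compete with each other rather than the classes at each pixel, so the result is not a per-pixel class distribution.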

Thanks! (I'm attaching an image with the probs visualized here, and will also share a thresholded version soon.)

[Image: predicted_probs_1_50_506b0b15fa97bed35101]

siddk / voltron-evaluation
