[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Why is it that during the computation of segmentation results, the model() function is used instead of model.generate()? Wouldn't this mean that when predicting the next token, the information viewed is from the actual token rather than the predicted one? #67
When computing the segmentation results, the model() function is employed. def evaluate_model_performance(validation_loader, model, args):
Trackers for metrics
When computing the caption results, model.generate() is utilized.
def inference(instructions, inputs):
Extract the inputs