Open MonolithFoundation opened 1 month ago
The automatic mask predictor samples a grid of points and calls the decoder repeatedly. I have actually tried this with the ONNX model (not implemented here) but found it quite slow compared to PyTorch. I guess this is because the decoder does not use FlashAttention-2 when exported to ONNX, and that adds up if you run inference a few hundred times per image to get all masks.
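To make the cost concrete, here is a rough sketch of the grid sampling described above and of how many decoder calls it implies. This is not the code from this repo or from SAM 2: `build_point_grid`, `iter_point_batches`, and `count_decoder_calls` are hypothetical helper names, the grid size and batch size are illustrative values, and whether an exported ONNX decoder accepts more than one point prompt per call depends on how it was exported.

```python
from typing import Iterator, List, Tuple

Point = Tuple[float, float]


def build_point_grid(n_per_side: int) -> List[Point]:
    """Return n_per_side**2 evenly spaced (x, y) points in normalized [0, 1] coords."""
    offset = 1.0 / (2 * n_per_side)  # keep points away from the image border
    step = (1.0 - 2 * offset) / (n_per_side - 1) if n_per_side > 1 else 0.0
    coords = [offset + i * step for i in range(n_per_side)]
    # Row-major order: x varies fastest, y slowest.
    return [(x, y) for y in coords for x in coords]


def iter_point_batches(points: List[Point], batch_size: int) -> Iterator[List[Point]]:
    """Yield consecutive chunks of at most batch_size points."""
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]


def count_decoder_calls(n_per_side: int, batch_size: int) -> int:
    """Number of decoder invocations needed to cover the whole grid."""
    points = build_point_grid(n_per_side)
    return sum(1 for _ in iter_point_batches(points, batch_size))


if __name__ == "__main__":
    # One point per call versus several points per call (if the export allows it).
    print(count_decoder_calls(32, 1))
    print(count_decoder_calls(32, 64))
```

If the exported decoder can take a batch of point prompts, passing each batch in a single call cuts the number of session runs considerably, which may matter more than the attention kernel. That would need to be measured on the actual model.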
Have you referenced another SAM 2 ONNX implementation? It looks like they have everything working, including video tracking.
How can I get all masks directly?