The param passed to --version should be either your local checkpoint directory or a Hugging Face model name. If your server can connect to Hugging Face, simply use --version YxZhang/evf-sam2 and the code will automatically download the checkpoint from Hugging Face. If not, manually download the checkpoint from https://huggingface.co/YxZhang/evf-sam2 and pass that exact directory to --version. Sorry for the confusion caused by our documentation; we will make it clearer soon.
Thanks, it works. But I still get different results from `inference.py` and `inference_video.py` with the same model and the same input picture. Why does `inference_video.py` give worse results?
We used to encounter a similar phenomenon when we forgot to convert BGR to RGB during preprocessing. Did you modify the code? If not, would you please provide us with the raw picture to help us figure out the reason? Thanks.
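For reference, a minimal sketch of that conversion (the file name and the use of OpenCV here are only illustrative, not the repo's actual preprocessing code):

```python
# Illustrative only: OpenCV loads images in BGR order, so they must be
# converted to RGB before being fed to a model trained on RGB inputs.
import cv2

img_bgr = cv2.imread("frame.jpg")                   # hypothetical path; HxWx3, BGR order
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # HxWx3, RGB order
```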
I didn't modify the code, but I used an image dataset rather than frames extracted from a video. Below is my first picture:
What is your prompt?
the person in black standing on the brown disc base whose shoulders and head are blocked
We observe the gap too. For now, we suspect the problem lies in some unaligned inference settings between the two predictors. We will dig into this problem and update this issue when we have any findings. Thanks for reporting the bug.
I find that the first few frames of the video inference results look poor, but after that the masks become stable and good. Could the later, stable results be propagated back to refine the frames at the beginning?
I debugged the code and found three main reasons for this gap.
First, the preprocessing in our training code and image inference code is https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/inference.py#L54-L61, where we interpolate twice, while the sam2 video predictor preprocesses frames in https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/segment_anything_2/sam2/utils/misc.py#L94, where the numpy resize API is used.
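As a minimal illustration (dummy tensor and arbitrary sizes, not the repo's actual preprocessing), two resize pipelines applied to the same image generally do not produce identical inputs to the model:

```python
# Sketch only: two different resize pipelines yield slightly different pixel
# values for the same image, which is enough to shift the predicted masks.
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 480, 640)  # dummy image tensor in [0, 1]

# Pipeline A: interpolate twice (to an intermediate size, then to the final size)
a = F.interpolate(img, size=(512, 682), mode="bilinear", align_corners=False)
a = F.interpolate(a, size=(1024, 1024), mode="bilinear", align_corners=False)

# Pipeline B: interpolate once, directly to the final size
b = F.interpolate(img, size=(1024, 1024), mode="bilinear", align_corners=False)

print((a - b).abs().max())  # non-zero: the two pipelines feed the model different inputs
```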
Second, we set the point prompt to None during image inference (https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/evf_sam.py#L183-L191), but the sam2 video predictor sets it to zero points (https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/segment_anything_2/sam2/modeling/sam2_base.py#L311-L313).
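Roughly, the difference looks like this (illustrative shapes only, not a copy of either code path):

```python
# Sketch: the image path passes no point prompt at all, while the video
# predictor falls back to a single zero/padding point, so the prompt encoder
# sees different inputs in the two cases.
import torch

B = 1  # batch size

# Image inference: no point prompt
points = None

# Video predictor fallback: one dummy point at (0, 0) with label -1 ("not a point")
point_coords = torch.zeros(B, 1, 2)
point_labels = -torch.ones(B, 1, dtype=torch.int32)
points = (point_coords, point_labels)
```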
Third, we set https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/evf_sam.py#L169 during training and image inference, but the sam2 video predictor demands multimask_output=True.
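A self-contained sketch of why this matters (random tensors, hypothetical shapes): with multimask_output=True the decoder returns several candidate masks and the one with the highest predicted IoU is kept, which need not coincide with the single mask returned when multimask_output=False.

```python
# Sketch of candidate selection under multimask_output=True (dummy data).
import torch

B = 1
candidate_masks = torch.rand(B, 3, 256, 256)  # 3 candidates from multimask_output=True
iou_predictions = torch.rand(B, 3)            # predicted IoU per candidate

best_idx = iou_predictions.argmax(dim=1)                  # pick the highest-IoU candidate
best_mask = candidate_masks[torch.arange(B), best_idx]
print(best_mask.shape)  # torch.Size([1, 256, 256])
```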
To summarize, the gap comes from misaligned preprocessing and model settings between training and inference. We will fix these problems in a future release of checkpoints.
Thanks for your detailed explanation.
@CoderZhangYx Hi~ When I run video inference with `inference_video.py`, why are there several inconsistent frames in the middle, where one frame has a mask on the bag and the next has none, twice in a row? (The prompt is "person"; if I use the prompt "person with bag", the results get worse. What I want is for the person standing on the disc base to be segmented completely, no matter how complicated his clothing and accessories are, with the clothing and accessories picked out together with him.)
Hi, thank you for using our model. Due to limited training data, our model may not behave perfectly in all cases, especially in zero-shot video prediction. Our next checkpoint release will include more training data; you may try again then.
Our latest code release has closed the gap. Reopen this issue or open a new one if you have further problems.
how to handle this: