hustvl / EVF-SAM

Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Apache License 2.0

run error #12

Closed luoshuiyue closed 1 month ago

luoshuiyue commented 2 months ago

how to handle this: [two screenshots attached]

CoderZhangYx commented 2 months ago

The parameter passed to --version should be either your local checkpoint directory or a Hugging Face model name. If your server can connect to Hugging Face, simply use --version YxZhang/evf-sam2 and the code will automatically download the checkpoint from Hugging Face. If not, manually download the checkpoint from https://huggingface.co/YxZhang/evf-sam2 and pass the resulting directory to --version. Sorry for the confusion in our documentation; we will make it clearer soon.
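In code form, the resolution rule amounts to something like this. Note that `resolve_version` is an illustrative helper written for this comment, not the repo's actual function:

```python
import os

def resolve_version(version: str) -> str:
    """Treat --version as a local checkpoint directory if it exists,
    otherwise as a Hugging Face model id such as "YxZhang/evf-sam2"
    that the loader downloads automatically on first use."""
    if os.path.isdir(version):
        return os.path.abspath(version)  # manually downloaded checkpoint
    return version                       # hub id, fetched from Hugging Face
```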

luoshuiyue commented 2 months ago

Thanks, it works. But I still get different results from inference.py and inference_video.py using the same model and the same input picture: [screenshot attached]

Why does inference_video.py get worse results?

CoderZhangYx commented 2 months ago

We have encountered a similar phenomenon before, when we forgot to convert BGR to RGB during preprocessing. Did you modify the code? If not, could you provide us with the raw picture to help us figure out the cause? Thanks.
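For reference, the BGR-to-RGB swap mentioned above looks like this. This is a minimal numpy sketch; in practice OpenCV users typically call `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)`:

```python
import numpy as np

rng = np.random.default_rng(1)
bgr = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # OpenCV decodes as BGR

rgb = bgr[..., ::-1]  # reverse the channel axis: BGR -> RGB

# channel 0 of RGB is channel 2 of BGR, and vice versa; forgetting this
# swap shifts every colour channel and can visibly degrade the masks
assert np.array_equal(rgb[..., 0], bgr[..., 2])
assert np.array_equal(rgb[..., 2], bgr[..., 0])
```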

luoshuiyue commented 2 months ago

I didn't modify the code, but I used an image dataset, not frames taken from a video. Below is my first picture: [image "000" attached]

CoderZhangYx commented 2 months ago

What is your prompt?

luoshuiyue commented 2 months ago

"the person in black standing on the brown disc base whose shoulders and head are blocked"

CoderZhangYx commented 2 months ago

We observe the gap too. For now, we suspect the problem lies in some unaligned inference settings between the two predictors. We will dig into this problem and update this issue if we discover anything. Thanks for reporting the bug.

luoshuiyue commented 2 months ago

I find that the first few frames of the video inference results look poor, but after that the output becomes stable and good. Could the stable later predictions be propagated backward to refine the results at the beginning?

CoderZhangYx commented 2 months ago

I debugged the code and found three main reasons for this gap.

  1. The preprocessing in our training code and image inference code is https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/inference.py#L54-L61, where we interpolate twice. The sam2 video predictor preprocesses images as in https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/segment_anything_2/sam2/utils/misc.py#L94, where it uses the numpy resize API.

  2. We set the point prompt to None during image inference (https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/evf_sam.py#L183-L191), but the sam2 video predictor sets it to zero points (https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/segment_anything_2/sam2/modeling/sam2_base.py#L311-L313).

  3. We set https://github.com/hustvl/EVF-SAM/blob/8bfac9e376cc5f289cc382c1e1fe1223f687f9d4/model/evf_sam.py#L169 during training and image inference, but the sam2 video predictor requires multimask_output=True.
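The effect of point 1 (interpolating twice versus once) is easy to demonstrate. The toy nearest-neighbour resize below is a stand-in written for illustration, not the repo's actual preprocessing:

```python
import numpy as np

def nn_resize(img, h, w):
    """Toy nearest-neighbour resize (stand-in for the real interpolation calls)."""
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[ys][:, xs]

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

one_step = nn_resize(img, 1024, 1024)                       # single resize
two_step = nn_resize(nn_resize(img, 512, 512), 1024, 1024)  # resize twice

# the two pipelines feed numerically different arrays to the model,
# so the same image can yield slightly different masks
assert one_step.shape == two_step.shape
assert not np.array_equal(one_step, two_step)
```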

To summarize, the gap comes from misalignments in preprocessing and model settings between training and inference. We will fix these problems in a future release of the checkpoints.
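Points 2 and 3 can be caricatured in a few lines. Everything here (`encode_points`, `pick_mask`, the "pad token") is an illustrative toy, not EVF-SAM's or SAM-2's actual code:

```python
import numpy as np

def encode_points(points, embed_dim=4):
    """Toy sparse prompt encoder: None yields no tokens at all, while a
    padded zero point (as in the video predictor) yields one token."""
    if points is None:
        return np.zeros((0, embed_dim))
    pad_token = np.ones(embed_dim)  # stand-in for a learned padding embedding
    return np.stack([pad_token for _ in points])

def pick_mask(masks, iou_preds, multimask_output):
    """Toy head selection: multimask output keeps the highest-scoring of
    several candidates; single-mask output returns the dedicated head."""
    if multimask_output:
        return masks[int(np.argmax(iou_preds))]
    return masks[0]

# None vs. zero-point prompts produce different-shaped sparse embeddings,
# so the decoder sees different inputs even for the same image
assert encode_points(None).shape[0] == 0
assert encode_points([(0.0, 0.0)]).shape[0] == 1

# multimask_output=True and False can select different masks
masks = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
iou_preds = np.array([0.2, 0.8])
assert np.array_equal(pick_mask(masks, iou_preds, True), np.ones((2, 2)))
assert np.array_equal(pick_mask(masks, iou_preds, False), np.zeros((2, 2)))
```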

luoshuiyue commented 2 months ago

Thanks for your detailed explanation.

luoshuiyue commented 1 month ago

@CoderZhangYx Hi~ When I run video inference with inference_video.py, why are there several interrupted frames in the middle: one frame has a mask on the bag, the next has none, and this happens twice in a row (with the prompt "person"; if I use the prompt "person with bag", the results get worse). What I want is for the person standing on the disc base to be segmented completely, no matter how complicated his clothing and accessories are, with the clothing and accessories included in the mask. [screenshot attached]

CoderZhangYx commented 1 month ago

Hi, thank you for using our model. Due to limited training data, our model may not behave perfectly in all cases, especially in zero-shot video prediction. Our next checkpoint release will include more training data; you may try again then.

CoderZhangYx commented 1 month ago

Our latest code release has resolved the gap. Reopen this issue or open a new one if you have further problems.