haibo-qiu / FROM

[TPAMI 2021] End2End Occluded Face Recognition by Masking Corrupted Features
https://arxiv.org/abs/2108.09468

Application of code to images in the wild #12

Closed · PolarRobin closed this issue 5 months ago

PolarRobin commented 5 months ago

How would I apply the method to an arbitrary image just to get out the occlusion mask? Is there a way?

haibo-qiu commented 5 months ago

Hi @PolarRobin,

Our paper focuses on taking an arbitrary face image as input and outputting the face representation for a recognition task in an end-to-end manner. It uses the Occlusion Pattern Predictor to predict the pattern of occlusion during training to assist the recognition model in learning.

During inference, the Occlusion Pattern Predictor is actually discarded. However, you can still use this module to generate a specific occlusion pattern for your input, which can be regarded as a coarse segmentation mask for the input face image. If you are looking for a more accurate face mask, I suggest searching for related methods with keywords like facial mask detection/segmentation.
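To make the idea concrete, here is a toy illustration (not our actual code) of how one occlusion pattern, i.e. a rectangular block of cells in a KxK grid, can be rendered as such a coarse binary mask:

import numpy as np

# Toy illustration (not this repo's code): render one occlusion pattern,
# i.e. a rectangular block of cells in a KxK grid, as a coarse binary mask
# over an HxW face crop.
def pattern_to_mask(rows, cols, K=5, H=112, W=96):
    """rows/cols are (start, end) cell indices of the occluded block."""
    mask = np.zeros((H, W), dtype=np.uint8)
    cell_h, cell_w = H / K, W / K
    r0, r1 = rows
    c0, c1 = cols
    mask[int(r0 * cell_h):int(r1 * cell_h), int(c0 * cell_w):int(c1 * cell_w)] = 1
    return mask

# e.g. the bottom two rows of a 5x5 grid (roughly the mouth and chin region):
coarse_mask = pattern_to_mask(rows=(3, 5), cols=(0, 5))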

PolarRobin commented 5 months ago

Hi @haibo-qiu, thanks for your answer! I was actually just looking for a way to check whether or not the mouth of a face is occluded. So it would suffice to have a simple rectangle (e.g. in the KxK grid) that I can compare against face landmark detection. It sounds like there is a way to use your occlusion pattern for this purpose. Can you point me to the specific function that returns the occlusion grid pattern shown in Figures 5, 11, and 13 of your paper? :)

haibo-qiu commented 5 months ago

Hi @PolarRobin,

If you want to obtain the specific pattern from the network, you should focus on this function. It generates a probability for each pattern, and you can use argmax to get the index of the corresponding pattern. Regarding how to generate all the occlusion patterns, you may refer to this function.
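Putting the two together, a rough sketch of the inference side would look like this (I am assuming here that get_grids returns one entry per pattern and that the forward signature matches our evaluation code; please verify both against the repository):

import torch
import utils  # this repository's utils module

# Sketch only: decode the most probable occlusion pattern for one face,
# given a loaded `model` and an aligned 112x96 input tensor `image`.
grids = utils.get_grids(112, 96, 5)            # all patterns for a 5x5 grid

with torch.no_grad():
    _, _, vec, _ = model(image, None)          # vec: per-pattern scores
probs = torch.softmax(vec, dim=1)              # probability for each pattern
pattern_idx = int(probs.argmax(dim=1).item())  # pattern 0 means "not occluded"
pattern_cells = grids[pattern_idx]             # coarse occluded region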

PolarRobin commented 5 months ago

Thanks a lot for the quick answer, much appreciated! I've implemented it now. It seems to work, though the results are mixed. I will run a larger test set soon. Anyway, thanks again!

PolarRobin commented 5 months ago

After some more testing, I found that it unfortunately doesn't work very well; hardly at all, I would say.

I am using it as in the code below. After obtaining the most_probable_pattern, I compare it with a list of the patterns that touch a specific tile of the 4x4 or 5x5 grid that I am interested in. I checked that function and it returns the correct patterns.

Is there an error in how I get the most probable pattern, or do I perhaps have to prepare the data in a specific way?

import os

import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# repo-specific imports -- adjust the module paths to your checkout of FROM
import utils
from models import LResNet50E_IR_FPN

def get_models_params():
    models_root = 'FROM_occlusion_detection/pretrained/'
    models_names = ['model_p5_w1_9938_9470_6503.pth.tar']

    models_params = []
    for name in models_names:
        model_path = os.path.join(models_root, name)
        assert os.path.exists(model_path), 'invalid model name!'

        checkpoint = torch.load(model_path)
        state_dict = checkpoint['state_dict']

        model_name = name.split('.')[0]
        models_params.append((model_name, state_dict))
    return models_params

def load_model(model_path):
    checkpoint = torch.load(model_path)
    state_dict = checkpoint['state_dict']
    # strip a possible 'module.' prefix (left over from DataParallel training),
    # otherwise load_state_dict(..., strict=False) silently loads nothing
    state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v
                  for k, v in state_dict.items()}

    # Assuming get_grids only needs the image dimensions, which should be
    # specified somewhere in the configuration (config.NETWORK.IMAGE_SIZE)
    H, W = 192, 192

    # the grid size K is encoded in the checkpoint name, e.g. 'p5' -> 5x5 grid
    model_name = os.path.basename(model_path).split('.')[0]
    pattern = int(model_name[model_name.find('p') + 1])
    num_mask = len(utils.get_grids(H, W, pattern))

    model = LResNet50E_IR_FPN(num_mask=num_mask)
    model.load_state_dict(state_dict, strict=False)
    model.eval()
    if torch.cuda.is_available():
        model.cuda()
    return model

def load_image(image_path):
    # transformations matching the 112x96 input size the model expects
    transform = transforms.Compose([
        transforms.Resize((112, 96)),                                     # resize to the exact size
        transforms.ToTensor(),                                            # convert to a tensor in [0, 1]
        transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # map to [-1, 1]
    ])

    image = Image.open(image_path).convert('RGB')
    return transform(image).unsqueeze(0)  # add batch dimension

def calculate_mouth_occlusion(image_path, model):
    image = load_image(image_path)

    # move the input to the GPU if one is available (load_model moved the model)
    if torch.cuda.is_available():
        image = image.cuda()

    # one forward pass; vec holds the per-pattern scores
    with torch.no_grad():
        fc_mask, mask, vec, fc = model(image, None)

    most_probable_pattern = int(np.argmax(vec.cpu().numpy()))
    return most_probable_pattern

haibo-qiu commented 5 months ago

Hi @PolarRobin,

I think there might be two potential issues with the code you provided.

Also, please be aware that the occlusion pattern predictor may not always produce reasonable mask predictions, as evidenced by the failure case presented in Figure 13 of our paper. This is because we assign more weight to the face recognition loss than to the pattern prediction loss, since the latter serves only an auxiliary role in aiding face recognition.
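Schematically, the training objective looks like the following (the variable names are illustrative, not our actual code; the exact weighting is in the paper and config):

# Illustrative sketch of the loss weighting: the pattern prediction loss is
# only an auxiliary term, so it gets a smaller weight than the recognition
# loss and the predictor is not tuned to be a standalone occlusion detector.
loss = loss_recognition + lambda_pattern * loss_pattern  # with lambda_pattern < 1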

PolarRobin commented 5 months ago

Thanks for your quick and concrete answer. I implemented your method. The images look well aligned now, right?

[image]

Unfortunately, I only get 0 predictions now. Am I still missing something?

Do I have to specify a crop_size?

haibo-qiu commented 5 months ago

Firstly, the image you've provided appears to be aligned. Are you referring to the occlusion in the lower-left area of the face, the white region? Please note that we did not incorporate this type of occlusion (setting pixel values to 0 or 255) during training; you can refer to our occlusion cases in Figure 7 of the paper.

[image]

Consequently, the model may not perform well in this scenario.

Pattern 0 signifies that the image is not occluded. Regarding the image above, I believe this is a reasonable prediction from our model, as it was not exposed to any 0- or 255-pixel occlusions during training.

Lastly, it's not necessary to specify a crop_size; the default is 112x96.
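For reference, if you ever need to do the alignment yourself, here is a minimal sketch using the widely used 5-point template for 112x96 crops (OpenCV and scikit-image are assumptions here, not dependencies of this repo; the five landmarks can come from any detector, e.g. MTCNN):

import cv2
import numpy as np
from skimage import transform as trans

# Standard 5-point reference landmarks (left eye, right eye, nose tip,
# left mouth corner, right mouth corner) for a 112x96 crop, as used by
# SphereFace/CosFace-style pipelines.
REF_PTS_112x96 = np.array([
    [30.2946, 51.6963],
    [65.5318, 51.5014],
    [48.0252, 71.7366],
    [33.5493, 92.3655],
    [62.7299, 92.2041],
], dtype=np.float32)

def align_face(img, landmarks5):
    """Warp a face image to 112x96 given five landmarks (same order as above)."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(landmarks5, dtype=np.float32), REF_PTS_112x96)
    M = tform.params[0:2, :]                  # 2x3 affine matrix
    return cv2.warpAffine(img, M, (96, 112))  # dsize is (width, height)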

PolarRobin commented 5 months ago

Thanks for the clarification. Actually, it is a white smartphone the person is holding in her hand. Other images in my dataset have hands or microphones in front of the chin and mouth, yet the predictions are all 0. The paper stated that the method might be transferable to in-the-wild scenarios, but it could be that it requires the particular occlusion patterns seen in training. Thank you again.