ooza opened this issue 1 month ago

Thanks for this great job! I have a small dataset of 30 video clips and I want to do zero-shot action recognition with your model. Do you have a simple demo file that I can use? Or could you tell me which function/script/config I should update to work on custom videos?
Hi @ooza, thank you for your interest in our work!
I will share a sample notebook demo in the upcoming days.
But if you want to use your custom datasets before that, please follow the below instructions. (Please also refer to the example instructions for public datasets in DATASETS.md).
Put all your custom videos under the `/PATH/TO/VIDEOS` folder.

Create a label file in `tc-clip/labels/custom_dataset_labels.csv`. The format should be like:

```
id,name
0,abseiling
1,air drumming
2,answering questions
3,applauding
...
```

Create an annotation file that contains a list of video filenames and their corresponding labels in `tc-clip/datasets_splits/custom_dataset_anns.txt`. Each line of the txt file should be `<filename> <class id>`. For example, suppose that we have `aaa.mp4`, `bbb.mp4`, ..., `zzz.mp4` under the `/PATH/TO/VIDEOS` folder (a small helper sketch for generating these two files follows the run command below):

```
aaa.mp4 0
bbb.mp4 0
...
zzz.mp4 3
```
Create a dataset yaml file for your custom dataset in `tc-clip/configs/data/custom_dataset.yaml`. Below is an example of the inference-only case:

```yaml
# @package _global_
data:
  test:
    - name: custom_dataset
      protocol: top1
      dataset_list:
        - dataset_name: custom_dataset
          root: /PATH/TO/VIDEOS
          num_classes: <YOUR_ACTUAL_NUM_CLASSES>
          label_file: tc-clip/labels/custom_dataset_labels.csv
          ann_file: tc-clip/datasets_splits/custom_dataset_anns.txt
```
Now run the below command. Note the `data=custom_dataset` part:

```bash
torchrun --nproc_per_node=4 main.py -cn zero_shot \
    data=custom_dataset output=/PATH/TO/OUTPUT \
    trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth
```
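If it helps, a small helper for generating the label file and the annotation file could look like the sketch below. This is not part of the repo; `video_to_class` is a placeholder mapping that you fill in with your own filenames and class names.

```python
# Hypothetical helper (not part of the repo) that writes the two files above in the
# expected formats. Filenames are relative to /PATH/TO/VIDEOS, and the output paths
# assume you run this from the directory that contains tc-clip/.
import csv

video_to_class = {          # placeholder: your own filename -> class name mapping
    "aaa.mp4": "abseiling",
    "bbb.mp4": "abseiling",
    "zzz.mp4": "applauding",
}

class_names = sorted(set(video_to_class.values()))
class_to_id = {name: i for i, name in enumerate(class_names)}

# Label file: "id,name" header plus one row per class
with open("tc-clip/labels/custom_dataset_labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    for name in class_names:
        writer.writerow([class_to_id[name], name])

# Annotation file: one "<filename> <class id>" line per video
with open("tc-clip/datasets_splits/custom_dataset_anns.txt", "w") as f:
    for filename, name in sorted(video_to_class.items()):
        f.write(f"{filename} {class_to_id[name]}\n")
```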
If you have any follow-up questions, feel free to ask. I will also mention you after adding a sample notebook.
Thanks @byminji for your quick reply.
I had to modify some source files to avoid using apex's amp, since it is deprecated; I used autocast from PyTorch's amp instead.

torch and torchvision versions: 2.1.2+cu118, 0.16.2+cu118
CUDA version: 12.2

Updated source files:
File: tc_clip.py
Function: forward
Update:

```python
import torch.cuda.amp as amp
...

        with amp.autocast():
            image_features, context_tokens, attn, source = self.image_encoder(image.type(self.dtype),
                                                                               return_layer_num=self.return_layer_num,
                                                                               return_attention=return_attention,
                                                                               return_source=return_source)
```

File: engine.py
Function: validate
Update:

```python
with amp.autocast():
    output = model(image_input)
```

File: main.py
Function: main_testing
Update:

```python
with amp.autocast():
    test_stats = validate(val_loader, model, logger, config)
```
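For reference, the same wrapper can also be written with the device-agnostic autocast API (a minimal sketch; `torch.amp.autocast` with `device_type` is available from roughly PyTorch 1.10 onwards, while `torch.cuda.amp.autocast` is deprecated in more recent releases):

```python
import torch

# Device-agnostic form of the same mixed-precision wrapper
with torch.amp.autocast(device_type="cuda"):
    output = model(image_input)
```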
As I said before, I have a small dataset of less than 30 short videos and 3 classes. So I updated the `accuracy_top1_top5` function in `tools.py` to handle a smaller number of classes dynamically.
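Roughly, the adaptation looks like this (a sketch of the idea rather than the exact code in `tools.py`):

```python
import torch

# Sketch of a top-1/top-5 accuracy helper that clamps k to the number of classes
# (so "top-5" effectively becomes "top-3" for a 3-class dataset).
def accuracy_top1_top5(similarity: torch.Tensor, labels: torch.Tensor):
    num_classes = similarity.shape[-1]
    k = min(5, num_classes)
    _, indices_k = similarity.topk(k, dim=-1)       # [batch, k] class indices
    indices_1 = indices_k[:, 0]                     # top-1 prediction per sample
    correct = indices_k.eq(labels.view(-1, 1))      # [batch, k] boolean hits
    acc1 = correct[:, :1].sum()                     # number of top-1 hits
    acck = correct.sum()                            # number of top-k hits
    return acc1, acck, indices_1, indices_k
```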
I got this result:

My question is: how can I depict/analyze the predicted class for each video?

BTW, in the results the only output is `log_rank0.txt`.

Thanks again
Hi @ooza, you can check individual filenames and predictions by modifying some parts of the code. You can get the file id metadata by running your command with `++gather_filename=true` (see datasets/build.py#L174). Below is a code snippet that I've used before.
```python
from utils.print_utils import colorstr


@torch.no_grad()
def print_individual_predictions(val_loader, model, logger, config):
    """ Code snippet to print individual predictions """
    assert config.num_clip == 1  # Only supports single-view sampling case
    assert config.get("gather_filename")  # Run command with "++gather_filename=true" override

    model.eval()
    num_classes = len(val_loader.dataset.classes)
    class_mapping = {idx: cls for idx, cls in val_loader.dataset.classes}

    metric_logger = MetricLogger(delimiter=" ")
    header = 'Val:'
    logger.info(f"{config.num_clip * config.num_crop} views inference")

    for idx, batch_data in enumerate(metric_logger.log_every(val_loader, config.print_freq, logger, header)):
        image = batch_data['imgs'].cuda(non_blocking=True)
        image = image.view((-1, config.num_frames, 3) + image.size()[-2:])
        label_id = batch_data['label'].cuda(non_blocking=True)
        label_id = label_id.reshape(-1)  # [b]

        # Get file id metadata
        file_id = batch_data['file_id']

        b, t, c, h, w = image.size()
        tot_similarity = torch.zeros((b, num_classes)).cuda()

        # Forward
        output = model(image)
        logits = output["logits"]
        similarity = logits.view(b, -1).softmax(dim=-1)
        tot_similarity += similarity

        # Classification score
        acc1, acc5, indices_1, _ = accuracy_top1_top5(tot_similarity, label_id)
        metric_logger.meters['acc1'].update(float(acc1) / b * 100, n=b)
        metric_logger.meters['acc5'].update(float(acc5) / b * 100, n=b)

        # Print individual predictions
        for batch_idx in range(b):
            filename = val_loader.dataset.video_infos[file_id[batch_idx]]['filename']
            foldername, videoname = filename.split("/")[-2], filename.split("/")[-1]
            gt_label, pred_label = label_id[batch_idx].item(), indices_1[batch_idx].item()
            gt_cls, pred_cls = class_mapping[gt_label], class_mapping[pred_label]
            flag = colorstr("blue", "Correct") if gt_label == pred_label else colorstr("red", "Wrong")
            print(f"{videoname}: [{flag}] GT {gt_cls}, Pred {pred_cls}")

    metric_logger.synchronize_between_processes()
    logger.info(f' * Acc@1 {metric_logger.acc1.global_avg:.3f} Acc@5 {metric_logger.acc5.global_avg:.3f}')
    return metric_logger.get_stats()
```
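For example, reusing the zero-shot command from above with the `++gather_filename=true` override appended:

```bash
torchrun --nproc_per_node=4 main.py -cn zero_shot \
    data=custom_dataset output=/PATH/TO/OUTPUT \
    trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth \
    ++gather_filename=true
```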
Thank you.
Thanks @byminji!
I added the `print_individual_predictions` function just before `main_testing`. Then, I modified `main_testing` to include a check on the `gather_filename` flag:
```python
# If gather_filename is true, print individual predictions
with amp.autocast():
    if config.get("gather_filename", False):
        logger.info("Using print_individual_predictions function.")
        test_stats = print_individual_predictions(val_loader, model, logger, config)
    else:
        logger.info("Using validate function.")
        test_stats = validate(val_loader, model, logger, config)
```
I had to add this at the beginning of the function:

```python
if config.get("gather_filename", False):
    config.num_clip = 1
```

Otherwise I got this error:

```
File "/home/vlm/tc-clip/main.py", line 151, in print_individual_predictions
    assert config.num_clip == 1  # Only supports single-view sampling case
AssertionError
```
The issue now is a mismatch between the size of the preds and the targets:
More details:
But when I modified the existing multi-view inference logic by setting `config.num_clip = 1` instead of 2 (`elif config.protocol == 'zero_shot' and config.multi_view_inference: config.num_clip = 1`), it works!
Is this safe and correct? Or is there another, more generic way to do it? Any further explanations or details will be much appreciated.
Thanks
Hi @ooza, Multi-view inference is a common strategy for increasing the accuracy of video recognition models by ensembling multiple predictions from differently sampled frames. Our paper used a 16 frames x 2 clips setting for comparison with 32 frame sampling models. You can either remove the multi-view inference or modify the code snippet to show results from multiple predictions. I simply implemented the single-view case only because it was for analysis, not for evaluation.
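For example, one way to adapt the snippet to the multi-view setting is to fold the views back out of the batch dimension and average the per-view softmax scores before taking the argmax. A rough sketch (not the repo's exact evaluation code; it assumes all views of a video are packed contiguously along the batch dimension, as in the single-view snippet above):

```python
import torch

def ensemble_views(logits: torch.Tensor, n_views: int) -> torch.Tensor:
    """Average per-view softmax scores and return one prediction per video.

    logits: [batch * n_views, num_classes], with the views of each video
    stored contiguously along the batch dimension.
    """
    num_classes = logits.shape[-1]
    scores = logits.softmax(dim=-1).view(-1, n_views, num_classes).mean(dim=1)
    return scores.argmax(dim=-1)  # [batch]
```

With `config.num_clip = 2` and `config.num_crop = 1`, something like `pred_label = ensemble_views(output["logits"], config.num_clip * config.num_crop)` could then replace the single-view top-1 lookup inside the loop.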
@byminji I would be interested in a demo notebook as well, looking forward to that!
@byminji is it normal to get a lower accuracy using multi-view inference?!
@ooza Usually, we get a higher accuracy with multi-view inference.
Hi @ooza @qingy1337, we've released a notebook demo for custom videos. Thanks :)