ooza opened this issue 1 month ago

Thanks for this great job! I have a small dataset of 30 video clips and I want to do zero-shot action recognition with your model. Do you have a simple demo file that I can use? Or could you tell me which function/script/config I should update to work on custom videos?
Hi @ooza, thank you for your interest in our work!
I will share a sample notebook demo in the upcoming days.
But if you want to use your custom datasets before that, please follow the below instructions. (Please also refer to the example instructions for public datasets in DATASETS.md).
Put all your custom videos under the `/PATH/TO/VIDEOS` folder.

Create a label file in `tc-clip/labels/custom_dataset_labels.csv`. The format should be like:

```
id,name
0,abseiling
1,air drumming
2,answering questions
3,applauding
...
```

Create an annotation file that contains a list of video filenames and their corresponding labels in `tc-clip/datasets_splits/custom_dataset_anns.txt`. Each line of the txt file should be `<filename> <class id>`. For example, suppose that we have `aaa.mp4`, `bbb.mp4`, ..., `zzz.mp4` under the `/PATH/TO/VIDEOS` folder (a small helper sketch for generating these two files follows the run command below):

```
aaa.mp4 0
bbb.mp4 0
...
zzz.mp4 3
```
Create a dataset yaml file for your custom dataset in `tc-clip/configs/data/custom_dataset.yaml`. Below is an example of the inference-only case:

```yaml
# @package _global_
data:
  test:
    - name: custom_dataset
      protocol: top1
      dataset_list:
        - dataset_name: custom_dataset
          root: /PATH/TO/VIDEOS
          num_classes: <YOUR_ACTUAL_NUM_CLASSES>
          label_file: tc-clip/labels/custom_dataset_labels.csv
          ann_file: tc-clip/datasets_splits/custom_dataset_anns.txt
```
Now run the below command. Note the `data=custom_dataset` part:

```bash
torchrun --nproc_per_node=4 main.py -cn zero_shot \
    data=custom_dataset output=/PATH/TO/OUTPUT \
    trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth
```
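If it helps, a small helper for generating the label file and the annotation file could look like the sketch below. This is not part of the repo; `video_to_class` is a placeholder mapping that you fill in with your own filenames and class names.

```python
# Hypothetical helper (not part of the repo) that writes the two files above in the
# expected formats. Filenames are relative to /PATH/TO/VIDEOS, and the output paths
# assume you run this from the directory that contains tc-clip/.
import csv

video_to_class = {          # placeholder: your own filename -> class name mapping
    "aaa.mp4": "abseiling",
    "bbb.mp4": "abseiling",
    "zzz.mp4": "applauding",
}

class_names = sorted(set(video_to_class.values()))
class_to_id = {name: i for i, name in enumerate(class_names)}

# Label file: "id,name" header plus one row per class
with open("tc-clip/labels/custom_dataset_labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    for name in class_names:
        writer.writerow([class_to_id[name], name])

# Annotation file: one "<filename> <class id>" line per video
with open("tc-clip/datasets_splits/custom_dataset_anns.txt", "w") as f:
    for filename, name in sorted(video_to_class.items()):
        f.write(f"{filename} {class_to_id[name]}\n")
```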
If you have any follow-up questions, feel free to ask. I will also mention you after adding a sample notebook.
Thanks @byminji for your quick reply.
I had to modify some source files to avoid using apex's amp, since it is deprecated; I used autocast from PyTorch's amp instead.

torch and torchvision versions: 2.1.2+cu118, 0.16.2+cu118
CUDA version: 12.2

Updated source files:
File: tc_clip.py
Function: forward
Update:

```python
import torch.cuda.amp as amp
...

        with amp.autocast():
            image_features, context_tokens, attn, source = self.image_encoder(image.type(self.dtype),
                                                                               return_layer_num=self.return_layer_num,
                                                                               return_attention=return_attention,
                                                                               return_source=return_source)
```

File: engine.py
Function: validate
Update:

```python
with amp.autocast():
    output = model(image_input)
```

File: main.py
Function: main_testing
Update:

```python
with amp.autocast():
    test_stats = validate(val_loader, model, logger, config)
```
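For reference, the same wrapper can also be written with the device-agnostic autocast API (a minimal sketch; `torch.amp.autocast` with `device_type` is available from roughly PyTorch 1.10 onwards, while `torch.cuda.amp.autocast` is deprecated in more recent releases):

```python
import torch

# Device-agnostic form of the same mixed-precision wrapper
with torch.amp.autocast(device_type="cuda"):
    output = model(image_input)
```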
As I said before, I have a small dataset of less than 30 short videos and 3 classes. So I updated the `accuracy_top1_top5` function in `tools.py` to handle a smaller number of classes dynamically.
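Roughly, the adaptation looks like this (a sketch of the idea rather than the exact code in `tools.py`):

```python
import torch

# Sketch of a top-1/top-5 accuracy helper that clamps k to the number of classes
# (so "top-5" effectively becomes "top-3" for a 3-class dataset).
def accuracy_top1_top5(similarity: torch.Tensor, labels: torch.Tensor):
    num_classes = similarity.shape[-1]
    k = min(5, num_classes)
    _, indices_k = similarity.topk(k, dim=-1)       # [batch, k] class indices
    indices_1 = indices_k[:, 0]                     # top-1 prediction per sample
    correct = indices_k.eq(labels.view(-1, 1))      # [batch, k] boolean hits
    acc1 = correct[:, :1].sum()                     # number of top-1 hits
    acck = correct.sum()                            # number of top-k hits
    return acc1, acck, indices_1, indices_k
```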
I got this result:

My question is: how can I depict/analyze the predicted class for each video?

BTW, in the results the only output is `log_rank0.txt`.

Thanks again
Hi @ooza, you can check individual filenames and predictions by modifying some parts of the code. You can get the file id metadata by running your command with `++gather_filename=true` (see datasets/build.py#L174). Below is a code snippet that I've used before.
```python
from utils.print_utils import colorstr


@torch.no_grad()
def print_individual_predictions(val_loader, model, logger, config):
    """ Code snippet to print individual predictions """
    assert config.num_clip == 1  # Only supports single-view sampling case
    assert config.get("gather_filename")  # Run command with "++gather_filename=true" override

    model.eval()
    num_classes = len(val_loader.dataset.classes)
    class_mapping = {idx: cls for idx, cls in val_loader.dataset.classes}

    metric_logger = MetricLogger(delimiter=" ")
    header = 'Val:'
    logger.info(f"{config.num_clip * config.num_crop} views inference")

    for idx, batch_data in enumerate(metric_logger.log_every(val_loader, config.print_freq, logger, header)):
        image = batch_data['imgs'].cuda(non_blocking=True)
        image = image.view((-1, config.num_frames, 3) + image.size()[-2:])
        label_id = batch_data['label'].cuda(non_blocking=True)
        label_id = label_id.reshape(-1)  # [b]

        # Get file id metadata
        file_id = batch_data['file_id']

        b, t, c, h, w = image.size()
        tot_similarity = torch.zeros((b, num_classes)).cuda()

        # Forward
        output = model(image)
        logits = output["logits"]
        similarity = logits.view(b, -1).softmax(dim=-1)
        tot_similarity += similarity

        # Classification score
        acc1, acc5, indices_1, _ = accuracy_top1_top5(tot_similarity, label_id)
        metric_logger.meters['acc1'].update(float(acc1) / b * 100, n=b)
        metric_logger.meters['acc5'].update(float(acc5) / b * 100, n=b)

        # Print individual predictions
        for batch_idx in range(b):
            filename = val_loader.dataset.video_infos[file_id[batch_idx]]['filename']
            foldername, videoname = filename.split("/")[-2], filename.split("/")[-1]
            gt_label, pred_label = label_id[batch_idx].item(), indices_1[batch_idx].item()
            gt_cls, pred_cls = class_mapping[gt_label], class_mapping[pred_label]
            flag = colorstr("blue", "Correct") if gt_label == pred_label else colorstr("red", "Wrong")
            print(f"{videoname}: [{flag}] GT {gt_cls}, Pred {pred_cls}")

    metric_logger.synchronize_between_processes()
    logger.info(f' * Acc@1 {metric_logger.acc1.global_avg:.3f} Acc@5 {metric_logger.acc5.global_avg:.3f}')
    return metric_logger.get_stats()
```
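For example, reusing the zero-shot command from above with the `++gather_filename=true` override appended:

```bash
torchrun --nproc_per_node=4 main.py -cn zero_shot \
    data=custom_dataset output=/PATH/TO/OUTPUT \
    trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth \
    ++gather_filename=true
```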
Thank you.
Thanks @byminji!
I added the `print_individual_predictions` function just before `main_testing`. Then, I modified `main_testing` to include a check on the `gather_filename` flag:
```python
# If gather_filename is true, print individual predictions
with amp.autocast():
    if config.get("gather_filename", False):
        logger.info("Using print_individual_predictions function.")
        test_stats = print_individual_predictions(val_loader, model, logger, config)
    else:
        logger.info("Using validate function.")
        test_stats = validate(val_loader, model, logger, config)
```
I had to add this at the beginning of the function:

```python
if config.get("gather_filename", False):
    config.num_clip = 1
```

Otherwise I got this error:

```
File "/home/vlm/tc-clip/main.py", line 151, in print_individual_predictions
    assert config.num_clip == 1  # Only supports single-view sampling case
AssertionError
```
The issue now is a mismatch between the size of the preds and the targets:
More details:
But when I modified the existing multi-view inference logic by setting `config.num_clip = 1` instead of 2 (`elif config.protocol == 'zero_shot' and config.multi_view_inference: config.num_clip = 1`), it works!
Is this safe and correct? Or is there another, more generic way to do it? Any further explanations or details will be much appreciated.
Thanks
Hi @ooza, Multi-view inference is a common strategy for increasing the accuracy of video recognition models by ensembling multiple predictions from differently sampled frames. Our paper used a 16 frames x 2 clips setting for comparison with 32 frame sampling models. You can either remove the multi-view inference or modify the code snippet to show results from multiple predictions. I simply implemented the single-view case only because it was for analysis, not for evaluation.
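For example, one way to adapt the snippet to the multi-view setting is to fold the views back out of the batch dimension and average the per-view softmax scores before taking the argmax. A rough sketch (not the repo's exact evaluation code; it assumes all views of a video are packed contiguously along the batch dimension, as in the single-view snippet above):

```python
import torch

def ensemble_views(logits: torch.Tensor, n_views: int) -> torch.Tensor:
    """Average per-view softmax scores and return one prediction per video.

    logits: [batch * n_views, num_classes], with the views of each video
    stored contiguously along the batch dimension.
    """
    num_classes = logits.shape[-1]
    scores = logits.softmax(dim=-1).view(-1, n_views, num_classes).mean(dim=1)
    return scores.argmax(dim=-1)  # [batch]
```

With `config.num_clip = 2` and `config.num_crop = 1`, something like `pred_label = ensemble_views(output["logits"], config.num_clip * config.num_crop)` could then replace the single-view top-1 lookup inside the loop.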
@byminji I would be interested in a demo notebook as well, looking forward to that!
@byminji is it normal to get a lower accuracy using multi-view inference?!
@ooza Usually, we get a higher accuracy with multi-view inference.
Hi @ooza @qingy1337, we've released a notebook demo for custom videos. Thanks :)