ttlmh / Bridge-Prompt

[CVPR2022] Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

take advantage of Br-Prompt to extract frame-wise features for custom dataset #3


yo3nglau commented 2 years ago

Recently I have been running action segmentation experiments on a custom dataset. It consists of about 100 videos with 15 action classes; each video contains 15-20 action instances and is about 2 minutes long. I have already applied ASFormer to this dataset and obtained fairly reasonable performance. However, my current way of extracting features is less effective than the one you propose, so I would like to use your approach on this custom dataset. As far as I can tell, there are two main pieces I need to put in place, and I would appreciate your guidance on both.

Firstly, I'd like to figure out how the train_split1/2/3/4_nf16_ol[2, 1, 0.5]_ds[1, 2, 4].npy files were made. Concretely, for split1 of GTEA:

array([['S2_Cheese_C1', '0', '1'],
       ['S2_Cheese_C1', '32', '1'],
       ['S2_Cheese_C1', '64', '1'],
       ...,
       ['S4_Tea_C1', '1056', '4'],
       ['S4_Tea_C1', '1088', '4'],
       ['S4_Tea_C1', '1120', '4']], dtype='<U21')

If I'm not mistaken, should I generate rows in the format [video_name, start frame (stepped by an nf*ds*ol stride), ds]?

Besides, I noticed that different ds and ol values are used for different datasets. Could you give me practical advice on these configs for the custom dataset described above?

Secondly, as presented in the conclusion part of the paper,

...manual label is a more accurate and concise form of semantic abstraction.

I understand that semantic information such as "putting bread on cheese and bread" should be stored in x_id2act/x_act2id.json. So presumably I need to describe my 15 actions in a similar way, right? If so, is there a limit on the length of each description? Are there any other details I should pay attention to? For example, max_act in x_ft.yaml seems to affect something; could you give me guidance on these subtle details?

By the way, I noticed that randaug in x_ft.yaml is commented out. Would enabling it yield further performance gains?

I look forward to your guidance. Thanks!

ttlmh commented 2 years ago

Hi, thank you for your close attention to our work.

For your first question, we have uploaded a Python script, preprocess/extract_datawindow.py, which we used to generate the window-based labels from the raw GTEA dataset. The downsampling rate (ds) and the overlap rate (ol) are chosen heuristically according to the characteristics of the dataset. Generally speaking, the windows should not be too long (otherwise the information about each action becomes too sparse) or too short (otherwise a window cannot contain more than two actions). An easy approach is to mix long and short windows, as we did for GTEA. However, you should also watch the cost, since finer-grained windows create more training data. We suggest first trying the same ds&ol settings as we used for GTEA (remember to take the fps into account as well).
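
For reference, here is a minimal sketch of the windowing logic. This is not the actual extract_datawindow.py; it assumes each row is [video_name, start_frame, ds] and that window starts advance by a stride of nf * ds * ol, which matches the split1 example above:

```python
import numpy as np

def make_window_labels(video_lengths, nf=16, ol_ds_pairs=((2, 1), (1, 2), (0.5, 4))):
    """video_lengths: dict mapping video name -> total frame count."""
    rows = []
    for name, n_frames in video_lengths.items():
        for ol, ds in ol_ds_pairs:
            span = nf * ds              # raw frames covered by one window
            stride = int(nf * ds * ol)  # assumed offset between window starts
            for start in range(0, n_frames - span + 1, stride):
                rows.append([name, str(start), str(ds)])
    return np.array(rows, dtype='<U21')

# Example: reproduces rows such as ['S2_Cheese_C1', '32', '1'] from split1
# (the frame counts here are made up for illustration).
labels = make_window_labels({'S2_Cheese_C1': 1200, 'S4_Tea_C1': 1200})
np.save('train_split1_nf16_custom.npy', labels)
```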

For your second question, you should create .json files similar to the ones we provide for the three given datasets. Since language instructions are essential to our method, you should describe the 15 actions in natural language (you don't have to strictly follow the same format as the GTEA labels). The maximum length of each language instruction is 77 tokens (the context length of the CLIP text encoder). You should also add a new class for your own dataset in datasets/datasets.py, following the existing ones. max_act in x_ft.yaml is the maximum number of actions in each window (i.e. 16 frames). We did not enable RandAugment during our training process, but you can give it a try to see whether the performance improves!
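
For illustration only, a minimal sketch of building such mapping files; the file names and action phrases below are placeholders, so check the .json files we provide for GTEA for the exact structure:

```python
import json

# Placeholder descriptions for a custom dataset's classes (only two shown);
# each phrase should read as a natural-language description of the action.
actions = ["pour water into the cup", "stir the tea with a spoon"]  # ... 15 in total

id2act = {str(i): act for i, act in enumerate(actions)}  # id -> description
act2id = {act: i for i, act in enumerate(actions)}       # description -> id

with open("custom_id2act.json", "w") as f:
    json.dump(id2act, f, indent=2)
with open("custom_act2id.json", "w") as f:
    json.dump(act2id, f, indent=2)
```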

yo3nglau commented 2 years ago


Thank you for the timely guidance!

I've made some attempts based on your suggestions over the past few days and have made some progress, but I still have two main points of confusion and hope for your assistance. Firstly, I don't fully understand how ds and ol work. I followed the same ds&ol settings as GTEA but found the cost prohibitive (roughly 32-33 hours per split for training on 4 NVIDIA GeForce 2080 Ti GPUs), so after a few trials I empirically set ds=[2,4,8] and ol=[4,2,1]. Some of those trials produced windows with more than 6 actions, or descriptions exceeding the 77-token limit, in a single 16-frame window, even though I enlarged max_acts to 7/8 and shortened the action description labels as much as possible. The cost per split is now feasible (9-10 hours), but I'm not sure whether these settings make sense. My other doubt is whether I should select the final-epoch model (epoch 50) or the model with the minimal total_loss.

With your patient and kind help, I feel that I am close to the final answer, and I look forward to your further instructions!

ttlmh commented 2 years ago

For your first doubt, ds is the downsampling rate of the frames in each window, and ol is the overlap rate between consecutive windows. You can re-examine the generated .npy files and the detailed implementation in datasets/datasets.py to see how ds&ol take effect. In our experience, packing too many actions into one window can hurt the current model, so you should shorten the time range covered by each window (which is determined by ds). We suggest trying ds=[2,4] and ol=[2,2].
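
As a quick sanity check before committing to a setting, you can estimate how much time each window covers. A tiny sketch (the fps value is an assumption; substitute your dataset's actual frame rate):

```python
nf = 16   # frames per window
fps = 15  # assumed frame rate of the custom dataset

for ds in (1, 2, 4, 8):
    # each window covers nf * ds raw frames, i.e. nf * ds / fps seconds
    print(f"ds={ds}: each window spans {nf * ds / fps:.1f}s of video")
```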

For your second doubt, we used the final-epoch model to extract the features. We will further compare the results between the final epoch and the minimal-total_loss epoch.
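
If you would like to compare the two yourself, below is a rough sketch of saving both checkpoints during training; the model and the training step are stand-ins, not our actual training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the actual model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def train_one_epoch():
    """Placeholder: one epoch on random data, returning the total loss."""
    x, y = torch.randn(32, 8), torch.randn(32, 2)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

best_loss = float("inf")
for epoch in range(50):
    total_loss = train_one_epoch()
    if total_loss < best_loss:  # track the minimal-total_loss checkpoint
        best_loss = total_loss
        torch.save(model.state_dict(), "model_best_loss.pt")
torch.save(model.state_dict(), "model_final.pt")  # final-epoch checkpoint (what we used)
```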