sucv / ABAW3

We achieved the 2nd and 3rd places in ABAW3 and ABAW5, respectively.

`modality` settings. #6

Open zgp123-wq opened 4 months ago

zgp123-wq commented 4 months ago

I'm encountering several challenges while training my model and would appreciate guidance on resolving them, particularly regarding the modality settings.

  1. RuntimeError with Batch Size Set to 12: When setting the batch size to 12, I encountered a RuntimeError due to inconsistent tensor sizes during stacking. The error message states: "stack expects each tensor to be equal size, but got [1919, 3, 40, 40] at entry 0 and [3673, 3, 40, 40] at entry 1."

  2. Out-of-Memory Error during Training with Batch Size 1: Despite reducing the batch size to 1 during training and selecting modality as "video," "mfcc," "vggish," and "VA_continuous_label," I faced persistent out-of-memory errors. It seems that the memory consumption remains high. Could you suggest strategies to mitigate memory usage during training, considering the specified modalities?

  3. Calculation Issue with Concordance Correlation Coefficient (CCC) during Validation: During the validation process, I encountered difficulties in calculating the Concordance Correlation Coefficient (CCC). Despite employing standard procedures and specifying modality as "mfcc," "vggish," "bert," and "VA_continuous_label," the calculation consistently fails. Are there any specific considerations regarding the modality setting that might affect CCC calculation? I would appreciate any advice on ensuring accurate CCC calculation during validation.
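[Editor's note] The first error comes from PyTorch's default batching, which stacks per-sample tensors into one batch tensor and therefore requires every sample in the batch to have exactly the same shape. A scaled-down, torch-free sketch of that check (hypothetical helper, shapes taken from the error message):

```python
# Mimic what torch.stack / the default collate_fn requires: every
# sample tensor in a batch must have an identical shape.

def stack_shapes(shapes):
    """Return the stacked batch shape, or raise like torch.stack does."""
    first = shapes[0]
    for s in shapes[1:]:
        if s != first:
            raise RuntimeError(
                f"stack expects each tensor to be equal size, "
                f"but got {list(first)} at entry 0 and {list(s)} at entry 1")
    return (len(shapes), *first)

# Two trials of different lengths, as reported in the error message:
trial_a = (1919, 3, 40, 40)
trial_b = (3673, 3, 40, 40)
try:
    stack_shapes([trial_a, trial_b])
except RuntimeError as e:
    print(e)
```

With batch_size = 1 no stacking across samples occurs, which is why the error only appears for larger batch sizes.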

Jesayy commented 4 months ago

Hello, may I ask what error you encountered with the last question? When I ran it, I got an error like "No such file or directory: '/aff_wild2_abaw5/mean_std_info.pkl'". Do you have this file?

sucv commented 4 months ago

Hi zgp123-wq,

  1. Does it also happen with batch size 8 or 2? If 8 or 2 works, then the only way to find out is to debug the code: set a breakpoint in `__getitem__(self, index)` in dataset.py and see which trial causes the error, then maybe go back to dataset_info.pkl, check the length recorded there against the actual mp4 length, etc.
  2. I don't run the code on my own desktop. I run it on computing servers, usually with hundreds of GB of RAM and video RAM. If you run it on a local PC (with, say, 32 GB of RAM), that may lead to this error. However, in case it is caused by some problematic code, please still set breakpoints, probably in dataset.py, to see what exactly causes the error and whether it is a code issue or truly insufficient RAM.
  3. Does the error only occur for those three modalities? It should have nothing to do with the modalities. Please set a breakpoint at the line calculating the CCC to make sure the preds and labels have the correct shapes, e.g., Nx1 and Nx1.
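[Editor's note] For reference when checking those shapes at the breakpoint, the Concordance Correlation Coefficient can be sketched in plain Python as follows (a hypothetical standalone implementation, not the repository's code):

```python
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
# preds and labels must be aligned 1-D sequences of equal length.

def ccc(preds, labels):
    assert len(preds) == len(labels) and len(preds) > 1, \
        "preds and labels must have the same (nontrivial) length"
    n = len(preds)
    mean_p = sum(preds) / n
    mean_l = sum(labels) / n
    var_p = sum((p - mean_p) ** 2 for p in preds) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(preds, labels)) / n
    return 2 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)

print(ccc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))  # perfect agreement -> 1.0
```

If preds or labels arrive with an unexpected extra dimension (e.g. Nx1x1 instead of Nx1), the variance and covariance terms can silently broadcast wrongly in tensor code, which is why checking the shapes at that line is the first thing to do.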

In short, I cannot answer your questions directly; the only way is to debug line by line. Sorry I wrote the code in such a poor manner.

Jesayy commented 4 months ago

Hello, I've encountered a similar error to the first problem you mentioned. Have you resolved it?

praveena2j commented 4 months ago

I think there is a bug in the code. The torch DataLoader expects samples of equal shape, but in the code each trial is treated as one sample by `__getitem__`. Since the trials have different shapes, it throws this error.

praveena2j commented 4 months ago

From the paper, I think the trials are divided into temporal sequences of length 300, and each sequence is considered one sample. Can you please confirm this?

praveena2j commented 4 months ago

I think to fix the issue we need to change the windowing parameter to True in this base/experiment.py line:

self.data_arranger.generate_partitioned_trial_list(window_length=self.window_length, hop_length=self.hop_length, fold=fold, windowing=True)

sucv commented 4 months ago

From the paper, I think the trials are divided into temporal sequences of length 300, and each sequence is considered one sample. Can you please confirm this?

confirmed.

sucv commented 4 months ago

I think to fix the issue we need to change the windowing parameter to True in this base/experiment.py line:

self.data_arranger.generate_partitioned_trial_list(window_length=self.window_length, hop_length=self.hop_length, fold=fold, windowing=True)

Thanks for your debugging and explanation. Indeed, setting windowing=True uses a sliding window to sample each trial, whereas setting windowing=False loads a complete trial. The latter is useful when you want to generate the output for the test trials with batch_size = 1.
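[Editor's note] The sliding-window sampling can be sketched as follows: each trial's frames are partitioned into fixed-length windows (length 300 per the paper, as confirmed above), with the last window clamped to the trial boundary so no frames are dropped. This is a hypothetical helper, not the repository's `generate_partitioned_trial_list`:

```python
def partition_trial(n_frames, window_length=300, hop_length=300):
    """Return the start indices of fixed-length windows covering a trial.

    The final window is shifted back so it ends exactly at the trial
    boundary, ensuring the tail frames are covered.
    """
    if n_frames <= window_length:
        return [0]
    starts = list(range(0, n_frames - window_length + 1, hop_length))
    if starts[-1] + window_length < n_frames:
        starts.append(n_frames - window_length)  # cover the tail
    return starts

# A 1919-frame trial (the length from the stack error) becomes 7 windows:
print(partition_trial(1919))  # [0, 300, 600, 900, 1200, 1500, 1619]
```

Because every window has the same length, the samples stack cleanly into batches, which is why windowing=True fixes the RuntimeError.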

But even when windowing=True, the output is still restored to its original length (averaged when one time step has multiple outputs due to window overlap), and the epoch CCC is then calculated over the restored output and the VA labels.
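[Editor's note] The restoration step described here can be sketched as follows: each window's predictions are scattered back to their original positions in the trial, and positions covered by more than one window are averaged. Again a hypothetical helper, not the repository's code:

```python
def restore_trial(window_outputs, starts, trial_length):
    """Scatter window predictions back to trial positions, averaging overlaps.

    window_outputs: list of per-window prediction sequences.
    starts: the start index of each window within the trial.
    """
    total = [0.0] * trial_length
    count = [0] * trial_length
    for out, start in zip(window_outputs, starts):
        for i, value in enumerate(out):
            total[start + i] += value
            count[start + i] += 1
    return [t / c for t, c in zip(total, count)]

# Two length-3 windows overlapping at index 2; the overlap is averaged:
print(restore_trial([[1, 1, 1], [3, 3, 3]], [0, 2], 5))
# [1.0, 1.0, 2.0, 3.0, 3.0]
```

The epoch CCC is then computed over these restored trial-length sequences against the VA labels, so the metric is unaffected by how the trials were windowed.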

praveena2j commented 4 months ago

Thank you for the confirmation

zgp123-wq commented 4 months ago

Thank you very much for your reply. However, I have noticed a strange phenomenon: in the training set evaluation, the CCC score is only slightly above 50, while in the validation set evaluation, it reaches 65. In theory, the CCC score in the training set evaluation should be very close to 100.

praveena2j commented 4 months ago

May I know where do you load data annotations in the script ?

sucv commented 4 months ago

Thank you very much for your reply. However, I have noticed a strange phenomenon: in the training set evaluation, the CCC score is only slightly above 50, while in the validation set evaluation, it reaches 65. In theory, the CCC score in the training set evaluation should be very close to 100.

Your "theory" has flaw then. Let me ask you,

Training CCC: 100
Validation CCC: 55

Training CCC: 55
Validation CCC: 54

Which model instance would you choose for submission?

Actually, it would only be "strange" had the code reached a CCC of 100 in training.

praveena2j commented 4 months ago

Can you share the "dataset_info.pkl" and "mean_std_info.pkl"?

praveena2j commented 4 months ago

Also, can you please let me know whether you trim the videos? "trim_video_fn" is not being called in the preprocessing.py script, as the "trim_video" param is not set in the config.py script.

zgp123-wq commented 3 months ago

[screenshot: train_ccc_and_val_ccc.png]
[attachment: datasetinfo_mean_sttdd.zip]

I'm sorry, I just saw the message. I have provided the train_ccc and val_ccc, as well as the compressed file containing "dataset_info.pkl" and "mean_std_info.pkl", in the attachment. @sucv @praveena2j

sucv commented 3 months ago

Thanks for your screenshot, now I understand. It seems to be under-fitting. But yeah, I also saw this in my training.

As the epochs go on, say, 10-20 epochs more, the training CCC should gradually increase and exceed the validation CCC.

Or, you may set "load_best_model_at_each_epoch=0" so that the model state updating is not restricted. I always set it to 1 and ignored the issue you mentioned, because doing so gives a small gain on the validation CCC.

Anyway, the root of this issue is unknown; it is probably related to the data/label distribution of this split.


sucv commented 3 months ago

Also, according to your screenshot, the previous best epoch was 7, and the model state was not updated until epoch 21. This is what I meant to happen: update only if a higher validation CCC is achieved; otherwise, load the historical best model state.

I don't want the training CCC to increase too fast.


praveena2j commented 3 months ago

Thanks for sharing the dataset_info.pkl file. It contains only 418 videos, but I guess it is supposed to have 594 videos.

sucv commented 3 months ago

Dear Gnana, you are supposed to generate them using the code. You already have everything needed to do so.


praveena2j commented 3 months ago

Thanks for the update.