pliang279 / MFN

[AAAI 2018] Memory Fusion Network for Multi-view Sequential Learning
MIT License

Feature selection #4

Open valbarriere opened 5 years ago

valbarriere commented 5 years ago

Hi Paul, thanks for sharing the code. I have a question about the feature selection, which is not mentioned in your paper. Since we don't have the file /media/bighdd5/Paul/mosi/fs_mask.pkl, could you tell us which parameters work best on that dataset and how you obtained them? Cheers, Valentin

ghost commented 5 years ago

@valbarriere the feature selection was done in a previous paper: "Multimodal sentiment analysis with word-level fusion and reinforcement learning". This is only done for CMU-MOSI.

And here are the values (first for covarep and then facet): [[1, 3, 6, 25, 60], [0, 2, 5, 10, 11, 12, 14, 17, 20, 21, 22, 24, 25, 29, 30, 31, 32, 36, 37, 40]]
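For illustration, a minimal sketch of how such index lists could be applied as a feature mask, assuming word-aligned COVAREP/FACET arrays of shape `[examples, time, features]` (the variable names and placeholder dimensions below are illustrative, not taken from the repo):

    import numpy as np

    # Feature indices quoted above: first COVAREP, then FACET.
    covarep_idx = [1, 3, 6, 25, 60]
    facet_idx = [0, 2, 5, 10, 11, 12, 14, 17, 20, 21, 22, 24, 25,
                 29, 30, 31, 32, 36, 37, 40]

    def select_features(features, indices):
        """Keep only the selected feature columns of an [examples, time, features] array."""
        return features[:, :, indices]

    # Placeholder tensors standing in for the real CMU-MOSI features
    # (74 and 47 are illustrative feature dimensions, not the dataset's actual sizes).
    covarep = np.random.randn(10, 20, 74)
    facet = np.random.randn(10, 20, 47)
    print(select_features(covarep, covarep_idx).shape)  # (10, 20, 5)
    print(select_features(facet, facet_idx).shape)      # (10, 20, 20)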

valbarriere commented 5 years ago

Ok thanks! I just saw you already linked the ICMI paper in another SDK issue yesterday.

Since I'm here, did you also use padding on the POM dataset (for MOSI the length of all the sequences is 20)? I couldn't find any information about that in the paper. I'm trying to replicate the results in order to compare my model with the MFN on POM.

ghost commented 5 years ago

We actually did. You can get the exact POM data from here: http://immortal.multicomp.cs.cmu.edu/raw_datasets/old_processed_data/pom/data/

We actually compute the expected audio, visual and verbal contexts per sentence (averaging the word embeddings within each sentence), since LSTMs are not good with long sequences. POM and ICT-MMMO are the only datasets where we do this. I think the data is already in this format.
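For clarity, a minimal sketch of that kind of sentence-level averaging, assuming word-level features shaped `[num_words, feature_dim]` and known sentence lengths (the names are illustrative, not the repo's actual preprocessing code):

    import numpy as np

    def average_per_sentence(word_features, sentence_lengths):
        """Collapse word-level features into one averaged vector per sentence."""
        sentence_vectors = []
        start = 0
        for length in sentence_lengths:
            sentence_vectors.append(word_features[start:start + length].mean(axis=0))
            start += length
        return np.stack(sentence_vectors)  # [num_sentences, feature_dim]

    # Example: 12 word embeddings split into sentences of 5, 4 and 3 words.
    words = np.random.randn(12, 300)
    print(average_per_sentence(words, [5, 4, 3]).shape)  # (3, 300)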

valbarriere commented 5 years ago

Great, thanks! I also started running the experiments on ICT-MMMO, MOUD and YouTube. But I think it would be better with the new configurations used in "Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities". The MFN results are really different in that article (going from 87.5 to 73.8 for ICT-MMMO). Do you also have easy access to them?

Finally, for the POM dataset there are 17 labels per video; can you tell me where to find the names of the labels associated with each of the 17 columns?

Thanks again !

ghost commented 5 years ago

I am actually not an author on that paper, so I don't know how those experiments were done. Let me add @pliang279 to the thread as well; Paul can probably also answer the question about the label names.

valbarriere commented 5 years ago

Ok thanks, I'll wait for Paul's answer. Should I send him an email?

ghost commented 5 years ago

@valbarriere I think that would be a good idea.

valbarriere commented 5 years ago

OK, I just sent @pliang279 an email. I will summarize here whatever comes out of the discussion as soon as I have answers.

pliang279 commented 5 years ago

Hey @valbarriere, I just saw your email. Here are some answers:

  1. Yes, the MMMO dataset (and the Youtube dataset) changed during the course of 2018, because we updated our video and audio feature extractor versions as well as their sampling rates. In subsequent papers, all models were retrained on these new versions of the datasets. I will upload these new datasets now.

  2. Here are the names of the labels:

0: confident, 1: passionate, 2: voice pleasant, 3: dominant, 4: credible, 5: vivid, 6: expertise, 7: entertaining, 8: reserved, 9: trusting, 10: relaxed, 11: outgoing, 12: thorough, 13: nervous, 14: sentiment, 15: persuasive, 16: humorous

We did not report results on index 14 (sentiment) since we ran the model on 3 other sentiment analysis datasets. (This index-to-trait mapping is also sketched as a Python dict after this list.)

  3. The hyperparameters for POM are different from those for MOSI.

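For convenience, the index-to-trait mapping from item 2 above as a Python dict (a sketch; the `[num_videos, 17]` label layout is an assumption, not confirmed by the repo):

    POM_TRAITS = {
        0: "confident", 1: "passionate", 2: "voice pleasant", 3: "dominant",
        4: "credible", 5: "vivid", 6: "expertise", 7: "entertaining",
        8: "reserved", 9: "trusting", 10: "relaxed", 11: "outgoing",
        12: "thorough", 13: "nervous", 14: "sentiment", 15: "persuasive",
        16: "humorous",
    }

    # Example: pull out the "persuasive" column from a labels array of shape [num_videos, 17].
    # persuasive_labels = labels[:, 15]
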
valbarriere commented 5 years ago

Thanks for the details @pliang279! I still have 2 questions:

  1. Where can I find the uploaded versions of the datasets?

  2. Can you tell me the hyperparameter grid you used, so that I can reproduce your results on POM, like for MOSI?

valbarriere commented 5 years ago

Hi @A2Zadeh, @pliang279, just to be sure: I know the hyperparameters of the best models should not be the same for POM and MOSI; I'm talking about the grid used to search for the best hyperparameters.

I'm trying to replicate the MFN results on the POM dataset but cannot reach your performance (I stopped after 100 runs, which I think is fair...). Did you use the same hyperparameter grid on the POM dataset as the one used on MOSI (below)? I cannot reproduce the paper's results with this grid...

    import random

    # One random draw from the MOSI hyperparameter grid.
    config = dict()
    hl = random.choice([32,64,88,128,156,256])   # hidden size of the language LSTM
    ha = random.choice([8,16,32,48,64,80])       # hidden size of the audio LSTM
    hv = random.choice([8,16,32,48,64,80])       # hidden size of the visual LSTM
    config["h_dims"] = [hl,ha,hv]
    config["memsize"] = random.choice([64,128,256,300,400])
    config["windowsize"] = 2
    config["batchsize"] = random.choice([32,64,128,256])
    config["num_epochs"] = 50
    config["lr"] = random.choice([0.001,0.002,0.005,0.008,0.01])
    config["momentum"] = random.choice([0.1,0.3,0.5,0.6,0.8,0.9])
    # Delta-memory attention networks.
    NN1Config = dict()
    NN1Config["shapes"] = random.choice([32,64,128,256])
    NN1Config["drop"] = random.choice([0.0,0.2,0.5,0.7])
    NN2Config = dict()
    NN2Config["shapes"] = random.choice([32,64,128,256])
    NN2Config["drop"] = random.choice([0.0,0.2,0.5,0.7])
    # Multi-view gated memory retain/update gates.
    gamma1Config = dict()
    gamma1Config["shapes"] = random.choice([32,64,128,256])
    gamma1Config["drop"] = random.choice([0.0,0.2,0.5,0.7])
    gamma2Config = dict()
    gamma2Config["shapes"] = random.choice([32,64,128,256])
    gamma2Config["drop"] = random.choice([0.0,0.2,0.5,0.7])
    # Output / prediction network.
    outConfig = dict()
    outConfig["shapes"] = random.choice([32,64,128,256])
    outConfig["drop"] = random.choice([0.0,0.2,0.5,0.7])
ghost commented 5 years ago

@valbarriere that is strange. Do you let your models train for a large number of epochs? Do you use Adam? How close do you get to the paper results?

valbarriere commented 5 years ago

I stop after 30 epochs (I saw that after 30 epochs it generally does not improve), use Adam, and do 100 runs of the grid search.

I just started a new test on the first column with the stopping criterion set to 50 epochs, and couldn't obtain better results (even worse than before: 1.021 for the best model).

The best MAE I got for the first columns (mine vs. the paper's):

1.001 vs 0.952
1.015 vs 0.993
0.892 vs 0.882
0.876 vs 0.835
0.986 vs 0.903
0.959 vs 0.908
0.918 vs 0.886
0.948 vs 0.913
0.848 vs 0.821
0.528 vs 0.521
0.575 vs 0.566

Maybe it is the number of runs... How many runs did you do before obtaining the best results for each of the columns?

ghost commented 5 years ago

Well, we definitely do a lot of runs on the validation set. However, we also do multitask learning, in which we output all the values at the same time as opposed to just one value at a time; that helps a bit with the performance. I think 50 epochs is also too low: we were doing around 2500 and picked the model with the best validation performance. Hope this helps. Keep us in the loop on how the experiments go.
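To illustrate the multitask setup described here, a minimal PyTorch sketch of an output head that regresses all 17 POM traits at once from some fused representation (`fused_dim`, the hidden size and the dropout value are assumptions for illustration; this is not the actual MFN code):

    import torch.nn as nn

    class MultitaskHead(nn.Module):
        """Regress all 17 POM speaker traits at once from a fused representation."""
        def __init__(self, fused_dim, num_traits=17, hidden=64, drop=0.5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(fused_dim, hidden),
                nn.ReLU(),
                nn.Dropout(drop),
                nn.Linear(hidden, num_traits),  # one output per trait
            )

        def forward(self, fused):
            return self.net(fused)  # [batch, 17]

    # Training would then compare all 17 outputs against all 17 labels at once, e.g.:
    # loss = nn.L1Loss()(head(fused_batch), trait_labels)  # trait_labels: [batch, 17]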

valbarriere commented 5 years ago

Ok, thanks for the information.

To summarize: multitask learning over the different traits, each model trained for 2500 epochs (50 times more than for MOSI, where you stopped at 50 epochs, which seems like a lot), and you took the best model on the validation set. You did that "definitely a lot of times" over different hyperparameter values.

Since you did multitask learning, is there a single model that reaches the best performance for all the speaker traits, or are there several best models, each learned in a multitask fashion (one per trait, for example)?

I'll keep you posted on the results. Thanks again!

ghost commented 5 years ago

@valbarriere great. Yes, we pick the best model for each trait; there is no single model that does best on all of them. In a way, we use the other POM labels to help with training (the other POM labels are not inputs to the model, but outputs). It goes without saying that the baselines in our tables are trained the same way.