lessonxmk / head_fusion

About the split of training and test #3

Open wsstriving opened 3 years ago

wsstriving commented 3 years ago

Hi, in your latest papers you claim to have achieved the "best" performance across all the compared methods, but as described in the paper you only split the training and test data randomly. My question is: did you consider avoiding overlapping speakers between the training and test data? If not, it is not fair to make such a claim. Please correct me if I am wrong, thanks.
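
For reference, a minimal sketch of the two kinds of splits being discussed, assuming each sample is a (feature_path, label, speaker_id) tuple; the function names and data layout are illustrative and not taken from the repository:

```python
# Illustrative sketch (not from the repository): random vs. speaker-independent split.
# Assumes each sample is a (feature_path, label, speaker_id) tuple, e.g. from IEMOCAP.
import random

def random_split(samples, test_ratio=0.2, seed=0):
    """Speaker-dependent: utterances from the same speaker can land in both sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def speaker_independent_split(samples, test_speakers):
    """Speaker-independent: every utterance from a held-out speaker goes to the test set."""
    test_speakers = set(test_speakers)
    train = [s for s in samples if s[2] not in test_speakers]
    test = [s for s in samples if s[2] in test_speakers]
    return train, test
```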

hemarathore commented 3 years ago

@wsstriving I have also opened an issue, and he has not replied to it yet. But I think you can compare his results with other work that uses a similar arrangement. In the paper he does not claim a speaker-independent analysis; I think he got the best results for the speaker-dependent case.

The other problem is that in the current model I could not find the head fusion itself: he just concatenates the attention coefficients, with no averaging as described in the paper. Also, the train/test segmentation is done on the whole dataset, not separately on the improvised, scripted, or combined subsets for the four emotions. So I see these two problems; please correct me, anyone, if I have misunderstood something.

hemarathore commented 3 years ago

@wsstriving could you please solve my queries?

wsstriving commented 3 years ago

Hi hemarathore, actually I am not going to reproduce his experiments; I am writing my own code using some basic models for speaker recognition. I opened this issue just to check the splitting details. I have nearly finished the code and will check the performance in both the speaker-dependent and speaker-independent setups to see how much impact the different setups have. After that, I will try to look into his code. One more question: can you confirm that his results are the best in the speaker-dependent setup? Since I am not familiar with the SER task, it would be great if you could help confirm. Best

hemarathore commented 3 years ago

Hey, thanks. You can check on Google; I have read many papers on SER, and with his segmentation, for the improvised audio files, 76% is the current highest as far as I know. He compared against recent papers in his paper. Also, you can't trust every author and publication. Please let me know when you complete your task; I have reproduced the results and am also working on my own. We can discuss this a lot if you want.


hemarathore commented 3 years ago

I don't know why he is not replying. There is one more problem with the paper: it says you take the mean of all attention heads, so you should be averaging by the number of heads, but in the code this is done with an average pooling layer with an output size of [1, 1]. How is that the attention mean? Please, can someone else clarify this?
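
To illustrate the distinction being raised, here is a small sketch of averaging attention maps over the head dimension versus pooling them spatially with AdaptiveAvgPool2d((1, 1)); the tensor shapes are assumptions for illustration and are not taken from the repository's code:

```python
# Illustrative sketch (shapes assumed, not taken from the repository's code):
# averaging attention maps over the head dimension vs. spatially pooling each map.
import torch
import torch.nn as nn

attn = torch.softmax(torch.randn(8, 4, 50, 50), dim=-1)  # [batch, heads, T, T]

# Mean over the head dimension: one fused T x T attention map per example.
head_mean = attn.mean(dim=1)                              # [batch, T, T]

# nn.AdaptiveAvgPool2d((1, 1)) instead averages over the two spatial dimensions,
# collapsing each head's T x T map to a single scalar, which is a different operation.
pooled = nn.AdaptiveAvgPool2d((1, 1))(attn)               # [batch, heads, 1, 1]
```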

hemarathore commented 3 years ago

Hi sir, did you find any solution? He is only segmenting the whole dataset, i.e. around 10033 files, into an 80/20 ratio, I think. Also, I don't understand the validation part (model.eval()): how is the batch size changing every time? Say the last audio sample in validation is 5 seconds long; then there would be 80000 samples and only 3 segmented frames, right? Then how does it become a batch size of 12 and so on? Also, at the end, as per the paper, he is averaging all results; how can someone verify that?
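
For reference, a back-of-the-envelope sketch of the segmentation arithmetic behind this question; the segment and hop lengths below are assumptions for illustration, not taken from the repository:

```python
# Illustrative arithmetic only: segment and hop lengths are assumed, not from the repo.
SAMPLE_RATE = 16_000

def num_segments(duration_s, segment_s=2.0, hop_s=1.0):
    """Number of fixed-length segments cut from one utterance."""
    total = int(duration_s * SAMPLE_RATE)   # 5 s -> 80000 samples
    seg = int(segment_s * SAMPLE_RATE)
    hop = int(hop_s * SAMPLE_RATE)
    if total < seg:
        return 1                            # pad short utterances to a single segment
    return 1 + (total - seg) // hop

print(num_segments(5.0))  # 4 segments for a 5-second clip under these assumed settings
```

Because the number of segments depends on each utterance's length, the last validation batch can contain a different number of segments than the others, which may be what looks like a changing batch size.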

hemarathore commented 3 years ago

@wsstriving Sir, did you get your results? Please guide.

hemarathore commented 3 years ago

Also, another question to mk: can we always use torch.mul instead of torch.matmul (as given in the documentation) for calculating the attention score between query and key? Please, anyone, reply. Regards

rishikeshraj5 commented 3 years ago

@hemarathore Yes, you are right. He is using torch.mul, which is incorrect; it should be torch.matmul. There are a few other mistakes in my opinion: for the linear projections he is using a conv layer, and with bias, which does not match the theory of query, key, and value.
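
For clarity, a minimal sketch of the difference, assuming standard scaled dot-product attention; the shapes and variable names are illustrative:

```python
# torch.matmul contracts the feature dimension to give pairwise query-key scores;
# torch.mul is a purely element-wise product and does not.
import math
import torch

q = torch.randn(8, 4, 50, 64)   # [batch, heads, T, d_k] (illustrative shapes)
k = torch.randn(8, 4, 50, 64)

# Scaled dot-product scores: every query attends to every key -> [batch, heads, T, T]
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
weights = torch.softmax(scores, dim=-1)

# torch.mul(q, k) only gives an element-wise product of shape [batch, heads, T, d_k]:
# no interaction between different time steps, so it cannot produce attention scores.
elementwise = torch.mul(q, k)
```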

guiandrade2 commented 3 years ago

@hemarathore Were you able to reproduce the results? I can't reproduce the results from this repository or from the more recent version of the model at https://github.com/lessonxmk/Optimized_attention_for_SER. I am not using the code from these repositories, though; I have my own, but I am never able to surpass 65-70%.

hemarathore commented 3 years ago

@guiandrade2 Hey, can we connect over email please? As far as your questions are concerned, yes, I have tried this code and also modified it, but I can't discuss it here since that is part of my ongoing research. Have you implemented the second code you mentioned?

Since you say you are doing it on your own: how are you segmenting the data, on the whole dataset or on the subsets individually? I think that will make the difference.