braveheartwithlove opened this issue 4 years ago
Hi, I think what you are doing is one of the correct ways to solve the problem.
Generally speaking, since we have FC layers in our model, the number of neurons is tied to the input feature size. For example, for MFCC the feature is [24×3, n_frames], where 24×3 comes from selecting 24 cepstral coefficients plus their 1st- and 2nd-order derivatives. n_frames comes from the short-time frequency analysis used in MFCC (and other features) and is determined by the sample length, window size, window overlap, etc. This can be changed at the feature extraction stage; you can refer to the librosa documentation for details.
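For reference, here is a minimal sketch (my assumption of the workflow, not necessarily the repo's exact extraction code) of how a 24-coefficient MFCC plus its derivatives would be stacked with librosa, and where n_frames comes from. The file name and the window parameters (`n_fft=2048`, `hop_length=512`) are illustrative:

```python
import librosa
import numpy as np

# hypothetical input file; ASVspoof audio is 16 kHz
y, sr = librosa.load("example.flac", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24,
                            n_fft=2048, hop_length=512)   # shape: (24, n_frames)
d1 = librosa.feature.delta(mfcc, order=1)                 # 1st-order derivatives
d2 = librosa.feature.delta(mfcc, order=2)                 # 2nd-order derivatives
feat = np.concatenate([mfcc, d1, d2], axis=0)             # shape: (24*3, n_frames)

# n_frames depends on the sample length, n_fft (window size) and hop_length
# (window overlap), so changing them at extraction time changes the feature width.
print(feat.shape)
```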
This mismatch problem probably happens because librosa has been updated and its default frame size has changed. Or it might happen when we modify other ResNets and change the structure of the residual blocks.
So there are a few things you can try:
1. When computing CQCC or MFCC, use a smaller window length or a higher overlap across windows so that you produce longer features.
2. Increase the number of cepstral coefficients you take. For example, MFCC currently uses 24; you could incorporate more, e.g. 30.
3. Change the number of residual blocks, or change the stride in the convolutional layers (see the sketch after this list). Stride=3 means you shrink the feature 3x per block; if the raw MFCC feature is relatively small in dimension, you could try not using the stride.
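A hedged sketch of point 3 (this is not the repo's actual `models.py`, just an assumed residual block with a configurable stride) showing how stride changes the feature-map size per block:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Toy residual block; the first conv takes a configurable stride."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 conv on the shortcut so it matches the (possibly strided) main path
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

# With stride=1 a 72 x 400 feature map keeps its size; with stride=3 it shrinks ~3x.
x = torch.randn(1, 1, 72, 400)
print(ResBlock(1, 32, stride=1)(x).shape)   # torch.Size([1, 32, 72, 400])
print(ResBlock(1, 32, stride=3)(x).shape)   # torch.Size([1, 32, 24, 134])
```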
After you try any of these, you should also adjust the size of fc1. An ideal case would be to have an (n, 128) layer where n is a few hundred (e.g. 480). Sorry for the late reply. Hope this helps. Currently the model structure design is mostly heuristic, so it might require some trial and error.
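One way to avoid re-deriving that number by hand every time the features or strides change (this is an assumption about the workflow, not the authors' code) is to run a dummy forward pass through the convolutional part once and build fc1 from the measured flattened size. The `conv_layers` stack below is a stand-in for whatever precedes fc1 in the real model:

```python
import torch
import torch.nn as nn

def infer_fc1(conv_layers, feat_dim, n_frames, hidden=128):
    """Build fc1 = nn.Linear(n, hidden) where n is measured, not hard-coded."""
    with torch.no_grad():
        dummy = torch.zeros(1, 1, feat_dim, n_frames)   # (batch, channel, H, W)
        n = conv_layers(dummy).flatten(1).shape[1]      # flattened feature size
    return nn.Linear(n, hidden)

# Usage sketch: feat_dim / n_frames should match the features you actually extracted.
conv_layers = nn.Sequential(nn.Conv2d(1, 32, 3, stride=3, padding=1),
                            nn.ReLU(),
                            nn.AdaptiveAvgPool2d((4, 4)))
fc1 = infer_fc1(conv_layers, feat_dim=72, n_frames=400)
print(fc1)   # Linear(in_features=512, out_features=128, bias=True)
```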
@braveheartwithlove I solved this problem with your method. But after this procedure, when I ran fuse_result.py, I got an error saying it can't decode the input files. The same error happened for both .npy and .pth files. Have you encountered such an error? Or what kind of files should be used as input to fuse_result.py?
@wangziqi000 Ziqi, thanks for your detailed explanations! I will do some research based on your suggestions and see how it goes. Once again, thanks!
@wangziqi000 I was wondering whether you and your team settled on batch_size = 32 after experimenting with other batch sizes and finding that this value gave the best results?
@ddave25 We tried batch_size=64 and it was not better than 32. For a larger batch size you will also have to take GPU memory into consideration.
@malzantot @wangziqi000 Thanks for sharing this repo! I tried to reproduce your results but ran into some errors during training. For the MFCC model, using your same feature extraction code and ResNet setup, I got a size mismatch error pointing to `self.fc1 = nn.Linear(480, 128)` (https://github.com/nesl/asvspoof2019/blob/master/models.py#L54). For the CQCC model, I got a similar runtime error: size mismatch, m1: [32 x 32], m2: [64 x 128].
I changed this line to `self.fc1 = nn.Linear(32, 128)` and both models then pass. I am wondering if my modification is correct, and why you chose that setup in your architecture: `self.fc1 = nn.Linear(480, 128)` for the MFCC model and `self.fc1 = nn.Linear(64, 128)` for the CQCC model.
Thanks!