microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[wavlm-large] Abnormal layer results from wavlm_large model #1426

Open Kanraaaaa opened 8 months ago

Kanraaaaa commented 8 months ago

Hi, thanks for sharing the pre-trained models. I have run into the following problem: I followed the sample code on this page: https://github.com/microsoft/unilm/tree/master/wavlm , but I got abnormal layer results with WavLM-Large.pt.

import torch
from WavLM import WavLM, WavLMConfig

# load the pre-trained checkpoints
checkpoint = torch.load('WavLM-Large.pt')
cfg = WavLMConfig(checkpoint['cfg'])
model = WavLM(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()
model.cuda()

# extract the representation of each layer
wav_input_16khz = torch.randn(1,10000).cuda()
if cfg.normalize:
    wav_input_16khz = torch.nn.functional.layer_norm(wav_input_16khz, wav_input_16khz.shape)
rep, layer_results = model.extract_features(wav_input_16khz, output_layer=model.cfg.encoder_layers, ret_layer_results=True)[0]
layer_reps = [x.transpose(0, 1) for x, _ in layer_results]
reps = torch.cat(layer_reps)
print(reps.shape, reps.max(), reps.min())
# torch.Size([25, 31, 1024]) tensor(3.2962e+37, device='cuda:0', grad_fn=<MaxBackward1>) tensor(-2.1879e+36, device='cuda:0', grad_fn=<MinBackward1>)

When I run inference on CPU, the results of the last 2 layers are always NaN. When I run inference on GPU, the max value of layer_results is 3.4342e+37.
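One possible reading of these two symptoms, offered as an assumption rather than a diagnosis: 3.4342e+37 is within about a factor of 10 of the largest finite float32 value (about 3.4028e+38), so a slightly larger intermediate value would overflow to inf, and inf minus inf gives NaN. The sketch below uses only the Python standard library to illustrate that arithmetic; `to_float32` is a hypothetical helper written for this example, not part of WavLM or PyTorch.

```python
import math
import struct

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def to_float32(x: float) -> float:
    """Round a Python float to float32; out-of-range values become +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct refuses to pack finite values beyond the float32 range,
        # whereas float32 arithmetic would produce inf at this point
        return math.copysign(math.inf, x)


observed = 3.4342e37                 # max value reported above for GPU inference
headroom = FLOAT32_MAX / observed    # remaining factor before overflow

a = to_float32(observed * 9.0)       # still finite in float32
b = to_float32(observed * 10.0)      # beyond the float32 range, becomes inf
c = b - b                            # inf - inf is nan

print(f"headroom: {headroom:.2f}x")
print(a, b, c)
```

If this reading applies, the GPU and CPU runs would be showing the same underlying issue at slightly different magnitudes, but that has not been verified here.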

ddlBoJack commented 8 months ago

Hi, I have the same problem with WavLM Large. Have you solved it?

Kanraaaaa commented 8 months ago

> Hi, I have the same problem with WavLM Large. Have you solved it?

Hi, I haven't solved this problem :( I tried different torch versions (including 1.12, 1.13, 2.0.1, ...) and different platforms such as Windows and Linux. The problem still exists.

However, I got relatively reasonable results by loading the Hugging Face model from https://huggingface.co/microsoft/wavlm-large.

P.S. I am not sure whether values around 600 and -100 are actually reasonable. The max and min values extracted from wavlm-base+ are within about ±5.

import torch
from transformers import AutoModel
wavlm = AutoModel.from_pretrained('pretrained/large_hf')

wav_input_16khz = torch.randn(1,10000)

with torch.no_grad():
    wav_embeddings = wavlm(input_values=wav_input_16khz, output_hidden_states=True).hidden_states

rep = torch.cat(wav_embeddings)
print(rep.shape, rep.max(), rep.min())
# gpu: torch.Size([25, 31, 1024]) tensor(608.1801, device='cuda:0') tensor(-123.8610, device='cuda:0')
# cpu: torch.Size([25, 31, 1024]) tensor(602.7969) tensor(-124.4809)

# same check with wavlm-base-plus for comparison
import torch
from transformers import AutoModel
wavlm = AutoModel.from_pretrained('pretrained/wavlm-base-plus').cuda()

wav_input_16khz = torch.randn(1,10000).cuda()

with torch.no_grad():
    wav_embeddings = wavlm(input_values=wav_input_16khz, output_hidden_states=True).hidden_states

rep = torch.cat(wav_embeddings)
print(rep.shape, rep.max(), rep.min())
# cpu: torch.Size([13, 31, 768]) tensor(3.1511) tensor(-4.5721)
# gpu: torch.Size([13, 31, 768]) tensor(3.2018, device='cuda:0') tensor(-4.5899, device='cuda:0')
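
The prints above report a single max and min over all layers concatenated, which hides which layer the extreme values come from. A per-layer breakdown might help narrow it down. The following is a minimal sketch using only the standard library; `layer_stats`, `first_bad_layer`, and the `limit` threshold of 1e4 are hypothetical names and values chosen for illustration, and the toy numbers only loosely echo the magnitudes reported in this thread.

```python
import math
from typing import Iterable, List, Optional, Tuple

Stat = Tuple[int, float, float, bool]  # (layer index, max, min, all values finite)


def layer_stats(layers: Iterable[Iterable[float]]) -> List[Stat]:
    """Compute max/min over the finite values of each layer."""
    stats = []
    for idx, values in enumerate(layers):
        values = list(values)
        finite = [v for v in values if math.isfinite(v)]
        hi = max(finite) if finite else math.nan
        lo = min(finite) if finite else math.nan
        stats.append((idx, hi, lo, len(finite) == len(values)))
    return stats


def first_bad_layer(stats: List[Stat], limit: float = 1e4) -> Optional[int]:
    """Index of the first layer with NaN/inf or a magnitude above `limit`."""
    for idx, hi, lo, all_finite in stats:
        if not all_finite or max(abs(hi), abs(lo)) > limit:
            return idx
    return None


# Toy data standing in for flattened per-layer activations.
toy = [
    [0.5, -1.2, 3.1],         # well-behaved layer
    [608.2, -123.9, 4.0],     # large but finite
    [3.3e37, -2.2e36, 1.0],   # close to the float32 limit
    [math.nan, 1.0, -1.0],    # contains NaN
]
stats = layer_stats(toy)
for row in stats:
    print(row)
print("first suspicious layer:", first_bad_layer(stats))
```

With real tensors, each layer could be passed in as `x.detach().flatten().tolist()`, for example over `layer_reps` from the first snippet or `wav_embeddings` from the Hugging Face snippets.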