Open simonmandlik opened 1 month ago
Hi, I have similar problem to https://github.com/microsoft/CLAP/issues/24, but I'm using shorter audio than 6 seconds.
MWE:
```python
from msclap import CLAP
import torch
import subprocess

with torch.no_grad():
    clap_model = CLAP(version="2023", use_cuda=False)
    f = "/home/simon.mandlik/test.wav"
    audio_embeddings_1 = clap_model.get_audio_embeddings([f])
    audio_embeddings_2 = clap_model.get_audio_embeddings([f])
    print(audio_embeddings_1)
    print(audio_embeddings_2)
    mse = torch.mean((audio_embeddings_1 - audio_embeddings_2) ** 2)
    print(mse)
    print(subprocess.check_output(['ffprobe', f, '-hide_banner']))
    print(clap_model.args)
```
Output:
```
tensor([[-1.5895, -0.9305,  0.0572,  ...,  1.6071, -0.0361,  0.6508]])
tensor([[-1.5228, -1.0532,  0.0794,  ...,  1.6698, -0.0152,  0.4471]])
tensor(0.0190)
Input #0, wav, from '/home/simon.mandlik/test.wav':
  Metadata:
    encoder         : Lavf61.1.100
  Duration: 00:00:06.00, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s
b''
Namespace(text_model='gpt2', text_len=77, transformer_embed_dim=768, freeze_text_encoder_weights=True, audioenc_name='HTSAT', out_emb=768, sampling_rate=44100, duration=7, fmin=50, fmax=8000, n_fft=1024, hop_size=320, mel_bins=64, window_size=1024, d_proj=1024, temperature=0.003, num_classes=527, batch_size=1024, demo=False)
```
I found the cause. The culprit is this line: my audio has two channels, and that line counts samples across both channels, doubling the apparent "length" from 6 s to 12 s.
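Here is a minimal sketch of the miscount, independent of CLAP (the array shapes and variable names are my illustration, not the library's actual code). Once the apparent length exceeds the configured `duration=7`, the same nondeterministic cropping behavior as in #24 would presumably kick in, which would explain why two calls on the same file give different embeddings.

```python
import numpy as np

# 6 s of silent stereo audio at 48 kHz, shape (num_samples, channels)
sample_rate = 48000
audio = np.zeros((sample_rate * 6, 2), dtype=np.float32)

# Buggy "length": total element count conflates samples with samples * channels
buggy_len = audio.size          # 576000 samples -> looks like 12 s
# Correct length: samples per channel
correct_len = audio.shape[0]    # 288000 samples -> 6 s

print(buggy_len / sample_rate)    # 12.0 -> exceeds duration=7
print(correct_len / sample_rate)  # 6.0  -> fits within duration=7
```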