microsoft / CLAP

Learning audio concepts from natural language supervision

Embeddings are non-deterministic even for durations < 7s #42

Open simonmandlik opened 1 month ago

simonmandlik commented 1 month ago

Hi, I have a similar problem to https://github.com/microsoft/CLAP/issues/24, but my audio is only 6 seconds long, i.e. shorter than the model's 7-second duration.

MWE:

from msclap import CLAP
import torch
import subprocess

with torch.no_grad():
    clap_model = CLAP(version="2023", use_cuda=False)

    f = "/home/simon.mandlik/test.wav"

    # Embed the same file twice; the two results should be identical.
    audio_embeddings_1 = clap_model.get_audio_embeddings([f])
    audio_embeddings_2 = clap_model.get_audio_embeddings([f])

    print(audio_embeddings_1)
    print(audio_embeddings_2)

    # Mean squared difference between the two runs (expected to be 0).
    mse = torch.mean((audio_embeddings_1 - audio_embeddings_2) ** 2)
    print(mse)
    print(subprocess.check_output(['ffprobe', f, '-hide_banner']))
    print(clap_model.args)

Output:

tensor([[-1.5895, -0.9305,  0.0572,  ...,  1.6071, -0.0361,  0.6508]])
tensor([[-1.5228, -1.0532,  0.0794,  ...,  1.6698, -0.0152,  0.4471]])
tensor(0.0190)
Input #0, wav, from '/home/simon.mandlik/test.wav':
  Metadata:
    encoder         : Lavf61.1.100
  Duration: 00:00:06.00, bitrate: 1536 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, 2 channels, s16, 1536 kb/s
b''
Namespace(text_model='gpt2', text_len=77, transformer_embed_dim=768, freeze_text_encoder_weights=True, audioenc_name='HTSAT', out_emb=768, sampling_rate=44100, duration=7, fmin=50, fmax=8000, n_fft=1024, hop_size=320, mel_bins=64, window_size=1024, d_proj=1024, temperature=0.003, num_classes=527, batch_size=1024, demo=False)
simonmandlik commented 1 month ago

I found why: the culprit is this line. My audio is dual-channel, and that line doubles the computed "length" from 6 s to 12 s, so the clip is treated as longer than the 7 s duration and falls into the same non-deterministic long-audio handling as #24.
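
For anyone hitting the same thing: the doubling presumably comes from the channel-first waveform being flattened into a single vector before the duration check, so a 6 s stereo clip looks like 12 s worth of samples. A quick illustration (torch only, no msclap needed):

import torch

sr = 48_000
stereo = torch.zeros(2, 6 * sr)   # 6 s of 2-channel audio at 48 kHz, shape (2, 288000)
flat = stereo.reshape(-1)         # flattening merges the channel axis into the time axis
print(flat.shape[0] / sr)         # 12.0 -> the loader now "sees" 12 s of audio

Until the loader handles multi-channel files, downmixing to mono before calling get_audio_embeddings should keep the computed duration at the real 6 s and avoid the long-audio path. A minimal workaround sketch, assuming soundfile is installed; the embed_mono helper is my own, not part of msclap:

import os, tempfile

import soundfile as sf
import torch
from msclap import CLAP

def embed_mono(clap_model, path):
    """Downmix a (possibly stereo) wav to mono and embed the mono copy."""
    data, sr = sf.read(path)              # stereo files come back as (frames, channels)
    if data.ndim > 1:
        data = data.mean(axis=1)          # average the channels into a single track
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    try:
        sf.write(tmp.name, data, sr)      # mono copy at the original sample rate
        return clap_model.get_audio_embeddings([tmp.name])
    finally:
        tmp.close()
        os.remove(tmp.name)

with torch.no_grad():
    clap_model = CLAP(version="2023", use_cuda=False)
    e1 = embed_mono(clap_model, "/home/simon.mandlik/test.wav")
    e2 = embed_mono(clap_model, "/home/simon.mandlik/test.wav")
    print(torch.mean((e1 - e2) ** 2))     # expected to be 0 once cropping no longer triggers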