sakshamsingh1 opened this issue 2 years ago
Dear @sakshamsingh1, Thanks for your interest!
Yes, as you mentioned, this code is the old version. To use both image and text for pre-training, you can download the whole raw videos (image, audio, text) with yt_dlp. After that, we apply the script below:
import torch

# The i-th audio clip matches the i-th text/image, so the contrastive targets are the diagonal indices.
label = torch.arange(audio_embedding.size(0), device=audio_embedding.device)
projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)
ce = torch.nn.CrossEntropyLoss()
# Symmetric cross-entropy over both directions of each similarity matrix.
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
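For the download step, a minimal sketch using yt_dlp's Python API could look like the following; the format selection, output template, and subtitle options here are placeholders, not the exact settings we used:

from yt_dlp import YoutubeDL

# Placeholder options: fetch the video (image frames + audio track) and any
# subtitles/auto-captions to use as the text modality.
ydl_opts = {
    "format": "mp4",
    "outtmpl": "raw_videos/%(id)s.%(ext)s",
    "writesubtitles": True,
    "writeautomaticsub": True,
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=<video_id>"])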
Thanks
Thanks @lsh3163 for the quick response. This makes sense.
I am very interested in your work and actively looking into it. Do you plan to push the newer code?
Thanks
Yes, I plan to update this code later, but I am not sure when that will be. So if you have any questions about the complete code, feel free to ask me.
Thanks
Great, Thanks for being so helpful. I have some questions:
Which is the latest pre-trained audio encoder?
Can you provide the code for zero-shot audio classification on the ESC-50 and US-8K datasets?
Thanks
Dear @sakshamsingh1, this is my answer.
Thanks @lsh3163. That would be great!!
In particular, I am interested in knowing how you pre-process audio before feeding it into the audio encoder (to get audio embeddings).
Also looking forward to the pre-processing part!
@Allencheng97 @sakshamsingh1 Thanks for your interest. I think it would be good to refer to the code below.
import random
import cv2
import librosa
import numpy as np

n_mels = 128
time_length = 864
resize_resolution = 512

# Load the waveform and convert it to a log-mel spectrogram normalized to [0, 1].
y, sr = librosa.load(wav_name, sr=44100)
audio_inputs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
audio_inputs = librosa.power_to_db(audio_inputs, ref=np.max) / 80.0 + 1

# Randomly crop long spectrograms to time_length; zero-pad short ones.
h, w = audio_inputs.shape
if w >= time_length:
    j = random.randint(0, w - time_length)
    audio_inputs = audio_inputs[:, j:j + time_length]
else:
    zero = np.zeros((n_mels, time_length))
    zero[:, :w] = audio_inputs[:, :w]
    audio_inputs = zero

# Resize to the resolution expected by the audio encoder.
audio_inputs = cv2.resize(audio_inputs, (n_mels, resize_resolution))
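To feed the resulting spectrogram into the audio encoder below, it still has to be converted to a tensor; a minimal sketch, where the exact shape and device the encoder expects are assumptions rather than something taken from the repository:

import torch

# Assumed layout: a (batch, channel, height, width) float tensor on the same
# device as the encoder; adjust to whatever audio_encoder actually expects.
audio_inputs = torch.from_numpy(audio_inputs).float().unsqueeze(0).unsqueeze(0).to(device)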
This is the zero-shot audio-classification evaluation code.
import math
import torch
import clip

with torch.no_grad():
    # Encode the class names with the frozen CLIP text encoder and L2-normalize.
    text_tokens = torch.cat([clip.tokenize(text) for text in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    audio_embedding = audio_encoder(audio_inputs)
    audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
    # Scaled similarity logits; the class with the highest logit is the prediction.
    proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
    label_idx = torch.argmax(proj_per_audio, dim=1)
    pred_category = [labels[i] for i in label_idx.tolist()]
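If it helps, here is a rough sketch of wrapping the snippet above in an accuracy loop over ESC-50; the esc50_loader and the prompt template are hypothetical placeholders, not the repository's actual evaluation script:

correct, total = 0, 0
with torch.no_grad():
    # Hypothetical prompt template applied to the 50 ESC-50 class names.
    text_tokens = torch.cat([clip.tokenize(f"This is a sound of {c}.") for c in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    for audio_inputs, target in esc50_loader:  # hypothetical DataLoader of (spectrogram, class index)
        audio_embedding = audio_encoder(audio_inputs.to(device))
        audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
        pred = (audio_embedding @ text_embedding.T).argmax(dim=1).cpu()
        correct += (pred == target).sum().item()
        total += target.numel()
print(f"Zero-shot ESC-50 accuracy: {correct / total:.4f}")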
Hi, Thanks for the great work!!
The paper states that during part-1 training (i.e., the CLIP-based Contrastive Latent Representation Learning step) you consider the image, text, and audio modalities, but the code only uses the audio and text modalities for this training part.
Is this an old version of the code, or did I misinterpret the training part in the paper? Thanks