kuai-lab / sound-guided-semantic-image-manipulation

Sound-guided Semantic Image Manipulation - Official Pytorch Code (CVPR 2022)

Does part 1 of training (CLIP-based training) include the image modality? #9

Open sakshamsingh1 opened 2 years ago

sakshamsingh1 commented 2 years ago

Hi, thanks for the great work!!

The paper states that during part-1 training (i.e. the CLIP-based Contrastive Latent Representation Learning step) you consider the image, text, and audio modalities. However, the code only uses the audio and text modalities for this training step.

Is this an older version of the code, or did I misinterpret the training part of the paper? Thanks

lsh3163 commented 2 years ago

Dear @sakshamsingh1, Thanks for your interest!

Yes, as you mentioned, this code is the old version. To use both image and text for pre-training, you could download the whole raw videos (image, audio, text) with yt_dlp. After that, we employ the script below:

# audio_embedding, text_embedding, image_embedding: batch embeddings for paired samples
# scale_constant1 / scale_constant2: temperature-like scaling factors for the logits
projection_audio_text = scale_constant1 * (audio_embedding @ text_embedding.T)
projection_audio_image = scale_constant2 * (audio_embedding @ image_embedding.T)

# label pairs the i-th audio with the i-th text/image; the loss is symmetric cross-entropy, as in CLIP
ce = torch.nn.CrossEntropyLoss()
text_contrastive_loss = ce(projection_audio_text, label) + ce(projection_audio_text.T, label)
image_contrastive_loss = ce(projection_audio_image, label) + ce(projection_audio_image.T, label)
loss = text_contrastive_loss + image_contrastive_loss
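
For completeness, here is a rough, self-contained sketch of how the inputs to the snippet above could be produced; the normalization, batch pairing, and scale values here are assumptions rather than the exact training code:

import torch
import torch.nn.functional as F

# Hypothetical batch of paired (audio, text, image) samples.
audio_embedding = F.normalize(audio_encoder(audio_batch), dim=-1)                    # (B, D)
text_embedding  = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)   # (B, D)
image_embedding = F.normalize(clip_model.encode_image(image_batch).float(), dim=-1)  # (B, D)

# The i-th audio is paired with the i-th text/image, so the contrastive targets are simply 0..B-1.
label = torch.arange(audio_embedding.size(0), device=audio_embedding.device)

scale_constant1 = scale_constant2 = 100.0  # assumed temperature-like scaling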

Thanks

sakshamsingh1 commented 2 years ago

Thanks @lsh3163 for the quick response. This makes sense.

I am very interested in your work and actively looking into it. Do you plan to push the newer code?

Thanks

lsh3163 commented 2 years ago

Yes, I plan to update this code later, but I am not sure when that will be. So if you have any questions about the complete code, feel free to ask me.

Thanks

sakshamsingh1 commented 2 years ago

Great, thanks for being so helpful. I have some questions:

  1. Which is the latest pre-trained audio encoder?

    • resnet18 in the pre-trained folder: here
    • resnet18_57 provided in the README link: here
    • Or none of these (assuming this is older code).
  2. Can you provide the code for zero-shot audio classification over the ESC-50 and US-8K (UrbanSound8K) datasets?

Thanks

lsh3163 commented 2 years ago

Dear @sakshamsingh1, here are my answers.

  1. resnet18 is the newer one!
  2. Yes, I can provide it, but the code has changed due to an extension of the work, so it needs to be modified before I can share the original version. However, I'll attach the main code to this thread. Thanks. :)

sakshamsingh1 commented 2 years ago

Thanks @lsh3163. That would be great!!

In particular, I am interested in knowing how you pre-process the audio before feeding it into the audio encoder (to get the audio embeddings).

Allencheng97 commented 2 years ago

Also looking forward to the pre-processing part!

lsh3163 commented 2 years ago

@Allencheng97 @sakshamsingh1 Thanks for your interest. I think it would be good to refer to the code below.

import random

import cv2
import librosa
import numpy as np

n_mels = 128             # number of mel bands (spectrogram height)
time_length = 864        # target number of time frames
resize_resolution = 512  # resolution the spectrogram is resized to before the encoder

# Load the waveform and compute a log-mel spectrogram scaled to roughly [0, 1].
y, sr = librosa.load(wav_name, sr=44100)
audio_inputs = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
audio_inputs = librosa.power_to_db(audio_inputs, ref=np.max) / 80.0 + 1

# Crop a random window of time_length frames, or zero-pad if the clip is too short.
zero = np.zeros((n_mels, time_length))
h, w = audio_inputs.shape
if w >= time_length:
    j = random.randint(0, w - time_length)
    audio_inputs = audio_inputs[:, j:j + time_length]
else:
    zero[:, :w] = audio_inputs[:, :w]
    audio_inputs = zero

# Resize to the fixed resolution expected by the audio encoder.
audio_inputs = cv2.resize(audio_inputs, (n_mels, resize_resolution))
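
For reference, a minimal sketch of how the resulting spectrogram can be fed to the audio encoder; the (1, 1, H, W) input convention is an assumption, not the exact repo code:

import torch

# Assumed convention: the encoder takes a (batch, channel, H, W) float tensor.
audio_tensor = torch.from_numpy(audio_inputs).float().unsqueeze(0).unsqueeze(0).to(device)
audio_embedding = audio_encoder(audio_tensor)
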
lsh3163 commented 2 years ago

Here is the zero-shot audio classification evaluation code.

import math
import clip
import torch

with torch.no_grad():
    # Encode each class name with the CLIP text encoder.
    text_tokens = torch.cat([clip.tokenize(text) for text in labels])
    text_embedding = clip_model.encode_text(text_tokens.to(device)).float()
    # Encode the audio and L2-normalize the embedding.
    audio_embedding = audio_encoder(audio_inputs)
    audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
    # Pick the class whose text embedding is most similar to the audio embedding.
    proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
    label_idx = torch.argmax(proj_per_audio, dim=1)
    pred_category = labels[label_idx.item()]
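
To get an accuracy number over ESC-50 or US-8K, the snippet above can be wrapped in a loop over the dataset; a minimal sketch, where the data loader and variable names are placeholders rather than the repo's code:

correct, total = 0, 0
with torch.no_grad():
    text_tokens = torch.cat([clip.tokenize(text) for text in labels]).to(device)
    text_embedding = clip_model.encode_text(text_tokens).float()
    for audio_inputs, target_idx in dataloader:  # placeholder: yields (spectrogram batch, class indices)
        audio_embedding = audio_encoder(audio_inputs.to(device))
        audio_embedding = audio_embedding / audio_embedding.norm(dim=-1, keepdim=True)
        proj_per_audio = (audio_embedding @ text_embedding.T) * math.exp(0.07)
        pred_idx = proj_per_audio.argmax(dim=1).cpu()
        correct += (pred_idx == target_idx).sum().item()
        total += target_idx.numel()
print(f"Zero-shot accuracy: {correct / total:.4f}")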