ryohachiuma opened this issue 1 year ago
Hi, thank you very much for your interest in our work. I added a quick sanity check just to show that the extraction of latent concepts is working, but forgot to remove it. If you want to keep it in to verify the extraction process, you can download the ground-truth labels here (https://drive.google.com/file/d/1i2M7Jr1C-sJ5Pn5PKSX7zcEr0hRp2dEr/view?usp=drive_link) and pass in the path here: https://github.com/rxtan2/AVSeT/blob/main/models/clip_latent_models.py#L468 . However, it is only a sanity check, so feel free to delete it.
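For reference, loading that labels file might look like the sketch below. This is only an illustration: the `load_gt_labels` helper and the dict layout of the file are assumptions, not code from the repo.

```python
import numpy as np
import os, tempfile

def load_gt_labels(path):
    """Load the ground-truth latent-concept labels for the sanity check.

    allow_pickle=True is needed if the file stores a Python object
    (e.g. a dict mapping sample ids to labels) rather than a plain array.
    """
    return np.load(path, allow_pickle=True)

# Quick demonstration with a stand-in file (the real file comes from the
# Google Drive link above; its exact layout may differ):
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "gt_labels.npy")
    np.save(p, {"sample_0": "dog barking", "sample_1": "violin"})
    labels = load_gt_labels(p).item()  # .item() unwraps the 0-d object array
    print(labels["sample_0"])  # -> dog barking
```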
Hi rxtan2,
Thank you for your answer. I will use the npy file you provided.
Also, in the training script, `unet7` is specified for the audio separator. However, there is no function that returns visually-conditioned/text-conditioned spectrogram masks:
https://github.com/rxtan2/AVSeT/blob/main/bottleneck_bimodal_mask_cyclic_combined.py#L105
Perhaps you used `BimodalDistillAudioVisual7layerUNet` instead of `unet7`?
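If it is just the arch-name lookup that diverged, the fix is presumably a one-line change in whatever maps the arch string to a class. A minimal sketch of that kind of dispatch, where the registry dict, the class stubs, and `build_sound_net` are all hypothetical stand-ins rather than the repo's actual code:

```python
# Stubs standing in for the two candidate separator networks.
class Unet7:
    """Plain audio-only 7-layer U-Net (stub)."""

class BimodalDistillAudioVisual7layerUNet:
    """Visually/text-conditioned variant that produces conditioned masks (stub)."""

# Hypothetical registry resolving the arch string from the training script.
# Pointing 'unet7' at the bimodal class would reproduce the suspected setup.
ARCH_REGISTRY = {
    "unet7": BimodalDistillAudioVisual7layerUNet,
}

def build_sound_net(arch: str):
    try:
        return ARCH_REGISTRY[arch]()
    except KeyError:
        raise ValueError(f"Unknown sound arch: {arch!r}")

net = build_sound_net("unet7")
print(type(net).__name__)  # -> BimodalDistillAudioVisual7layerUNet
```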
Hi rxtan2,
I have one minor question.
During training, the latent concepts are used to train the model instead of the text labels. However, as far as I understand from the code, the latent captions are also used during evaluation. Is that correct?
Hi ryohachiuma,
Sorry for the delay. Regarding your penultimate question, that should be right. You can try using the latent captions during inference, but I mainly experiment with actual language queries. As such, if you are using complex visual scenes and have the computational capacity, it may be worth trying more latent queries. I will find the cycles to make the code more easily understandable and to remove the sanity check from the extraction script. :)
Hi,
Thank you for publishing the code in public.
Could you tell me what the `tmp` variable at the following line is? https://github.com/rxtan2/AVSeT/blob/main/models/clip_latent_models.py#L477
I tried to obtain the latent features but hit an error because `tmp` is undefined.
Thank you in advance.
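For anyone who hits the same undefined-`tmp` error before the sanity check is removed upstream, one workaround is to make the check conditional on the labels path, so `tmp` is only created and used when the ground-truth file is actually supplied. This is a sketch only; the function name, arguments, and surrounding logic are assumptions, not the repo's code.

```python
import numpy as np

def extract_latent_concepts(features, gt_labels_path=None):
    """Extract latent concepts, with an optional ground-truth sanity check."""
    # ... the actual latent-concept extraction would happen here ...
    concepts = features  # stand-in for the extracted concepts

    # Optional sanity check: `tmp` is only defined (and used) when the
    # ground-truth labels file from the Google Drive link is supplied,
    # so the undefined-variable error cannot occur.
    if gt_labels_path is not None:
        tmp = np.load(gt_labels_path, allow_pickle=True)
        assert len(tmp) == len(concepts), "label/concept count mismatch"

    return concepts

print(len(extract_latent_concepts([0.1, 0.2, 0.3])))  # -> 3
```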