rxtan2 / AVSeT


the variable "tmp" is not defined #3

Open ryohachiuma opened 1 year ago

ryohachiuma commented 1 year ago

Hi,

Thank you for making the code public.

Could you tell me what this tmp variable is at the following line? https://github.com/rxtan2/AVSeT/blob/main/models/clip_latent_models.py#L477

I tried to extract the latent features but hit an undefined-variable error.

Thank you in advance.

rxtan2 commented 1 year ago

Hi, thank you very much for your interest in our work. I added a quick sanity check to show that the extraction of latent concepts works, but forgot to remove it. If you want to keep it to verify the extraction process, you can download the ground-truth labels here (https://drive.google.com/file/d/1i2M7Jr1C-sJ5Pn5PKSX7zcEr0hRp2dEr/view?usp=drive_link) and pass in the path here: https://github.com/rxtan2/AVSeT/blob/main/models/clip_latent_models.py#L468 . However, it is not needed for anything other than that check, so feel free to delete it.
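
For readers who want to keep the check but not depend on the label file, one option is to make it optional. The following is only a sketch and is not taken from the repository: the function names `maybe_load_labels` and `label_agreement`, and the idea of comparing predictions to labels element by element, are assumptions made for illustration. The actual check in clip_latent_models.py may look quite different.

```python
import os


def maybe_load_labels(label_path=None):
    """Return ground-truth labels from a .npy file, or None if no file is available.

    Hypothetical helper: numpy is imported lazily so that skipping the
    sanity check does not require the label file at all.
    """
    if label_path is None or not os.path.isfile(label_path):
        return None
    import numpy as np  # only needed when the check actually runs
    # Assumption: the file holds a plain array; object arrays would need allow_pickle=True.
    return np.load(label_path)


def label_agreement(predicted, labels):
    """Return the fraction of positions where predicted matches labels.

    Returns None when the check is skipped (labels is None) or there is
    nothing to compare.
    """
    if labels is None:
        return None
    pairs = list(zip(predicted, labels))
    if not pairs:
        return None
    matches = sum(1 for p, t in pairs if p == t)
    return matches / len(pairs)


if __name__ == "__main__":
    labels = maybe_load_labels(None)  # no path given, so the check is skipped
    score = label_agreement([0, 1, 2], labels)
    print("sanity check skipped" if score is None else f"agreement: {score:.2f}")
```

With this shape, the extraction code can call the check unconditionally and simply ignore a `None` result when no label path was supplied.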

ryohachiuma commented 1 year ago

Hi rxtan2,

Thank you for your answer. I will use the npy file you provided. Also, in the training script, unet7 is specified for the audio separator. However, there is no function that returns visually-conditioned/text-conditioned spectrogram masks. https://github.com/rxtan2/AVSeT/blob/main/bottleneck_bimodal_mask_cyclic_combined.py#L105

Did you perhaps use BimodalDistillAudioVisual7layerUNet instead of unet7?

ryohachiuma commented 11 months ago

Hi rxtan2,

I have one minor question. During training, the latent concepts are used to train the model instead of the text labels. As far as I understand from the code, the latent captions are also used during evaluation. Is that right?

rxtan2 commented 11 months ago

Hi ryohachiuma,

Sorry for the delay. Regarding your penultimate question, that should be right. You can try using the latent captions during inference, but I mainly experiment with actual language queries. That said, if you are using complex visual scenes and have the computational capacity, it may be worth trying more latent queries. I will find the time to make the code easier to understand and to remove the sanity check from the extraction script. :)