Closed NZqian closed 7 months ago
It seems that the model is conditioned on the text embedding in the config, while the paper concludes that it is better to use the audio embedding. Which one is better?