ivcylc / qa-mdt

OpenMusic: SOTA Text-to-music (TTM) Generation
https://qa-mdt.github.io
MIT License
491 stars 47 forks

Pretrained models #3

Closed Nian-Chen closed 1 month ago

Nian-Chen commented 2 months ago

According to offset_pretrained_checkpoints.json there should be four models, but only one model is available on Baiduyun. Running infer.sh fails when loading some of the models. I hope this can be resolved. Thanks!

ivcylc commented 2 months ago

Hi, you can download them from the link provided in readme.md, as illustrated. They should be: flan-t5-large, clap_music, roberta-base, and others.

Nian-Chen commented 2 months ago

Hi, you can download them from the link provided in readme.md, as illustrated. They should be: flan-t5-large, clap_music, roberta-base, and others.

Hi! An error occurred while running infer.sh:

Non-fatal Warning [dataset.py]: The wav path " " is not find in the metadata. Use empty waveform instead. This is normal in the inference process.
Error encounter during audio feature extraction: mel() takes 0 positional arguments but 5 were given

In theory the wav path should not be used, so where do I need to modify the code?

ivcylc commented 2 months ago

Hi! An error occurred while running infer.sh: Non-fatal Warning [dataset.py]: The wav path " " is not find in the metadata. Use empty waveform instead. This is normal in the inference process. Error encounter during audio feature extraction: mel() takes 0 positional arguments but 5 were given. In theory the wav path should not be used, so where do I need to modify the code?

I also ran into this problem before; it is caused by the version of the librosa library function. My solution was to modify the call directly in the library package: use mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax) instead of mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax).
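For reference, the same fix as a minimal sketch: newer librosa versions (0.10+) make mel() keyword-only, which is exactly what produces the "takes 0 positional arguments" error. The parameter values below are placeholders; in the repo they come from the surrounding vocoder code.

```python
from librosa.filters import mel as librosa_mel_fn

# Example values; in the library/vocoder code these are already defined.
sampling_rate, n_fft, num_mels, fmin, fmax = 16000, 1024, 64, 0, 8000

# Old positional call, which breaks on librosa >= 0.10:
#   mel = librosa_mel_fn(sampling_rate, n_fft, num_mels, fmin, fmax)

# Keyword-argument call, which works on both old and new librosa:
mel = librosa_mel_fn(sr=sampling_rate, n_fft=n_fft, n_mels=num_mels,
                     fmin=fmin, fmax=fmax)
```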

Nian-Chen commented 2 months ago

Thank you for your reply. Now that it's working, there are two things I need to confirm with you:

  1. GPU memory usage is about 25 GB.
  2. It takes about 5 minutes to generate a 10 s audio clip, so the cost is quite high at the moment. Are these numbers normal? Also, do you think this project has any advantages over MusicGen? The sound quality seems better?
ivcylc commented 2 months ago

Thank you for your reply. Now that it's working, there are two things I need to confirm with you:

  1. GPU memory usage is about 25 GB.
  2. It takes about 5 minutes to generate a 10 s audio clip, so the cost is quite high at the moment. Are these numbers normal? Also, do you think this project has any advantages over MusicGen? The sound quality seems better?

Well, that is not normal. For me, it runs inference on an NVIDIA V100 24GB within 25 s, and the memory it uses is probably close to 24 GB. If you're hitting OOM, you might consider running flan-t5 or HiFi-GAN on the CPU and leaving the MDT model part on the GPU.
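A minimal sketch of that partial-offload idea, assuming a PyTorch LatentDiffusion-style module; the attribute names below (model, cond_stage_model, first_stage_model) are illustrative and may not match this repo exactly:

```python
import torch

def offload_encoders_to_cpu(latent_diffusion):
    """Keep the MDT denoiser on the GPU; move rarely-called parts to the CPU."""
    # The diffusion backbone does the heavy per-step sampling work, so keep it on the GPU.
    latent_diffusion.model.to("cuda")
    # The FLAN-T5 text encoder and the VAE / HiFi-GAN decoder are each called only
    # a few times per generation, so CPU inference costs little wall-clock time.
    latent_diffusion.cond_stage_model.to("cpu")
    latent_diffusion.first_stage_model.to("cpu")
    torch.cuda.empty_cache()
    return latent_diffusion
```

Note that inputs to the offloaded submodules then have to be moved to the CPU (and their outputs back to the GPU) at the corresponding call sites.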

Compared with MusicGen, this project should be superior in both the quality of the musical content and its aesthetic musicality. Our approach also innovatively introduces a quality-aware training strategy, with far fewer parameters than MusicGen (675M vs. 3.3B) and an open-source training set.

ivcylc commented 2 months ago

However, we have to admit that our music length is limited to 10 s (it can be extended, but we haven't done it yet). Additionally, you can replace the DDIM inference with a more advanced solver (e.g., DPM-Solver or a consistency model) to improve speed.
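Not this repo's API, just an illustration of such a solver swap in a diffusers-style pipeline (the model id is a placeholder; qa-mdt ships its own DDIM sampler, so the change would have to be made in its sampling code instead):

```python
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# Placeholder model id, not this project's checkpoint.
pipe = DiffusionPipeline.from_pretrained("some-org/some-audio-diffusion-model")

# Reuse the existing noise-schedule config but sample with DPM-Solver++,
# which typically needs ~20-30 steps instead of 200 DDIM steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
out = pipe("an upbeat jazz track with piano", num_inference_steps=25)
```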

JonathanFly commented 2 months ago

However, we have to admit that our music length is limited to 10 s (it can be extended, but we haven't done it yet). Additionally, you can replace the DDIM inference with a more advanced solver (e.g., DPM-Solver or a consistency model) to improve speed.

Does "It can be extended" mean you could train a whole new model on music crops that are longer than 10s? Or does it mean it should be possible to inference from the current model in some way that allows for music longer than 10s? Well probably at least a sliding-window is possible I guess.

ivcylc commented 2 months ago

However, we have to admit that our music length is limited to 10 s (it can be extended, but we haven't done it yet). Additionally, you can replace the DDIM inference with a more advanced solver (e.g., DPM-Solver or a consistency model) to improve speed.

Does "It can be extended" mean you could train a whole new model on music crops longer than 10 s? Or does it mean it should be possible to run inference with the current model in some way that allows music longer than 10 s? Well, probably at least a sliding window is possible, I guess.

There are two types of technical solutions; both work, they just have different effects. Our U-Net-based model can generate longer than 10 s zero-shot, and I will share my checkpoint~ There are also papers submitted to ICASSP 2025 showing how to fine-tune it to generate longer audio, please just wait~

diggle001 commented 1 month ago

Thank you for your reply. Now that it's working, there are two things I need to confirm with you:

  1. GPU memory usage is about 25 GB.
  2. It takes about 5 minutes to generate a 10 s audio clip, so the cost is quite high at the moment. Are these numbers normal? Also, do you think this project has any advantages over MusicGen? The sound quality seems better?

Well, that is not normal. For me, it runs inference on an NVIDIA V100 24GB within 25 s, and the memory it uses is probably close to 24 GB. If you're hitting OOM, you might consider running flan-t5 or HiFi-GAN on the CPU and leaving the MDT model part on the GPU.

Compared with MusicGen, this project should be superior in both the quality of the musical content and its aesthetic musicality. Our approach also innovatively introduces a quality-aware training strategy, with far fewer parameters than MusicGen (675M vs. 3.3B) and an open-source training set.

Hello, I used your Gradio code to test music generation. It took about 6 minutes on an A100 and occupied about 50 GB of GPU memory. I observed that while generating, GPU memory utilization was very low in the first period and became very high at the end. Do you know how to solve this?

ivcylc commented 1 month ago

What do you mean by "the first period"?

diggle001 commented 1 month ago

What do you mean by "the first period"?

I mean the initial part of the whole generation process (the period shortly after music generation starts).

ivcylc commented 1 month ago

I have no idea why. Does it exceed the A100 80 GB memory limit?

diggle001 commented 1 month ago

I have no idea why. Does it exceed the A100 80 GB memory limit?

It does not exceed it; the GPU memory occupied stays at about 50 GB.

ivcylc commented 1 month ago

Sorry, I don't know why. Maybe you can check your pipeline against this YouTube inference pipeline.

diggle001 commented 1 month ago

Sorry, I don't know why. Maybe you can check your pipeline against this YouTube inference pipeline.

OK, thanks for your reply

diggle001 commented 1 month ago

By the way, when I try the demo URL you provided to generate music, an error is reported after it runs for 120 seconds: error: GPU task aborted.

diggle001 commented 1 month ago

Thank you for your reply. Now that it's working, there are two things I need to confirm with you:

  1. GPU memory usage is about 25 GB.
  2. It takes about 5 minutes to generate a 10 s audio clip, so the cost is quite high at the moment. Are these numbers normal? Also, do you think this project has any advantages over MusicGen? The sound quality seems better?

@Nian-Chen Hello, have you solved the problem of music generation taking a long time?

ivcylc commented 1 month ago

@jadechoghari Hi jade, can you help check?

ivcylc commented 1 month ago

Thank you for your reply. Now that it's working, there are two things I need to confirm with you:

  1. GPU memory usage is about 25 GB.
  2. It takes about 5 minutes to generate a 10 s audio clip, so the cost is quite high at the moment. Are these numbers normal? Also, do you think this project has any advantages over MusicGen? The sound quality seems better?

@Nian-Chen Hello, have you solved the problem of music generation taking a long time?

I have to point out that inference on my A100 finishes within 20 s; anything taking much longer must mean something is set up improperly.

jadechoghari commented 1 month ago

Hi!

diggle001 commented 1 month ago

Hi, my GPU is an A100-SXM4-80GB. After I reinstalled the base environment and re-ran gradio_app, it now takes only 135 s, which is still a long way from the 20 s you mentioned. GPU memory usage stays constant at 27091 MiB. Below is the log output from a run. I found that some models are reloaded every time a generation request is made, and this takes a long time. Does this happen to you?

Seed set to 0
The input shape to the diffusion model is as follows:
xc torch.Size([3, 8, 256, 16])
t torch.Size([3])
context_0 torch.Size([3, 1, 1024]) torch.Size([3, 1])
INFO: clap model calculate the audio embedding as condition
Similarity between generated audio and text tensor([0.3477, 0.3925, 0.3128], device='cuda:0')
Choose the following indexes: [1]
debug_name : awesome.wav
Waveform saved at -> ./awesome.wav
Plotting: Restored training weights
Add-ons: [<function waveform_rs_48k at 0x7f0b7c3bf130>]
Dataset initialize finished
Reload ckpt specified in the config file ./qa_mdt/checkpoint_389999.ckpt
LatentDiffusion: Running in eps-prediction mode
mask ratio: 0.3 decode_layer: 8
DiffusionWrapper has 676.25 M params.
Keeping EMAs of 489.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 8, 64, 64) = 32768 dimensions.
making attention of type 'vanilla' with 512 in_channels
loaded pretrained LPIPS loss from taming/modules/autoencoder/lpips/vgg.pth
Removing weight norm...
Initial learning rate 1e-05
--> Reload weight of autoencoder from ./qa_mdt/checkpoints/hifi-gan/checkpoints/vae_mel_16k_64bins.ckpt
Waveform save path:  ./log/latent_diffusion/qa_mdt/mos_as_token/val_0_10-12-01:58_cfg_scale_3.5_ddim_200_n_cand_3
Plotting: Switched to EMA weights
Non-fatal Warning [dataset.py]: The wav path "  " is not find in the metadata. Use empty waveform instead. This is normal in the inference process.
Use ddim sampler
Data shape for DDIM sampling is (3, 8, 256, 16), eta 1.0
Running DDIM Sampling with 200 timesteps
DDIM Sampler: 100%|██████████| 200/200 [00:33<00:00,  5.94it/s]
ivcylc commented 1 month ago

This situation is normal and correct. As you can see, the model needs some time to set up. After the setup, it takes 33 seconds to generate one audio clip. If you want to generate N audio clips, it will need time(setup) + N * 33 seconds on your machine.

diggle001 commented 1 month ago

Hi, after I start the Gradio service, each request takes about 120 seconds. Shouldn't all the models be loaded on the first request? Then there would be no need to load them again, and subsequent requests should take about 33 seconds, as you said.

ivcylc commented 1 month ago

I will update the code today

jadechoghari commented 1 month ago

Hi @diggle001, the online demo is meant to give you an idea of how the model works and performs. To gain full control (over speed, etc.), I suggest building the model locally following this documentation: https://huggingface.co/jadechoghari/openmusic. Let us know how much time generation takes.

ivcylc commented 1 month ago

Hi, after I start the Gradio service, each request takes about 120 seconds. Shouldn't all the models be loaded on the first request? Then there would be no need to load them again, and subsequent requests should take about 33 seconds, as you said.

For a single setup and continuous inference, I recommend running inference locally. After you set up the environment from the requirement.txt in the gradio folder, all you need to do is put every caption you want to generate audio for into the list_inference file prompts/good_prompts_1.lst referenced in infer.sh, and then run infer.sh instead.
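A minimal sketch of that batch workflow (the file names come from the comment above; driving infer.sh via subprocess is just one convenient option, and the captions are made up):

```python
import subprocess
from pathlib import Path

# Captions to render; one prompt per line in the .lst file.
captions = [
    "a calm piano melody with soft strings",
    "an energetic electronic dance track with heavy bass",
]

prompt_file = Path("prompts/good_prompts_1.lst")
prompt_file.parent.mkdir(parents=True, exist_ok=True)
prompt_file.write_text("\n".join(captions) + "\n")

# Models are loaded once; each caption then costs roughly the per-sample
# sampling time reported above (~30 s on an A100).
subprocess.run(["bash", "infer.sh"], check=True)
```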

diggle001 commented 1 month ago

Hi, I have now changed the code. Excluding the initial model loading time, each subsequent inference takes about 30 s.

jadechoghari commented 1 month ago

What was the issue?

cc @ivcylc