iiscleap / ZEST

Zero-Shot Emotion Style Transfer

How is the `f0.pickle` file generated? #8

Open jishnub opened 7 months ago

jishnub commented 7 months ago

This file is read in at https://github.com/iiscleap/ZEST/blob/255a7e506e777bf47fb4b5266dc1a49dff126d60/code/F0_predictor/config.py#L10 https://github.com/iiscleap/ZEST/blob/255a7e506e777bf47fb4b5266dc1a49dff126d60/code/F0_predictor/pitch_attention_adv.py#L53-L54 and appears to contain a dictionary of the form `filename: vector`. How are these vectors generated? I assume they would need to be regenerated when using audio files not present in the original dataset?

From the paper, these are probably generated using the YAAPT algorithm? Could you point to existing code for this?

Is this generated using the `get_yaapt_f0` function? https://github.com/iiscleap/ZEST/blob/255a7e506e777bf47fb4b5266dc1a49dff126d60/code/HiFi-GAN/dataset.py#L27-L43

The results I get seem a bit different.

iiscleap commented 3 months ago

You can use any pitch-computation algorithm, e.g. from librosa or YAAPT, to compute the pitch. There might be minor differences between them, but they did not affect the generation quality for me. I used pYAAPT because it was faster than librosa. I don't remember the exact parameters I used, but the code snippet above should work just fine.

Let me know if you face any issues using that code snippet.