Stochastic speaking styles and unpredictable uh's & umm's

I admit that Bark produces realistic-sounding speech.

But there are a couple of issues, listed below. Could someone please help me fix them?

The output speech contains many uh's and umm's even when none are present in the text.
The speaking style, and sometimes even the voice, differs for the same speaker (passed via the history_prompt argument of the semantic_to_waveform() function) and also varies across inference runs. This reminds me of Stable Diffusion, which has a generator argument to make outputs reproducible. It would be great to have such an argument here.
Sometimes there is a long pause before the sentence is spoken. (I am running long_form_generation.ipynb, which splits the text into sentences using NLTK to avoid abruptness in long speech.)
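
To illustrate the second point, the kind of reproducibility I have in mind could be approximated by seeding the random number generators before each generation call. This is only a minimal sketch, assuming Bark samples through the global PyTorch (and possibly NumPy) RNGs, which I have not verified. seed_everything is a hypothetical helper name of my own, not part of Bark.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every global RNG that is available (hypothetical helper, not a Bark API)."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Assumed usage: reseed with the same value before every generation call, e.g.
#   seed_everything(1234)
#   ... then call the notebook's generation functions with the same history_prompt
seed_everything(1234)
```

Even if this works, some GPU kernels are non-deterministic, so the outputs may still differ slightly between runs or machines.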

suno-ai / bark