gianlucatruda / TableDiffusion

Generating tabular datasets under differential privacy
https://arxiv.org/abs/2308.14784
GNU General Public License v3.0

Question about timestep used #1

Closed: zealscott closed this issue 6 months ago

zealscott commented 7 months ago

Hi,

Thanks for your great work and code! I have a small question about the timesteps used here: https://github.com/gianlucatruda/TableDiffusion/blob/67cef2a1cd2948ad0d6bef39ffbe2a8df3d0b88a/tablediffusion/models/table_diffusion.py#L84 TableDiffusion seems to use a very small number of steps for training and sampling, while typical diffusion models (like DDPM and TabDDPM) need ~100 to ~1k timesteps. I just want to know: what is the consideration behind this? Why might a larger number of timesteps not work for TableDiffusion?

Thank you and looking forward to your reply :)

gianlucatruda commented 7 months ago

Hi, @zealscott

Thanks for reaching out. Apologies for not getting back sooner — somehow missed the notification until yesterday.

tl;dr: I did some experimenting, and fewer steps worked as well as or better than more steps. I have some guesses as to why, but I didn't get to look into it thoroughly.


I talk about this briefly on page 31 of the paper. But, to elaborate:

When developing TableDiffusion, I did a fair bit of experimenting with schedulers and diffusion steps. I read in Nichol and Dhariwal's paper [1] about the performance improvements from using a cosine scheduler (over a linear one) and implemented that.
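
For context, that schedule defines the cumulative signal level (alpha-bar) with a squared cosine and recovers the per-step betas from ratios of consecutive values. A minimal NumPy sketch of it (my paraphrase of [1], not the code from this repo):

```python
import numpy as np

def cosine_beta_schedule(T: int, s: float = 0.008) -> np.ndarray:
    """Cosine noise schedule from Nichol & Dhariwal [1].

    alpha_bar follows a squared cosine in t, and the per-step betas are
    recovered from ratios of consecutive alpha_bar values.
    """
    t = np.arange(T + 1)
    f = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])
    return np.clip(betas, 0, 0.999)  # clipped, as in [1], to avoid a singularity at t = T
```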

In my initial implementation, I was doing a DP-SGD update between each diffusion step, so there was good motivation to keep the step count as low as possible. In the improved version (which is in the paper and code), I aggregated over the diffusion steps and then took a single DP-SGD update — this is much more privacy-efficient and decouples the privacy loss from the number of diffusion steps.
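
Roughly, the improved version looks like this (a minimal sketch of the idea, not the repo's exact code; `model`, `dp_optimizer` (e.g. an Opacus-wrapped optimizer), and the `alpha_bar` tensor are assumed to be set up elsewhere):

```python
import torch
import torch.nn.functional as F

def training_step(model, dp_optimizer, x0, alpha_bar, diffusion_steps):
    """Accumulate the denoising loss over ALL diffusion steps, then take a
    single DP-SGD update, so the privacy cost is paid once per batch
    rather than once per diffusion step."""
    dp_optimizer.zero_grad()
    loss = 0.0
    for t in range(diffusion_steps):
        eps = torch.randn_like(x0)
        # Closed-form forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps
        x_t = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
        loss = loss + F.mse_loss(model(x_t, t), eps)
    loss.backward()
    dp_optimizer.step()  # one clipped-and-noised gradient update per batch
    return loss.item()
```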

So that left the question of how many steps. From a theory perspective, the network is small and not very deep (compared to diffusion models in the image domain), so I expected that fewer steps would be necessary. I ran some experiments (see Section 6.5 of the paper) with various numbers of diffusion steps and noticed that optimal performance was always well under 10 steps with the scheduler I implemented.
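
One intuition for why so few steps can suffice (my illustration, not a claim from the paper): under the cosine schedule, even a handful of steps traverses nearly the full signal-to-noise range:

```python
import numpy as np

# Cumulative signal level alpha_bar under the cosine schedule for T = 5
# steps (s = 0.008, as in the sketch above).
T, s = 5, 0.008
t = np.arange(T + 1)
alpha_bar = np.cos(((t / T) + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar /= alpha_bar[0]
print(np.round(alpha_bar, 3))  # [1.    0.899 0.647 0.341 0.094 0.   ]
```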

I only discovered the TabDDPM work shortly before publishing the pre-print, but noticed that they used many more steps. If I had to guess, I'd assume it's because their implementation more directly follows the original diffusion work in the image domain, whereas my implementation for TableDiffusion was much leaner and stripped back. But I didn't get the chance to look into any of that thoroughly.

As I hint at towards the end of the paper, one of the most interesting lines of future work for TableDiffusion would be to explore how different schedulers and step counts affect performance. But that was a bit beyond the scope of that work. Would love to know what you find if you experiment with it!

I hope that clarifies. Let me know if you have further questions.


[1] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

zealscott commented 7 months ago

Hi @gianlucatruda

Thanks for your detailed reply! I agree that tabular data synthesis may not need many steps to achieve good performance. TabDDPM may need far more timesteps because (1) it uses a deeper neural network than TableDiffusion, and (2) it has a separate multimodal diffusion process.

Interestingly, we find that although diffusion models perform astonishingly well at tabular data synthesis, performance drops dramatically when DP-SGD is added. We also find that CTGAN may not be a good baseline, since the generator cannot learn any useful information about the marginal distributions during training! More findings are in our paper (just released a few days ago!):

arxiv.org/abs/2402.06806

We have also integrated your work into our benchmark:

https://github.com/zealscott/SynMeter

Thanks again for your great work!

gianlucatruda commented 6 months ago

@zealscott That's awesome! Thanks for letting me know. I'll check it out.