gsbDBI / ds-wgan

Design of Simulations using WGAN
MIT License

Feature request: integer data #13

Open michaelpollmann opened 1 year ago

michaelpollmann commented 1 year ago

It would be great if there were a way to simulate integer / ordered categorical data, say age in years. Treating it as a categorical variable seems to yield data sets where other variables are less smooth in age than desired, and it probably also increases the complexity of the training task (by turning each value into a dummy?). Treating it as a continuous variable requires rounding ex post, but ideally the rounding would happen already during training?

Jonas-Metzger commented 1 year ago

You can't just insert a rounding step during training, because the sampling step must remain differentiable. With the current package, I'd recommend treating integers as continuous variables and rounding ex-post. For ordered categoricals, I'd recommend mapping them to integers and doing the same. We could automate this by modifying the DataWrapper.
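A minimal sketch of that ex-post treatment (illustrative only - the DataFrame, column names, and category labels are made up, and the generator output is mocked rather than produced by this package):

```python
import pandas as pd

# Mock generator output (illustrative): "age" was modeled as continuous,
# "education" is an ordered categorical that was mapped to integer codes
# 0..3 before training and modeled as continuous as well.
fake = pd.DataFrame({"age": [34.97, 51.02, 28.61],
                     "education": [1.9, 0.1, 3.2]})

# Ex-post rounding: snap the continuous draws back onto the integer grid.
fake["age"] = fake["age"].round().astype(int)

# For the ordered categorical, round to the nearest code and map back to labels.
labels = ["none", "highschool", "college", "graduate"]
fake["education"] = (fake["education"].round()
                     .clip(0, len(labels) - 1)
                     .astype(int)
                     .map(dict(enumerate(labels))))
print(fake)
```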

In your experience, is there anything wrong with the data generated that way? I wouldn't expect there to be: if the GAN is flexible enough, it should spit out numbers that are very close to integers anyway.

If you observe specific failure cases with that approach, we can think about modified sampling procedures for integers during training. But every approach that immediately comes to mind ends up being either very close to the categorical case or to the continuous case with ex-post rounding, which is why I haven't considered it worth implementing so far.

michaelpollmann commented 1 year ago

I was hoping one could do something like a soft-round analog of round, similar to how softmax relates to max. TensorFlow even seems to have such a function, soft_round: https://www.tensorflow.org/api_docs/python/tfc/ops/soft_round which I found here: https://stackoverflow.com/a/74474397 along with its Python code.
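For reference, a PyTorch sketch of such a soft-round, following the formula used by tensorflow-compression's soft_round (alpha controls how sharply values are pulled toward integers; this is only an illustration, not code from this package):

```python
import math
import torch

def soft_round(x, alpha=5.0):
    # Differentiable approximation of round(): ~identity as alpha -> 0,
    # hard rounding as alpha -> inf (formula as in tensorflow-compression).
    m = torch.floor(x) + 0.5          # midpoint of the unit interval containing x
    r = x - m                         # offset in [-0.5, 0.5)
    return m + torch.tanh(alpha * r) / (2 * math.tanh(alpha / 2))

x = torch.tensor([33.2, 33.5, 34.9])
print(soft_round(x, alpha=10.0))      # ~tensor([33.0024, 33.5000, 34.9997])
```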

Doing the rounding in the DataWrapper (irrespective of whether a soft-round function is implemented) might be convenient in particular for hierarchical/multi-step models, where the later steps are trained on the actual integer data; there is a risk that, when generating data, one forgets to round in between steps. For instance, if I have trained B|A and then C|A,B with an integer B, I would generate B|A first, but then should probably round the artificial B before generating C. Doing the rounding automatically would avoid "user error" like not rounding in between steps.
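A schematic of that multi-step concern, with hypothetical stand-in generators (made up for illustration, not the package API) - the point is only the rounding step between the two generation calls:

```python
import torch

# Hypothetical stand-ins for the two trained conditional generators.
def generate_B_given_A(A):
    # returns *continuous* draws of the integer-valued variable B
    return A.float() / 2 + torch.randn(A.shape)

def generate_C_given_AB(A, B):
    # was trained on the *actual integer* values of B
    return A.float() + B + torch.randn(B.shape)

A = torch.randint(0, 10, (5,))
B_fake = generate_B_given_A(A)
B_fake = B_fake.round()               # the step that is easy to forget
C_fake = generate_C_given_AB(A, B_fake)
```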

I have a data set with some integer, some binary, and some continuous variables. Loosely speaking, when looking at the conditional means of one variable given another, the curves look more different (between real and simulated data) when the integer variable is involved. But of course there may be many other reasons for that, so I'll need to do more testing to see what's really going on. My expectation is that some "internal" rounding would likely yield a better result, though I don't know whether that improvement is worth the effort of implementing it.

Jonas-Metzger commented 1 year ago

Yeah, I'd say using the soft-round function would be pretty close to the continuous case with ex-post rounding. After all, you still generate a continuous variable with real-valued support, and if you want it to be an actual integer you have to round ex-post.

On the one hand, soft-round may "help" training because the generator won't have to learn to produce numbers close to integers. On the other hand, it may also hurt training because the generator still has to learn to produce the right integers, and soft-round doesn't do a good job of propagating gradients - they either get pushed close to zero or get amplified a lot:

[figure: illustration of the soft-round gradient behavior described above]
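For concreteness, a small autograd check consistent with that picture (soft_round redefined here from the sketch in the earlier comment; gradient values are approximate):

```python
import math
import torch

# soft_round as sketched in the earlier comment
def soft_round(x, alpha=5.0):
    m = torch.floor(x) + 0.5
    return m + torch.tanh(alpha * (x - m)) / (2 * math.tanh(alpha / 2))

x = torch.tensor([34.02, 34.25, 34.50], requires_grad=True)
soft_round(x, alpha=10.0).sum().backward()
print(x.grad)   # ~[0.0014, 0.13, 5.0]: near zero close to the integer, amplified at the midpoint
```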

Another option would be the following function with biased gradients:

gradround = lambda x: x.round()-x.detach()+x

which is equal to x.round() in the forward pass but has the same gradients as the identity function. This avoids the problem above and spits out actual integers, which is why tricks like that are more popular than smooth approximations for applications with discretized outputs. But biased gradients can, at least in theory, still make it harder for SGD to work well, so I'd avoid it as long as it's unnecessary - and in theory, the current WGAN should be consistent even with discrete support.
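A quick check of that straight-through behavior - the forward pass returns actual integers while the backward pass sees a gradient of one:

```python
import torch

gradround = lambda x: x.round() - x.detach() + x   # as defined above

x = torch.tensor([33.2, 34.9], requires_grad=True)
y = gradround(x)
y.sum().backward()
print(y)        # tensor([33., 35.], grad_fn=<AddBackward0>) -- actual integers forward
print(x.grad)   # tensor([1., 1.])                           -- identity gradient backward
```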

But yeah, I'm not surprised that training is a bit harder for integers. Does the slightly worse fit you observe for integers go away if you just train longer and increase the GAN size? Otherwise, we could use your experiment as a testing ground to see whether a gradround or softround would improve things - happy to help with that.

michaelpollmann commented 1 year ago

That's a good explanation, thank you! I'll try to do some more testing with my integer data, but might not get to it until some time in December.