gretelai / gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.
https://gretel.ai/platform/synthetics

List index out of range #151

Closed Vathsa28 closed 1 year ago

Vathsa28 commented 1 year ago

Hi, I am getting this error while training on a financial dataset of mine. I was playing with the hyperparameters to generate the data, and was not necessarily getting good results. But when I increased the generator rounds and discriminator rounds to 2, this error pops up after a few epochs of training (22-24). It seems the length of inputs drops from 3 to 0 for that particular epoch. How do I fix this issue? Also, what parameters do you recommend for training on and generating bank transactional data (with many attributes and features)?

```
File ~/anaconda3/envs/dgtpy39env/lib/python3.9/site-packages/gretel_synthetics/timeseries_dgan/dgan.py:851, in DGAN._discriminate(self, batch)
    849 # Flatten the features
    850 print("Hi ", len(inputs))
    851 inputs[-1] = torch.reshape(inputs[-1], (inputs[-1].shape[0], -1))
    853 input = torch.cat(inputs, dim=1)
    855 output = self.feature_discriminator(input)

IndexError: list index out of range
```

Is the data somehow getting changed along the way? Please help.

I am able to run up to 1200 epochs without issue when I don't touch the generator and discriminator rounds parameters.
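
(For reference, a minimal sketch of what appears to be happening, assuming inputs really does end up empty as described above: indexing an empty list with -1 raises exactly this IndexError.)

```python
import torch

inputs = []  # simulate the inputs list shrinking to length 0, as observed above
try:
    # the failing pattern from dgan.py:851; inputs[-1] on the right-hand side
    # is evaluated first and raises before torch.reshape is ever called
    inputs[-1] = torch.reshape(inputs[-1], (inputs[-1].shape[0], -1))
except IndexError as e:
    print(e)  # prints: list index out of range
```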

santhosh97 commented 1 year ago

Hey @Vathsa28! Would you mind providing more details about the size of your dataset and parameters such as the max sequence length, sample_len, and the shape of the dataframe? One thing that could be happening is that the model is producing NaN values, possibly due to mode collapse, and those values are being filtered out, which would change the size of the list. To suggest better parameters, could you describe your dataset in more detail, i.e., how many features? How many attributes? Time series length?
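
(One quick sanity check along these lines: a hypothetical helper, not part of the library. NaNs produced by the model mid-training would not show up here, but this rules out NaNs in the raw inputs.)

```python
import numpy as np

def check_for_nans(features: np.ndarray, attributes: np.ndarray) -> None:
    """Flag NaNs in the raw training arrays before calling train."""
    print("NaNs in features:  ", bool(np.isnan(features).any()))
    print("NaNs in attributes:", bool(np.isnan(attributes).any()))
```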

Vathsa28 commented 1 year ago

Sure. Features: array of shape (34720, 50, 2); attributes: array of shape (34720, 14); time series length: 50. The features are Node2Vec embeddings. We are using NumPy arrays instead of dataframes. Max sequence length = 50, sample_len = 5, batch size = 100, and I set the number of layers to 16 for both the generator and discriminator, with 400 units each.
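
(For concreteness, this setup would look roughly like the following. A sketch only: it assumes the DGANConfig fields and DGAN.train_numpy API from gretel_synthetics.timeseries_dgan, the .npy file names are placeholders, and the layer-size fields are my guess at how "16 layers, 400 units" maps onto the config.)

```python
import numpy as np
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

features = np.load("features.npy")      # shape (34720, 50, 2), Node2Vec embeddings
attributes = np.load("attributes.npy")  # shape (34720, 14); file names are placeholders

config = DGANConfig(
    max_sequence_len=50,
    sample_len=5,
    batch_size=100,
    generator_rounds=2,        # the change that triggers the error above
    discriminator_rounds=2,
    feature_num_layers=16,     # "16 layers ... 400 units each", assuming these
    feature_num_units=400,     # fields control the generator size
)
model = DGAN(config)
model.train_numpy(features=features, attributes=attributes)
```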

kboyd commented 1 year ago

Thanks for the details, @Vathsa28!

Nothing jumps out at me from your setup that would lead to this error. So my general suggestion in this situation is to try lower learning rates for both the generator and discriminator. With smaller updates each batch, the model will hopefully be better behaved, though it will probably take more epochs to learn effectively.
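
(For example, something along these lines; a sketch assuming DGANConfig exposes generator_learning_rate and discriminator_learning_rate, with illustrative values rather than recommendations.)

```python
from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=50,
    sample_len=5,
    batch_size=100,
    generator_learning_rate=1e-5,      # illustrative lower values; tune as needed
    discriminator_learning_rate=1e-5,
)
```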

The other place I'd recommend experimenting is the size of the networks. Sixteen layers with 400 units each is fairly large for DGAN. Compare that to the original DoppelGANger paper, which, for the WWT dataset (roughly similar in size to yours), used 1 layer with 100 units for the feature generator, 3 layers with 100 units each for the attribute generator, and 5 layers with 200 units each for the discriminator. I don't have any particular reason to think this would fix your error, but it's worth a try. You can also start with smaller networks that train faster, and only move to a larger network if your task genuinely requires deeper and wider models. A config sketch follows below.
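
(In config terms, that might look like the following; a sketch assuming DGANConfig exposes the feature_num_layers / feature_num_units / attribute_num_layers / attribute_num_units fields for the generators. The discriminator size may not be configurable this way.)

```python
from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=50,
    sample_len=5,
    batch_size=100,
    feature_num_layers=1,      # paper's WWT setup: feature generator, 1 layer x 100 units
    feature_num_units=100,
    attribute_num_layers=3,    # attribute generator, 3 layers x 100 units
    attribute_num_units=100,
)
```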

kboyd commented 1 year ago

Closing this issue since it's been a month+. Please reopen if you're still seeing this error.