sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

limit on the number of columns of a table #95

Closed nisarkhanatwork closed 3 years ago

nisarkhanatwork commented 3 years ago

I tried to fit data that has 15-16k columns; CTGAN tried to allocate about 20 TB of memory and failed with the runtime error shown below. Is there any limit on the number of columns we can have?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-5b096eea5d68> in <module>
----> 1 ctgan.fit(exprT, discrete_columns, epochs = 5)

~/anaconda3/envs/genome/lib/python3.6/site-packages/ctgan/synthesizer.py in fit(self, train_data, discrete_columns, epochs, log_frequency)
    170                 self.embedding_dim + self.cond_generator.n_opt,
    171                 self.gen_dim,
--> 172                 data_dim
    173             ).to(self.device)
    174 

~/anaconda3/envs/genome/lib/python3.6/site-packages/ctgan/models.py in __init__(self, embedding_dim, gen_dims, data_dim)
     67             seq += [Residual(dim, item)]
     68             dim += item
---> 69         seq.append(Linear(dim, data_dim))
     70         self.seq = Sequential(*seq)
     71 

~/anaconda3/envs/genome/lib/python3.6/site-packages/torch/nn/modules/linear.py in __init__(self, in_features, out_features, bias)
     76         self.in_features = in_features
     77         self.out_features = out_features
---> 78         self.weight = Parameter(torch.Tensor(out_features, in_features))
     79         if bias:
     80             self.bias = Parameter(torch.Tensor(out_features))

RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 21584386400400 bytes. Error code 12 (Cannot allocate memory)
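
For context, the failed allocation corresponds to a single weight matrix of that Linear layer. A quick back-of-the-envelope check (assuming 4-byte float32 weights, PyTorch's default) shows what the reported byte count implies:

# Rough sanity check of the failed allocation (not part of CTGAN itself).
requested_bytes = 21584386400400        # from the RuntimeError message above
bytes_per_float32 = 4                   # default dtype of nn.Linear weights

print(requested_bytes // bytes_per_float32)   # ~5.4 trillion elements in one weight matrix
print(requested_bytes / 1024 ** 4)            # ~19.6 TiB requested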
Baukebrenninkmeijer commented 3 years ago

There is no hard limit on the number of columns, but there is a practical limit on the size of the underlying encoding you end up with. Since you're running into hardware limits, the practical limit also depends on the hardware you are running on. However, no hardware can support an allocation of roughly 20 TB. With 15k columns it is always going to be quite hard. You can try a very small batch size (20 samples, for example) and see if that works.

However, it seems it's failing in the Linear layer, meaning that it cannot allocate the weight matrix of the output Linear layer, which maps the generator's hidden representation (256 units by default) onto the encoded data dimension. In that case the batch size won't matter, but you can try reducing the size of the latent vector and the generator layers to something much lower than 256. You would need roughly a 100x reduction, which would bring the requirement down to around 200 GB; with further tuning you might get into normal GPU VRAM range.
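
For reference, here is a minimal sketch of what shrinking the network and the batch size could look like. It assumes the constructor of CTGANSynthesizer in the installed version accepts embedding_dim, gen_dim, dis_dim and batch_size keyword arguments (the attribute names in the traceback suggest it does, but check your version's signature), and it reuses the variable names from the snippet above:

from ctgan import CTGANSynthesizer

# Hypothetical downsized configuration; keyword names may differ between
# CTGAN versions, so verify against help(CTGANSynthesizer).
ctgan = CTGANSynthesizer(
    embedding_dim=16,      # latent vector size (default 128)
    gen_dim=(32, 32),      # generator residual layer sizes (default (256, 256))
    dis_dim=(32, 32),      # discriminator layer sizes (default (256, 256))
    batch_size=20,         # much smaller than the default of 500
)
ctgan.fit(exprT, discrete_columns, epochs=5)

Keep in mind that layers this small also limit what the model can learn, so this only addresses the memory error, not the modeling difficulty.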

In general, I think it's best if you try to reduce your 16k columns.

nisarkhanatwork commented 3 years ago

Thank you for answering.

However, it seems it's failing in the Linear layer, meaning that it cannot allocate the weight matrix of the output Linear layer, which maps the generator's hidden representation (256 units by default) onto the encoded data dimension. In that case the batch size won't matter, but you can try reducing the size of the latent vector and the generator layers to something much lower than 256. You would need roughly a 100x reduction, which would bring the requirement down to around 200 GB; with further tuning you might get into normal GPU VRAM range.

In general, I think it's best if you try to reduce your 16k columns.

Since this is gene expression data, I am thinking all columns will help in generating more realistic data. Quoting @csala from this issue: https://github.com/sdv-dev/CTGAN/issues/64#issuecomment-724921556

In any case, in the future we do not discard working on the possibility of sampling from N random variables conditioning on M other variables, but that would be something that escapes the current CTGAN architecture and which would probably need a major rework of the project, so I think that we can close this issue and open a new one with all the details when the time comes to work on it.

I understood that it would be good to use all the variables...

I also tried to use the same data frame but by reducing the prediction to one column

from ctgan import CTGANSynthesizer

# Fit on the same dataframe, but pass only the first discrete column
ctgan = CTGANSynthesizer()
ctgan.fit(dataframe, discrete_columns[0:1], epochs=5)

but I am encountering the following error: https://github.com/sdv-dev/CTGAN/issues/93#issue-747990431

Thank you

csala commented 3 years ago

Hi @nisarkhanatwork, I'm a bit confused about what you are trying to achieve, especially regarding this:

I also tried to use the same data frame but by reducing the prediction to one column

Would you mind providing some clarification about what those 16k columns are (data types, etc.) and what your goal is?

On the other hand, as @Baukebrenninkmeijer mentioned, there is no "hard limit" on the number of rows or columns that you can try to learn. But even if CTGAN managed to fit all the tensors in VRAM, learning the correlations across 16k columns sounds like too much for a model like CTGAN to handle, so you may really need to reduce the size of your data.

nisarkhanatwork commented 3 years ago

"I also tried to use the same data frame but by reducing the prediction to one column"

Regarding this, I tried generating data for 1 column and for 10 columns, but got the errors referred to in https://github.com/sdv-dev/CTGAN/issues/93#issue-747990431

Would you mind providing some clarification about what those 16k columns are (data types, etc.) and what your goal is?

I have gene expression data with 16k different genes for each sample, and each gene expression value is an integer. Since this is gene expression data, I am thinking all columns will help in generating more realistic data.

On the other hand, as @Baukebrenninkmeijer mentioned, there is no "hard limit" on the number of rows or columns that you can try to learn. But even if CTGAN managed to fit all the tensors in VRAM, learning the correlations across 16k columns sounds like too much for a model like CTGAN to handle, so you may really need to reduce the size of your data.

Can you please tell me some techniques to reduce the size?

csala commented 3 years ago

"I also tried to use the same data frame but by reducing the prediction to one column"

Regarding this, I tried generating data for 1 column and for 10 columns, but got the errors referred to in #93 (comment)

I think there may be some misunderstanding about what CTGAN does. CTGAN is not a supervised predictive model that you can use to obtain the value of one or more columns based on the values of the other columns.

What CTGAN does is generate entire rows that are as indistinguishable as possible from the ones seen in the real dataset: if your dataset has 10 columns, CTGAN will sample rows with 10 variables; if your dataset has 16k columns, CTGAN will sample rows with 16k variables.

And, at most, CTGAN allows you to pass the value that you want for one of the columns in order to increase the probability of that value showing up in the generated rows.
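
As a minimal sketch of that workflow (assuming a CTGAN version whose sample() accepts condition_column and condition_value arguments; older releases only take the number of rows, so check your installed version), where train_data is a placeholder for your dataframe and 'gene_x' a hypothetical discrete column:

from ctgan import CTGANSynthesizer

ctgan = CTGANSynthesizer()
ctgan.fit(train_data, discrete_columns, epochs=5)

samples = ctgan.sample(100)      # 100 full synthetic rows
biased = ctgan.sample(
    100,
    condition_column='gene_x',   # hypothetical discrete column
    condition_value=1,           # make this value more likely in the output
)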

Since this is gene expression data, I am thinking all columns will help in generating more realistic data.

So, continuing from the comment above, I'm afraid that having more columns will not make the sampled data more realistic, but rather the opposite: CTGAN will just learn to replicate what it has seen beforehand, and the more columns it has to learn, the harder that is for it to do.

In any case, it is not completely clear to me what you are trying to achieve. Can you describe it with some more detail? What is the outcome that you expect from CTGAN?

nisarkhanatwork commented 3 years ago

In any case, it is not completely clear to me what you are trying to achieve. Can you describe it with some more detail? What is the outcome that you expect from CTGAN?

I am treating my data as a table with 16k columns and 236 rows, and I want to generate new rows using CTGAN. I am expecting CTGAN to learn from the given 236 rows of gene expressions and generate new ones. As far as I know, the earlier tools that generate synthetic gene data are not based on GANs, and I am thinking CTGAN will really help in this area.

Here is the link to colab notebook:

https://colab.research.google.com/drive/1Imuu8YiVKDIFWunu9dGW4xHFvGWuTWR2?usp=sharing

Here is the data for your perusal:

https://drive.google.com/drive/folders/1TWkmk4VYIu8cE1thXdkqXmAejVZNNBea?usp=sharing

Please see the following link showing that there are ongoing efforts to use simulated gene data: https://builtin.com/artificial-intelligence-machine-learning/geneticists-turn-deep-learning-algorithms-genome-pattern

csala commented 3 years ago

Hi @nisarkhanatwork

I'm afraid CTGAN (or any other synthetic data modeling engine) cannot successfully handle a scenario like the one you are describing.

This is not because of the resources, but because 236 rows simply do not contain enough information for the model to learn the correlations across 16k columns.

In other words, no matter how good or efficient the model is, 236 observations cannot tell it how each of the 16k columns relates to every other one; with 16k columns there are over 10^8 pairwise relationships to learn.

Your best bet in a situation like this is to look at options other than modeling, such as simulating the data from a known parametric equation or altering the real data with some type of controlled random noise.
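
As an illustration of the controlled-noise option, here is a hypothetical sketch (the helper name, noise scale, and the choice to round and clip are all assumptions; it treats every column as a non-negative integer expression count, as described above):

import numpy as np
import pandas as pd

def augment_with_noise(df, n_copies=5, noise_scale=0.05, seed=0):
    # Hypothetical helper, not part of CTGAN: perturb real rows with
    # multiplicative Gaussian noise to produce extra synthetic rows.
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        noise = rng.normal(loc=1.0, scale=noise_scale, size=df.shape)
        perturbed = df.to_numpy() * noise                  # jitter each count slightly
        perturbed = np.clip(np.rint(perturbed), 0, None)   # keep counts non-negative integers
        copies.append(pd.DataFrame(perturbed.astype(int), columns=df.columns))
    return pd.concat(copies, ignore_index=True)

# Example usage with a hypothetical 236 x 16k expression dataframe:
# synthetic = augment_with_noise(expression_df, n_copies=10)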

nisarkhanatwork commented 3 years ago

Thanks a lot @csala for guiding me. I will look into the options that you suggested. [Update: the 20 TB memory requirement for a large number of columns is not actually there. Once I deleted the duplicate column names, as suggested by @fealho, CTGAN worked well on my 8 GB RAM laptop. Please refer to #93.]

csala commented 3 years ago

As seen in #93, the number of columns was not really a problem, so this can be closed.