lschmiddey / deep_tabular_augmentation

MIT License

About Deep Tabular Augmentation for Regression Tasks #18

Open Yesim7 opened 1 year ago

Yesim7 commented 1 year ago

Hello Lasse (@lschmiddey ),

I am Yesim. I read your blog and want to ask something about "Deep Tabular Augmentation for Regression Tasks". I didn't understand some of your steps, for example DataBunch and Autoencoder. Are they running with random parameters? And how did you determine the value you gave to VAE_arch? Could you help me? Could you suggest a resource that explains this in detail?

Sincerely,

lschmiddey commented 1 year ago

Hi Yesim,

I suggest (if you haven't already) reading the last of my blog posts on Deep Tabular Augmentation: https://lschmiddey.github.io/fastpages_/2022/04/23/DataAugmentation_for_Regression_Tasks.ipynb.html

But I will also try to explain it here. The DataBunch class is basically just a helper class for bundling PyTorch datasets and PyTorch dataloaders. You could also provide your own dataloader to the (model) class Autoencoder; the only requirement is that it is a PyTorch dataloader, otherwise the training will not work. How the DataBunch is defined can be found here: https://github.com/lschmiddey/deep_tabular_augmentation/blob/main/deep_tabular_augmentation/dataloaders.py -> I gave the DataBunch reasonable default values (like a batch size (bs) of 128), but you can easily change this:

data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))
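To make the idea concrete, here is a minimal sketch of what such a bundle boils down to in plain PyTorch. MyDataBunch is a hypothetical stand-in for illustration, not the actual class from this repo:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy tabular data: 500 rows, 2 features, 1 continuous target
X = torch.randn(500, 2)
y = torch.randn(500, 1)

train_ds = TensorDataset(X, y)
valid_ds = TensorDataset(torch.randn(100, 2), torch.randn(100, 1))

train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=128)

# Conceptually, a DataBunch just pairs the two loaders together
# so the training loop can find them in one place.
class MyDataBunch:
    def __init__(self, train_dl, valid_dl):
        self.train_dl, self.valid_dl = train_dl, valid_dl

data = MyDataBunch(train_dl, valid_dl)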

When it comes to the model, the architecture I usually use is VAE_arch = [50, 12, 12]. However, this depends on the data you have, and you can play around with it. Usually, with more variables you can also try higher values here, but I found this architecture to be a good default.
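For illustration, here is a rough sketch of how such an architecture list could translate into a VAE in plain PyTorch. This is a hypothetical reading of [50, 12, 12] as two hidden widths followed by the latent size, not the exact model from this repo:

import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    # Hypothetical sketch: arch = [50, 12, 12] read as
    # hidden width 1, hidden width 2, latent dimension.
    def __init__(self, n_features, arch=(50, 12, 12)):
        super().__init__()
        h1, h2, latent = arch
        self.encoder = nn.Sequential(
            nn.Linear(n_features, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
        )
        self.mu = nn.Linear(h2, latent)
        self.logvar = nn.Linear(h2, latent)
        self.decoder = nn.Sequential(
            nn.Linear(latent, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, n_features),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.decoder(z), mu, logvar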

When it comes to understanding VAE a bit better, I suggest this article: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

Usually, VAEs are used for images (for example, a VAE is part of the Stable Diffusion model), but the general idea can also be applied to tabular data.
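As a pointer alongside that article: the objective a VAE minimizes (the negative ELBO) looks like this for continuous tabular features. This is the standard textbook loss, not code from this repo:

import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term: how well the decoder rebuilds the input
    recon_loss = F.mse_loss(recon, x, reduction='sum')
    # KL term: keeps the latent distribution close to a standard normal
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld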

Cheers, Lasse

Yesim7 commented 1 year ago

Hello again Lasse (@lschmiddey) ,

Thank you for the links you shared. There are still a few things on my mind. I have a physics experiment dataset with two independent variables and one dependent variable. The first independent variable consists of positive integers between 0 and 10. The second consists of floating-point numbers between 0 and 1. I also have the variable y, which depends on them. For example:

x1, x2, y
1, 0.02, 150.47
1, 0.3, 147.60
1, 0.4, 140.5
...
10, 0.01, 162.8
10, 0.4, 145.3
10, 0.5, 178.3

I have a dataset like the above.

I want to get the y value for all the values in between using the deep_tabular_augmentation module you shared. I expected the generated values to lie between 1 and 10 for the first column and between 0.02 and 0.5 for the second column, and to be positive. However, when I import the module and run the code, I get negative and seemingly random values. What I actually want to do is something like interpolation. I hope I was able to explain. I would be grateful if you could help me with this.

Sincerely,

lschmiddey commented 1 year ago

I will try to reproduce it and inform you about my findings :)

Yesim7 commented 1 year ago

Thank you so much, looking forward to your reply :)

Yesim7 commented 1 year ago

Hi @lschmiddey, did you have time to try it? :)