Semi-supervised training on the Tabular dataset

mortezamg63 commented 2 years ago

I am trying to use the code on adult dataset. I preprocess the dataset: generate label encoding for categorical columns and normalize the continuous columns in range [0,1] . Then the cardinality and index of each categorical column + the index of continuous columns are used in embedding layer to creating the semi-self-supervised framework. After training the framework, I used the encoder part for the downstream task (fig2 in the paper). I am getting the accuracy ≈79.5% (mixup on the random hiddent layers). I tried to copy everything from paper.

If I set encoder in the downstream task to None (line 120 in the train.py), the accuracy is ≈82.7% (mixup is applied on the first layer or the random layers of the predictor). I expect to get higher accuracy using trained encoder from the semi-self-supervised framework. But it is the opposite. Could you help me out on this?

I looked at the details in the paper, but I can not figure out the problem. Could you please share the source code for the tabular data?

sajaddarabi commented 2 years ago

Have you normalized the continuous columns? I used the Z normalization, and converted the categorical columns to one hot encodings. The "education-nums" column is also dropped from the dataset.

Here's the config file, but it won't run with this code base as the tabular dataset pipeline is not included (I'll see if I could push it EOW):


    "name": "adult_mixup",
    "n_gpu": 1,
    "unsupervised_arch": {
        "type": "AE",
        "args": {
            "input_dim": 102,
            "hidden_dim": [102],
            "mixup_method": "latent",
            "temperature": 0.1,
            "projection_head": "linear",
            "projection_dim": 102,
            "embed": false,
            "mixup_alpha": 0.2,
            "mixup_dist": "uniform",
            "mixup_n": -1
        }
    },
    "unsupervised_lr_scheduler": {
        "type": "StepLR",
        "args": {
            "step_size": 10,
            "gamma": 0.1
        }
    },
    "unsupervised_metrics": [] ,
    "unsupervised_optimizer": {
        "type": "RMSprop",
        "args": {
            "lr": 0.001,
            "alpha": 0.9,
            "momentum": 0.0,
            "eps": 1.0e-7
        }
    },
    "unsupervised_data_loader": {
        "type": "TabularDataLoader",
        "args": {
            "data_dir": "data/adult",
            "batch_size": 128,
            "labeled_ratio": 0.1,
            "shuffle": true,
            "method": "semisupervised",
            "validation_split": 0.10,
            "n_split": 0,
            "num_workers": 2,
            "preprocessing_dict": {
                "order": [
                    "standard_scaler", "convert_to_onehot"
                ],
                "standard_scaler": {
                    "cols": "continuous"
                },
                "convert_to_onehot": {
                    "cols": "categorical" 
                }
            }
        }
    },
    "unsupervised_trainer": {
        "validation_target": true, 
        "type": "Trainer",
        "module_name": "trainer",
        "epochs": 20,
        "save_dir": "saved/",
        "save_period": -1,
        "verbosity": 2,
        "monitor": "min val_loss",
        "early_stop": 10,
        "save_single_checkpoint": true,
        "save_only_best": true,
        "tensorboard": true,
        "log_step": 50,
        "pseudolabeling": {
            "epoch_start": 10,
            "f": 2,
            "args": {
                "k": 5, 
                "max_iter": 20,
                "alpha": 0.99 }
        }
    },
    "supervised_arch": {
        "type": "MLP",
        "args": {
            "input_dim": 14,
            "hidden_dim": [100, 100],
            "num_classes": 2,
            "mixup_method": "",
            "mixup_alpha": 2.0,
            "mixup_dist": "alpha",
            "mixup_n": -1,
            "K": 5,
            "consistency_method": "",
            "mixup_consistency_dist": "uniform",
            "mixup_consistency_alpha": 1.0,
            "mixup_consistency_n": -1,
            "fine_tune": false 
        }
    },
    "supervised_optimizer": {
        "type": "Adam",
        "args": {
            "lr": 0.0010939661837841578,
            "weight_decay": 0,
            "amsgrad": false 
        }
    },
    "supervised_metrics": ["accuracy", "roc_auc", "pr_auc"],
    "supervised_lr_scheduler": {
        "type": "StepLR",
        "args": {
            "step_size": 30,
            "gamma": 0.1
        }
    },
    "supervised_data_loader": {
        "type": "TabularDataLoader",
        "args": {
            "data_dir": "data/adult",
            "batch_size": 128,
            "labeled_ratio": 0.1,
            "shuffle": true,
            "validation_split": 0.10,
            "method": "semisupervised",
            "n_split": 0,
            "num_workers": 2,
            "preprocessing_dict": {
                "order": [
                    "standard_scaler", "convert_to_onehot"
                ],
                "standard_scaler": {
                    "cols": "continuous"
                },
                "drop_col": {
                    "cols": ["education-nums"]
                },
                "convert_to_onehot": {
                    "cols": "categorical" 
                }
            }
        }
    },
    "loss_args":{
        "cont_loss_type": "mse",
        "contrastive_weight": 0.75,
        "l2_weight_decoder": 0.01,
        "mixup_weight_decoder": 0.00,
        "recon_weight": 0.25,
        "vime_consistency_weight": 0.0,
        "mixup_consistency_weight": 0.0,
        "unlabeled_classification_weight": 0.10
    },
    "supervised_trainer": {
        "type": "Trainer",
        "module_name": "trainer",
        "epochs": 50,
        "save_dir": "saved/",
        "save_period": -1,
        "verbosity": 2,
        "monitor": "max val_accuracy",
        "early_stop": 50,
        "save_single_checkpoint": true,
        "save_only_best": true,
        "tensorboard": true,
        "log_step": 25 
    },
    "unsupervised_model_load_best": false,
    "supervised_model_load_best": true
}```

mortezamg63 commented 2 years ago

Hi Sajad, I really appreciate for you response. I normalized the continuous columns. But I have not converted the categorical data to one-hot encoding. I was trying to use embedding layers and label encoding. I assumed that the embedding layer works with label encoding values.

I will check it out by one-hot encoding, removing the "education-nums" column, and the config file. It will be great if you can push the code. Thank you very much. A quick question: When do you use embedding layer?

sajaddarabi commented 2 years ago

Embedding layers are used for categorical columns, though for the adult dataset, preprocessing the categorical columns to one hot label encoding helps performance. This is done in other methods as well, and we do so as well for the sake of comparison.

mortezamg63 commented 2 years ago

Thanks for your answer. I will work with on-hot encoding.

sajaddarabi / ContrastiveMixup

Semi-supervised training on the Tabular dataset #5