jrzaurin / pytorch-widedeep

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch
Apache License 2.0

save_best_only error and NaN during training #172

Closed taokz closed 10 months ago

taokz commented 11 months ago

Hi

Thank you for your awesome repo. I encountered two issues:

  1. save_best_only=True raises AttributeError: 'ModelCheckpoint' object has no attribute 'best_epoch'. The corresponding code snippet is:
model_checkpoint = ModelCheckpoint(
    filepath=f'./checkpoints/{args.model}/{args.country}/chkp',
    save_best_only=True,
    max_save=1,
)

callbacks = [
    LRHistory(n_epochs=10),
    EarlyStopping(patience=10),
    model_checkpoint,
]
metrics = [auroc]

if args.model == 'tabnet':
    trainer = Trainer(
        model,
        objective="binary",
        optimizers=torch.optim.Adam(model.parameters(), lr=0.01),
        callbacks=callbacks,
        metrics=metrics,
        verbose=verbose,
        seed=args.seed,
    )
  2. For transformers, I load the data without the categorical features and use the following code for training, but I get a NaN loss. Could you provide some insights about that?
            if args.model == 'tab_transformer':
                deeptabular = TabTransformer(
                    column_idx=column_idx,
                    cat_embed_input=None,
                    continuous_cols=continuous_cols,
                    embed_continuous=True,
                    n_blocks=4,
                )

            model = WideDeep(deeptabular=deeptabular)

            wide_opt = None
            deep_opt = torch.optim.Adam(model.deeptabular.parameters(), lr=0.01)
            wide_sch = None
            deep_sch = torch.optim.lr_scheduler.StepLR(deep_opt, step_size=5)

            optimizers = {"deeptabular": deep_opt}
            schedulers = {"deeptabular": deep_sch}
            initializers = {"deeptabular": XavierNormal}

            trainer = Trainer(
                model,
                objective="binary",
                optimizers=optimizers,
                lr_schedulers=schedulers,
                initializers=initializers,
                callbacks=callbacks,
                metrics=metrics,
                verbose=verbose,
                seed=args.seed,
            )

I appreciate your help in advance! Thanks!

jrzaurin commented 11 months ago

@5uperpalo, can you have a look at this? 👆🏼

5uperpalo commented 11 months ago

sure, I'll look into it and respond by tomorrow lunch

5uperpalo commented 11 months ago

@taokz both issues are likely connected to the data you are using; can you share a sample of the data? To anonymize it, you may change the column names and use the simple .sample() method on the dataframe ...
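
For example, a minimal sketch of one way to anonymize a dataframe before sharing (the column names and sample size are made up for illustration):

import pandas as pd

# toy stand-in for the private dataframe
df = pd.DataFrame({"income": [1.2, 3.4, 5.6, 7.8], "age": [23, 45, 67, 34], "label": [0, 1, 0, 1]})

# rename the columns to generic names and share only a small random sample
df_anon = df.rename(columns={c: f"col_{i}" for i, c in enumerate(df.columns)})
df_anon = df_anon.sample(n=2, random_state=42)
df_anon.to_csv("anonymized_sample.csv", index=False)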

What I did:

Next steps:

  1. try a clean install and update your libraries with pip install -r requirements.txt -U
  2. check that you have the correct data types in your dataframe (i.e. the values in X_tab you are passing to the Trainer): are they correct, e.g. no strings, NAs, or objects? (see the sketch after this list)
  3. ISSUE num.1: when you use verbose=True, do you get a non-NA loss? best_epoch is saved only if the monitor is working and the monitored metric improves (by default the validation loss); do you see non-NA verbose output?
  4. ISSUE num.2: again, this must be related to the data that you are passing into the Trainer; if it's not possible to share the data, could you please try to compare it to the provided example notebook?
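
As a quick sanity check for point 2, a minimal sketch (the dataframe and X_tab below are made-up stand-ins for your data and the array you pass to trainer.fit):

import numpy as np
import pandas as pd

# toy stand-ins for the real dataframe and the array passed to the Trainer
df = pd.DataFrame({"f1": [0.1, 0.2, np.nan], "f2": [1, 2, 3]})
X_tab = df.values

# dtypes should all be numeric, with no object/string columns
print(df.dtypes)

# missing values per column
print(df.isna().sum())

# non-finite values in the array actually passed to trainer.fit
print(np.isnan(X_tab).any(), np.isinf(X_tab).any())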

ibowennn commented 11 months ago

Perhaps the reason for the issue is that there are NaN values in your data. I faced a similar problem, but it was resolved when I used dropna() on my data.

taokz commented 11 months ago

> Perhaps the reason for the issue is that there are NaN values in your data. I faced a similar problem, but it was resolved when I used dropna() on my data.

@ibowennn Thank you for your reminder. However, I have checked my data and there are no NaN values.

taokz commented 11 months ago

@5uperpalo I really appreciate your quick reply.

For issue 1, I noticed that I was using the fit() method incorrectly:

trainer.fit(
    X_tab=X_num_train,
    target=y_train,
    X_tab_val=X_num_valid, # there is no such argument in the base_trainer
    target_val=y_valid, # there is no such argument in the base_trainer
    n_epochs=2,
    batch_size=1024,
)

I modified it to be the following, and it works (for TabNet).

trainer.fit(
    X_tab=X_tab,
    target=target,
    n_epochs=2,
    batch_size=1024,
    val_split=0.2
)

However, I still cannot solve issue 2: I get a NaN loss (for transformers such as tab_transformer). I guess it is because I pass cat_embed_input=None, since my data only has continuous features. Is it required to set cat_embed_input != None for transformer-based models? The example notebook link may be wrong. Do you mean This?

BTW, I am sorry but the data is private and I cannot share it here. You can think of it as a table containing only numerical values, with no NaNs.

5uperpalo commented 11 months ago

I am sorry for the late response @taokz, I had some personal issues holding me back... Last time I did not upload the latest version of the troubleshooting notebook, and yes, you were right, I used the wrong link. I have updated the troubleshooting notebook I posted earlier; there, in the section ISSUE num.2, you can see that TabTransformer can work without categorical features.

As you are working with private/proprietary data, I would suggest the following:

  1. in the code you can see that if you set the objective to binary, the loss defaults to BCEWithLogitsLoss, i.e. here and here
    • this is the same loss as used in the TabNet model in the issue you resolved earlier
  2. try to use import pdb; pdb.set_trace() inside trainer.fit() to debug what ground truth and predicted values you are sending to the loss function, e.g. use pdb.set_trace() here or here
  3. maybe the model has NA/infinity values on the output, so try to (i) normalize the columns with a Scaler from scikit-learn, or simply by enabling scaling in the Preprocessor, i.e. the parameter here; (ii) use a different initializer than XavierNormal (a sketch combining both ideas follows below)
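
A minimal sketch of point 3, reusing only calls already shown in this thread, on a made-up all-continuous dataframe; KaimingNormal is simply an assumed alternative initializer from pytorch_widedeep.initializers:

import numpy as np
import pandas as pd
import torch

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabTransformer, WideDeep
from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.initializers import KaimingNormal  # assumption: any non-Xavier initializer

# made-up all-continuous data standing in for the private dataset
df = pd.DataFrame(np.random.rand(256, 4), columns=["c1", "c2", "c3", "c4"])
df["label"] = np.random.randint(0, 2, 256)

continuous_cols = ["c1", "c2", "c3", "c4"]
tab_preprocessor = TabPreprocessor(
    continuous_cols=continuous_cols,
    cols_to_scale=continuous_cols,  # (i) scale the continuous columns
    for_transformer=True,
)
X_tab = tab_preprocessor.fit_transform(df)

deeptabular = TabTransformer(
    column_idx=tab_preprocessor.column_idx,
    continuous_cols=continuous_cols,
    embed_continuous=True,
    n_blocks=4,
)
model = WideDeep(deeptabular=deeptabular)

trainer = Trainer(
    model,
    objective="binary",
    optimizers=torch.optim.Adam(model.parameters(), lr=0.01),
    initializers={"deeptabular": KaimingNormal},  # (ii) a different initializer than XavierNormal
)
trainer.fit(X_tab=X_tab, target=df["label"].values, n_epochs=2, batch_size=64, val_split=0.2)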

Note: Please let me know if any of this helped. It could help other people, including us, if we come across the same issue.

jrzaurin commented 10 months ago

hi @taokz, here is fully functional code using a dataset with ALL continuous cols. Maybe you could use this as a starting point to fix the issue you are experiencing:

import numpy as np

from pytorch_widedeep import Trainer
from pytorch_widedeep.models import TabTransformer, WideDeep
from pytorch_widedeep.datasets import load_california_housing
from pytorch_widedeep.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
)
from pytorch_widedeep.preprocessing import TabPreprocessor

if __name__ == "__main__":

    df = load_california_housing(as_frame=True)
    df["location_x"] = np.cos(df.Latitude) * np.cos(
        df.Longitude
    )
    df["location_y"] = np.cos(df.Longitude) * np.sin(
        df.Longitude
    )
    df.drop(["Latitude", "Longitude"], axis=1, inplace=True)

    target_col = "MedHouseVal"
    target = df[target_col].values
    continuous_cols = [c for c in df.columns if c != target_col]
    tab_preprocessor = TabPreprocessor(
        continuous_cols=continuous_cols,
        cols_to_scale=continuous_cols,
        for_transformer=True,
    )
    X_tab = tab_preprocessor.fit_transform(df)

    tab_transformer = TabTransformer(
        column_idx=tab_preprocessor.column_idx,
        continuous_cols=continuous_cols,
        embed_continuous=True,
        input_dim=8,
        n_blocks=1,
        n_heads=2,
    )
    model = WideDeep(deeptabular=tab_transformer)

    callbacks = [
        EarlyStopping(patience=2),
        ModelCheckpoint(filepath="model_weights/wd_out"),
    ]
    trainer = Trainer(
        model,
        objective="regression",
        callbacks=callbacks,
    )
    trainer.fit(
        X_tab=X_tab,
        target=target,
        n_epochs=10,
        batch_size=128,
        val_split=0.2,
    )
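
As a follow-up tying this back to issue 1: the callbacks in the script above could also use the save_best_only setup from the original post (a sketch meant to drop into the same script; note that best_epoch is only set when the monitored validation metric improves, which is why validation data via val_split matters):

callbacks = [
    EarlyStopping(patience=2),
    ModelCheckpoint(
        filepath="model_weights/wd_out",
        save_best_only=True,
        max_save=1,
    ),
]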

taokz commented 10 months ago

@5uperpalo @jrzaurin Thank you for your detailed guidance! I have recently been focusing on another project, so I did not respond in time. I appreciate your time and efforts!