reczoo / FuxiCTR

A configurable, tunable, and reproducible library for CTR prediction https://fuxictr.github.io
Apache License 2.0

When reproducing TransAct, the validation data all becomes NaN after the embedding lookup #101

Closed XiaoLongtaoo closed 3 months ago

XiaoLongtaoo commented 3 months ago

The problematic code is located here (lines 157 and 158):

    X = self.get_inputs(inputs)
    feature_emb_dict = self.embedding_layer(X)

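The training phase runs without any problems, but during evaluation the validation data all becomes NaN after passing through this embedding_layer. I checked and found that all of the embedding_layer weights are NaN. What could be going wrong here? (The validation data is loaded correctly; X is normal for the validation set.)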

2024-07-21 15:36:35,644 P7079 INFO Evaluation @epoch 1 - batch 1: 
Warning: NaN value detected in the weights of embedding_layers.userid.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.adgroup_id.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.pid.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.cate_id.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.campaign_id.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.customer.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.brand.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.cms_segid.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.cms_group_id.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.final_gender_code.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.age_level.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.pvalue_level.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.shopping_level.weight in the embedding layer
Warning: NaN value detected in the weights of embedding_layers.occupation.weight in the embedding layer
zhujiem commented 3 months ago

It looks like a configuration error. Could you post your configuration file?

XiaoLongtaoo commented 3 months ago

Here is my dataset configuration:

tiny_seq:
    data_root: ../../data/
    data_format: npz
    train_data: ../../data/tiny_seq/train.npz
    valid_data: ../../data/tiny_seq/valid.npz
    test_data: ../../data/tiny_seq/test.npz

Model configuration file:

Base:
    model_root: './checkpoints/'
    num_workers: 3
    verbose: 1
    early_stop_patience: 2
    pickle_feature_encoder: True
    save_best_only: True
    eval_steps: null
    debug_mode: False
    group_id: null
    use_features: null
    feature_specs: null
    feature_config: null

TransAct_default: # This is a config template
    model: TransAct
    dataset_id: TBD
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    learning_rate: 1.0e-3
    embedding_regularizer: 0
    net_regularizer: 0
    batch_size: 10000
    embedding_dim: 64
    hidden_activations: relu
    dcn_cross_layers: 3
    dcn_hidden_units: [1024, 512, 256]
    mlp_hidden_units: []
    num_heads: 1
    transformer_layers: 1
    transformer_dropout: 0
    dim_feedforward: 512
    net_dropout: 0
    target_item_field: adgroup_id
    sequence_item_field: click_sequence
    first_k_cols: 1
    use_time_window_mask: False
    time_window_ms: 86400000
    concat_max_pool: True
    batch_norm: False
    epochs: 100
    shuffle: True
    seed: 20242025
    monitor: {'AUC': 1, 'logloss': -1}
    monitor_mode: 'max'

TransAct_test:
    model: TransAct
    dataset_id: tiny_seq
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    learning_rate: 1.0e-3
    embedding_regularizer: 0
    net_regularizer: 0
    batch_size: 128
    embedding_dim: 4
    hidden_activations: relu
    dcn_cross_layers: 3
    dcn_hidden_units: [64, 32]
    mlp_hidden_units: []
    num_heads: 1
    transformer_layers: 1
    transformer_dropout: 0
    dim_feedforward: 512
    net_dropout: 0
    target_item_field: adgroup_id
    sequence_item_field: click_sequence
    first_k_cols: 1
    use_time_window_mask: False
    time_window_ms: 86400000
    concat_max_pool: True
    batch_norm: False
    epochs: 1
    shuffle: False
    seed: 20242025
    monitor: {'AUC': 1, 'logloss': -1}
    monitor_mode: 'max'
zhujiem commented 3 months ago

Did you hit the problem when running the test config directly?

zhujiem commented 3 months ago

Thanks for reporting the issue. It occurs because the whole sequence is masked, due to the lack of behavior data in this demo dataset. You can fix it by modifying https://github.com/reczoo/FuxiCTR/blob/main/model_zoo/TransAct/src/TransAct.py#L223

     key_padding_mask = self.adjust_mask(mask) # ensure that not all tokens are masked
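
To illustrate why a fully masked sequence leads to NaN, here is a minimal plain-Python sketch rather than the actual FuxiCTR or PyTorch code. It assumes attention scores at padded positions are set to -inf before a numerically stable softmax. The names `masked_softmax` and `adjust_mask` below are hypothetical helpers written for this example; the library's own `adjust_mask` may be implemented differently.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, padding_mask):
    """Numerically stable softmax over one row of attention scores.

    padding_mask[i] is True where position i is padding and must be ignored.
    """
    masked = [NEG_INF if m else s for s, m in zip(scores, padding_mask)]
    peak = max(masked)
    # If every position is masked, peak is -inf and (-inf) - (-inf) is NaN.
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_mask(padding_mask):
    """Hypothetical helper: if all positions are masked, unmask the last one."""
    if all(padding_mask):
        return padding_mask[:-1] + [False]
    return list(padding_mask)


scores = [0.3, -1.2, 0.8]
all_padding = [True, True, True]  # a sample with no behavior history

broken = masked_softmax(scores, all_padding)              # every weight is NaN
fixed = masked_softmax(scores, adjust_mask(all_padding))  # finite weights
```

With the adjusted mask the softmax has at least one finite score to normalize over, so the attention weights stay finite. Once a NaN reaches the loss, backpropagation can spread it into trainable weights, which would be consistent with the warnings reported above.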